[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-TsinghuaC3I--Awesome-RL-for-LRMs":3,"tool-TsinghuaC3I--Awesome-RL-for-LRMs":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",153609,2,"2026-04-13T11:34:59",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 
### markitdown (microsoft/markitdown)

Stars: 93,400 · Difficulty: 2 · Last commit: 2026-04-06 · Tags: Plugin, Dev Framework

MarkItDown is a lightweight Python utility from Microsoft's AutoGen team, built to convert files of many kinds into Markdown efficiently. It parses PDF, Word, Excel, PowerPoint, images (with OCR), audio (with speech transcription), HTML, and even YouTube links, accurately extracting key structure such as headings, lists, tables, and links.

As AI applications spread, large language models (LLMs) handle text well but cannot read complex binary office documents directly. MarkItDown fills exactly that gap: it turns unstructured or semi-structured files into Markdown, a format models understand "natively" with high token efficiency, making it an ideal bridge between local files and AI analysis pipelines. It also ships an MCP (Model Context Protocol) server that plugs straight into LLM applications such as Claude Desktop.

The tool is especially suited to developers, data scientists, and AI researchers, particularly anyone building retrieval-augmented generation (RAG) systems, running batch text analysis, or wanting an AI assistant to "read" local files directly. The output is reasonably human-readable too, but its core strength lies in serving machine…
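For developers the Python API is small; here is a short usage sketch based on the interface the project documents (the input file name is a placeholder):

```python
from markitdown import MarkItDown  # pip install "markitdown[all]"

# Convert a local file to Markdown; "report.pdf" is a placeholder path.
md = MarkItDown(enable_plugins=False)  # built-in converters only
result = md.convert("report.pdf")      # also handles Office docs, images, audio, HTML
print(result.text_content)             # Markdown ready for an LLM or RAG pipeline
```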
## TsinghuaC3I/Awesome-RL-for-LRMs

> A Survey of Reinforcement Learning for Large Reasoning Models

Awesome-RL-for-LRMs is an open-source survey repository maintained by a Tsinghua University team, focused on the frontier topic of reinforcement learning for large reasoning models. As demand surges for large language models on complex logical-reasoning tasks, improving reasoning ability through reinforcement learning (RL) has become a key challenge. The repository systematically organizes the research landscape, covering core dimensions from reward design (e.g., generative rewards, dense rewards) and policy-optimization algorithms (e.g., critic-based and critic-free methods) through to frontier-model applications.

It addresses the difficulty researchers and developers face, amid a flood of fragmented papers, in quickly grasping the technical threads, the taxonomy, and the latest breakthroughs. Through structured paper lists and clear category navigation, Awesome-RL-for-LRMs helps users efficiently locate concrete results on key techniques such as unsupervised rewards, online training, and multi-agent collaboration.

The resource is well suited to AI researchers, algorithm engineers, and anyone interested in how large models reason. Its distinctive strength is that it is not a static paper list: it is updated frequently, promptly adding the newest open-source projects and theoretical advances such as SSRL (agentic search without an external search engine) and MARTI (a multi-agent reinforced training framework), making it a practical guide to how large models' reasoning abilities are evolving.

<div align="center">

<img src="https://oss.gittoolsai.com/images/TsinghuaC3I_Awesome-RL-for-LRMs_readme_e10d984f3e5a.png" style="width: 70%;"/>

## A Survey of Reinforcement Learning for Large Reasoning Models

[Awesome](https://github.com/sindresorhus/awesome) · [Paper (arXiv:2509.08827)](https://arxiv.org/abs/2509.08827) · [Awesome-RL-for-LRMs on GitHub](https://github.com/TsinghuaC3I/Awesome-RL-Reasoning-Recipes) · [HF Paper](https://huggingface.co/papers/2509.08827) · [Twitter/X](https://x.com/OkhayIea/status/1965989894163235111)

</div>

> We welcome everyone to open an issue for any related work we haven't discussed, and we'll try to address it in the next release!

## 🎉 News

- **[2025-11-05]** 🔥 Excited to release our paper list about **Memory for Agents**, covering breakthroughs in context management and learning from experience that power self-improving AI agents. Check it out: [GitHub](https://github.com/TsinghuaC3I/Awesome-Memory-for-Agents)
- **[2025-10]** 🎉 Honored to give talks at [BAAI](https://event.baai.ac.cn/activities/961), [Qingke Talk](https://qingkeai.online/archives/0h3Cm8Bi), and Tencent Wiztalk! Here are the [slides](Survey@RL4LRM-v1.pdf).
- **[2025-09-18]** 🎉 We updated the full list of papers to follow the survey's category structure!
- **[2025-09-12]** 🎉 Our survey was ranked **#1 Paper of the Day** on 🤗 [Hugging Face Daily Papers](https://huggingface.co/papers/2509.08827)!
- **[2025-09-11]** 🔥 Excited to release our **RL for LRMs Survey**! We'll be updating the full list of papers with a new category structure soon. Check it out: [Paper](https://huggingface.co/papers/2509.08827).
- **[2025-08-15]** 🔥 Introducing **SSRL**: an investigation of agentic search RL without reliance on an external search engine. Check it out: [GitHub](https://github.com/TsinghuaC3I/SSRL) and [Paper](https://arxiv.org/abs/2508.10874).
- **[2025-05-27]** 🔥 Introducing **MARTI**: a framework for LLM-based multi-agent reinforced training and inference. Check it out: [GitHub](https://github.com/TsinghuaC3I/MARTI).
- **[2025-04-23]** 🔥 Introducing **TTRL**: an open-source solution for online RL on data without ground-truth labels, especially test data; a sketch of its majority-voting reward idea follows this list. Check it out: [GitHub](https://github.com/PRIME-RL/TTRL) and [Paper](https://arxiv.org/abs/2504.16084).
- **[2025-03-20]** 🔥 We are excited to introduce a collection of papers and projects on RL for reasoning models!
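To make the TTRL item concrete: with no ground-truth labels available, the paper's core idea is to treat the majority answer across a group of sampled rollouts as a pseudo-label and reward agreement with it. A minimal illustrative sketch, not the authors' implementation:

```python
from collections import Counter

def majority_vote_rewards(sampled_answers: list[str]) -> list[float]:
    """Label-free reward in the spirit of TTRL: the most common answer
    across the rollouts serves as a pseudo-label, and each rollout is
    rewarded for agreeing with it. Illustrative sketch only."""
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if answer == pseudo_label else 0.0 for answer in sampled_answers]

# Eight rollouts on one unlabeled test question:
print(majority_vote_rewards(["42", "42", "41", "42", "7", "42", "42", "41"]))
# -> 1.0 for the majority answer "42", 0.0 otherwise
```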
## 🎈 Citation

If you find this survey helpful, please cite our work:

```bibtex
@article{zhang2025survey,
  title={A survey of reinforcement learning for large reasoning models},
  author={Zhang, Kaiyan and Zuo, Yuxin and He, Bingxiang and Sun, Youbang and Liu, Runze and Jiang, Che and Fan, Yuchen and Tian, Kai and Jia, Guoli and Li, Pengfei and others},
  journal={arXiv preprint arXiv:2509.08827},
  year={2025}
}
```

## 📖 Contents
- [A Survey of Reinforcement Learning for Large Reasoning Models](#a-survey-of-reinforcement-learning-for-large-reasoning-models)
- [🎉 News](#-news)
- [🎈 Citation](#-citation)
- [📖 Contents](#-contents)
- [🗺️ Overview](#️-overview)
- [📄 Paper List](#-paper-list)
  - [Frontier Models](#frontier-models)
  - [Reward Design](#reward-design)
    - [Generative Rewards](#generative-rewards)
    - [Dense Rewards](#dense-rewards)
    - [Unsupervised Rewards](#unsupervised-rewards)
    - [Rewards Shaping](#rewards-shaping)
  - [Policy Optimization](#policy-optimization)
    - [Policy Gradient Objective](#policy-gradient-objective)
    - [Critic-based Algorithms](#critic-based-algorithms)
    - [Critic-Free Algorithms](#critic-free-algorithms)
    - [Off-policy Optimization](#off-policy-optimization)
    - [Off-policy Optimization (Exp replay)](#off-policy-optimization-exp-replay)
    - [Regularization Objectives](#regularization-objectives)
  - [Sampling Strategy](#sampling-strategy)
    - [Dynamic and Structured Sampling](#dynamic-and-structured-sampling)
    - [Sampling Hyper-Parameters](#sampling-hyper-parameters)
  - [Training Resource](#training-resource)
    - [Static Corpus (Code)](#static-corpus-code)
    - [Static Corpus (STEM)](#static-corpus-stem)
    - [Static Corpus (Math)](#static-corpus-math)
    - [Static Corpus (Agent)](#static-corpus-agent)
    - [Static Corpus (Mix)](#static-corpus-mix)
    - [Dynamic Environment (Rule-based)](#dynamic-environment-rule-based)
    - [Dynamic Environment (Code-based)](#dynamic-environment-code-based)
    - [Dynamic Environment (Game-based)](#dynamic-environment-game-based)
    - [Dynamic Environment (Model-based)](#dynamic-environment-model-based)
    - [Dynamic Environment (Ensemble-based)](#dynamic-environment-ensemble-based)
    - [RL Infrastructure (Primary)](#rl-infrastructure-primary)
    - [RL Infrastructure (Secondary)](#rl-infrastructure-secondary)
  - [Applications](#applications)
    - [Coding Agent](#coding-agent)
    - [Search Agent](#search-agent)
    - [Browser-Use Agent](#browser-use-agent)
    - [DeepResearch Agent](#deepresearch-agent)
    - [GUI\&Computer Agent](#guicomputer-agent)
    - [Recommendation Agent](#recommendation-agent)
    - [Agent (Others)](#agent-others)
    - [Code Generation](#code-generation)
    - [Software Engineering](#software-engineering)
    - [Multimodal Understanding](#multimodal-understanding)
    - [Multimodal Generation](#multimodal-generation)
    - [Robotics Tasks](#robotics-tasks)
    - [Multi-Agent Systems](#multi-agent-systems)
    - [Scientific Tasks](#scientific-tasks)
- [🌟 Acknowledgment](#-acknowledgment)
- [✨ Star History](#-star-history)

## 🗺️ Overview

Our survey provides a comprehensive examination of **Reinforcement Learning for Large Reasoning Models**.

<p align="center">
   <img src="https://oss.gittoolsai.com/images/TsinghuaC3I_Awesome-RL-for-LRMs_readme_ad4973300f6a.png" alt="Overview of RL for LRMs Survey" style="width: 100%;">
</p>

We organize the survey into five main sections:

1. <u>Foundational Components:</u> Reward design, policy optimization, and sampling strategies
2. <u>Foundational Problems:</u> Key debates and challenges in RL for LRMs
3. <u>Training Resources:</u> Static corpora, dynamic environments, and infrastructure
4. <u>Applications:</u> Real-world implementations across diverse domains
5. <u>Future Directions:</u> Emerging research opportunities and challenges
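One recurring notion in the taxonomy below deserves a concrete anchor: the "critic-free" algorithms under Policy Optimization (GRPO and its many variants in the tables) replace a learned value baseline with group-relative normalization of rollout rewards. A minimal illustrative sketch, not any single paper's implementation:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Critic-free advantage estimation in the GRPO style: each rollout's
    reward is normalized by the mean and standard deviation of its own
    sampling group, so no value network (critic) is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against an all-equal group
    return [(r - mean) / std for r in rewards]

# Verifier rewards for six rollouts of the same prompt:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0]))
# -> +1.0 for the correct rollouts, -1.0 for the incorrect ones
```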
## 📄 Paper List

### Frontier Models

| Date | Name | Title | Paper | GitHub |
|:-:|:-:|:-|:-:|:-:|
| 2025-08 | `Intern-S1` | Intern-S1: A Scientific Multimodal Foundation Model | [Paper](https://arxiv.org/abs/2508.15763v1) | [GitHub](https://github.com/InternLM/Intern-S1) |
| 2025-08 | `GLM-4.5` | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models | [Paper](https://arxiv.org/abs/2508.06471) | [GitHub](https://github.com/zai-org/GLM-4.5) |
| 2025-08 | `gpt-oss` | gpt-oss-120b & gpt-oss-20b Model Card | [Paper](https://arxiv.org/abs/2508.10925) | [GitHub](https://github.com/openai/gpt-oss) |
| 2025-08 | `InternVL3.5` | InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency | [Paper](https://arxiv.org/abs/2508.18265) | [GitHub](https://github.com/OpenGVLab/InternVL) |
| 2025-07 | `Kimi K2` | Kimi K2: Open Agentic Intelligence | [Paper](https://arxiv.org/abs/2507.20534) | [GitHub](https://github.com/MoonshotAI/Kimi-K2) |
| 2025-07 | `Step 3` | Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding | [Paper](https://arxiv.org/abs/2507.19427) | [GitHub](https://github.com/stepfun-ai/Step3) |
| 2025-07 | `GLM-4.1V-Thinking` | GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | [Paper](https://arxiv.org/abs/2507.01006) | [GitHub](https://github.com/zai-org/GLM-V) |
| 2025-07 | `Skywork-R1V3` | Skywork-R1V3 Technical Report | [Paper](https://arxiv.org/abs/2507.06167) | [GitHub](https://github.com/SkyworkAI/Skywork-R1V) |
| 2025-07 | `GLM-4.5V` | GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | [Paper](https://arxiv.org/abs/2507.01006) | [GitHub](https://github.com/zai-org/GLM-V) |
| 2025-06 | `Magistral` | Magistral | [Paper](https://arxiv.org/abs/2506.10910) | - |
| 2025-06 | `MiniMax-M1` | MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention | [Paper](https://arxiv.org/abs/2506.13585) | [GitHub](https://github.com/MiniMax-AI/MiniMax-M1) |
| 2025-05 | `MiMo` | MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining | [Paper](https://arxiv.org/abs/2505.07608) | [GitHub](https://github.com/XiaomiMiMo/MiMo) |
| 2025-05 | `Qwen3` | Qwen3 Technical Report | [Paper](https://arxiv.org/abs/2505.09388) | [GitHub](https://github.com/QwenLM/Qwen3) |
| 2025-05 | `Llama-Nemotron-Ultra` | Llama-Nemotron: Efficient Reasoning Models | [Paper](https://arxiv.org/abs/2505.00949) | [GitHub](https://github.com/NVIDIA/Megatron-LM) |
| 2025-05 | `INTELLECT-2` | INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning | [Paper](https://arxiv.org/abs/2505.07291) | - |
| 2025-05 | `Hunyuan-TurboS` | Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought | [Paper](https://arxiv.org/abs/2505.15431) | [GitHub](https://github.com/Tencent/Hunyuan-TurboS) |
| 2025-05 | `Skywork OR-1` | Skywork Open Reasoner 1 Technical Report | [Paper](https://arxiv.org/abs/2505.22312) | [GitHub](https://github.com/SkyworkAI/Skywork-OR1) |
| 2025-04 | `Phi-4 Reasoning` | Phi-4-reasoning Technical Report | [Paper](https://arxiv.org/abs/2504.21318) | - |
| 2025-04 | `Skywork-R1V2` | Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning | [Paper](https://arxiv.org/abs/2504.16656) | [GitHub](https://github.com/SkyworkAI/Skywork-R1V) |
| 2025-04 | `InternVL3` | InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models | [Paper](https://arxiv.org/abs/2504.10479) | [GitHub](https://github.com/OpenGVLab/InternVL) |
| 2025-03 | `ORZ` | Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model | [Paper](https://arxiv.org/abs/2503.24290) | [GitHub](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero) |
| 2025-01 | `DeepSeek-R1` | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | [Paper](https://arxiv.org/abs/2501.12948) | [GitHub](https://github.com/deepseek-ai/DeepSeek-R1) |
| - | `QwQ` | QwQ-32B: Embracing the Power of Reinforcement Learning | [Blog](https://qwenlm.github.io/blog/qwq-32b/) | [GitHub](https://github.com/QwenLM/QwQ) |
| - | `Seed-OSS` | Seed-OSS Open-Source Models | [Paper](https://github.com/ByteDance-Seed/seed-oss) | [GitHub](https://github.com/ByteDance-Seed/seed-oss) |
| - | `ERNIE-4.5-Thinking` | ERNIE 4.5 Technical Report | [Blog](https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf) | - |

### Reward Design

#### Generative Rewards

| Date | Name | Title | Paper | GitHub |
|:-:|:-:|:-|:-:|:-:|
| 2025-08 | `CAPO` | CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment | [Paper](https://arxiv.org/abs/2508.02298) | [GitHub](https://github.com/andyclsr/CAPO) |
| 2025-08 | `CompassVerifier` | CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward | [Paper](https://arxiv.org/abs/2508.03686) | [GitHub](https://github.com/open-compass/CompassVerifier) |
| 2025-08 | `Cooper` | Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models | [Paper](https://arxiv.org/abs/2508.05613) | [GitHub](https://github.com/zju-real/cooper) |
| 2025-08 | `ReviewRL` | ReviewRL: Towards Automated Scientific Review with RL | [Paper](https://arxiv.org/abs/2508.10308) | [GitHub](https://github.com/TsinghuaC3I/MARTI) |
| 2025-08 | `Rubicon` | Reinforcement Learning with Rubric Anchors | [Paper](https://arxiv.org/abs/2508.12790) | - |
| 2025-08 | `RuscaRL` | Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning | [Paper](https://arxiv.org/abs/2508.16949) | - |
| 2025-07 | `OMNI-THINKER` | OMNI-THINKER: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards | [Paper](https://arxiv.org/abs/2507.14783) | - |
| 2025-07 | `URPO` | URPO: A Unified Reward & Policy Optimization Framework for Large Language Models | [Paper](https://arxiv.org/abs/2507.17515) | - |
| 2025-07 | `RaR` | Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains | [Paper](https://arxiv.org/abs/2507.17746) | - |
| 2025-07 | `RLCF` | Checklists Are Better Than Reward Models For Aligning Language Models | [Paper](https://arxiv.org/abs/2507.18624) | - |
| 2025-07 | `PCL` | Post-Completion Learning for Language Models | [Paper](https://arxiv.org/abs/2507.20252) | - |
| 2025-07 | `K2` | Kimi K2: Open Agentic Intelligence | [Paper](https://arxiv.org/abs/2507.20534) | - |
| 2025-07 | `LIBRA` | LIBRA: Assessing and Improving Reward Model by Learning to Think | [Paper](https://arxiv.org/abs/2507.21645) | - |
| 2025-07 | `TP-GRPO` | Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner | [Paper](https://arxiv.org/abs/2507.23317) | [GitHub](https://github.com/cs-holder/tp_grpo) |
| 2025-06 | `RewardAnything` | RewardAnything: Generalizable Principle-Following Reward Models | [Paper](https://arxiv.org/abs/2506.03637) | [Blog](https://zhuohaoyu.github.io/RewardAnything/) |
| 2025-06 | `Writing-Zero` | Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards | [Paper](https://arxiv.org/abs/2506.00103) | - |
| 2025-06 | `Critique-GRPO` | Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback | [Paper](https://arxiv.org/abs/2506.03106) | [GitHub](https://github.com/zhangxy-2019/critique-GRPO) |
| 2025-06 | `PAG` | PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier | [Paper](https://arxiv.org/abs/2506.10406) | - |
| 2025-06 | `GRAM` | GRAM: A Generative Foundation Reward Model for Reward Generalization | [Paper](https://arxiv.org/abs/2506.14175) | [GitHub](https://github.com/NiuTrans/GRAM) |
| 2025-06 | `ProxyReward` | From General to Targeted Rewards: Surpassing GPT-4 in Open-Ended Long-Context Generation | [Paper](https://arxiv.org/abs/2506.16024) | - |
| 2025-06 | `QA-LIGN` | QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA | [Paper](https://arxiv.org/abs/2506.08123) | - |
| 2025-05 | `RM-R1` | RM-R1: Reward Modeling as Reasoning | [Paper](https://arxiv.org/abs/2505.02387) | [GitHub](https://github.com/RM-R1-UIUC/RM-R1) |
| 2025-05 | `J1` | J1: Incentivizing Thinking in LLM-as-a-Judge via RL | [Paper](https://arxiv.org/abs/2505.10320) | - |
| 2025-05 | `TinyV` | TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning | [Paper](https://arxiv.org/abs/2505.14625) | [GitHub](https://github.com/uw-nsl/TinyV) |
| 2025-05 | `General-Reasoner` | General-Reasoner: Advancing LLM Reasoning Across All Domains | [Paper](https://arxiv.org/abs/2505.14652) | - |
| 2025-05 | `RRM` | Reward Reasoning Model | [Paper](https://arxiv.org/abs/2505.14674) | - |
| 2025-05 | `RL Tango` | RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning | [Paper](https://arxiv.org/abs/2505.15034) | [GitHub](https://github.com/kaiwenzha/rl-tango) |
| 2025-05 | `Think-RM` | Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models | [Paper](https://arxiv.org/abs/2505.16265) | [GitHub](https://github.com/IlgeeHong/Think-RM) |
| 2025-04 | `JudgeLRM` | JudgeLRM: Large Reasoning Models as a Judge | [Paper](https://arxiv.org/abs/2504.00050) | [GitHub](https://github.com/NuoJohnChen/JudgeLRM) |
| 2025-04 | `GenPRM` | GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning | [Paper](https://arxiv.org/abs/2504.00891) | [GitHub](https://github.com/RyanLiu112/GenPRM) |
| 2025-04 | `DeepSeek-GRM` | Inference-Time Scaling for Generalist Reward Modeling | [Paper](https://arxiv.org/abs/2504.02495) | - |
| 2025-04 | `AIR` | AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset | [Paper](https://arxiv.org/abs/2504.03612) | - |
| 2025-04 | `Pairwise-RL` | A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization | [Paper](https://arxiv.org/abs/2504.04950) | - |
| 2025-04 | `xVerify` | xVerify: Efficient Answer Verifier for Reasoning Model Evaluations | [Paper](https://arxiv.org/abs/2504.10481) | [GitHub](https://github.com/IAAR-Shanghai/xVerify) |
| 2025-04 | `Seed-Thinking-v1.5` | Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning | [Paper](https://arxiv.org/abs/2504.13914) | - |
| 2025-04 | `ThinkPRM` | Process Reward Models That Think | [Paper](https://arxiv.org/abs/2504.16828) | [GitHub](https://github.com/mukhal/thinkprm) |
| 2025-03 | - | Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains | [Paper](https://arxiv.org/abs/2503.23829) | - |
| 2025-02 | - | Self-rewarding correction for mathematical reasoning | [Paper](https://arxiv.org/abs/2502.19613) | [GitHub](https://github.com/RLHFlow/Self-rewarding-reasoning-LLM) |
| 2024-10 | `GenRM` | Generative Reward Models | [Paper](https://arxiv.org/abs/2410.12832) | - |
| 2024-08 | `CLoud` | Critique-out-Loud Reward Models | [Paper](https://arxiv.org/abs/2408.11791) | [GitHub](https://github.com/zankner/CLoud) |
| 2024-08 | `Generative Verifier` | Generative Verifiers: Reward Modeling as Next-Token Prediction | [Paper](https://arxiv.org/abs/2408.15240) | - |
| 2024-01 | `Self-Rewarding LM` | Self-Rewarding Language Models | [Paper](https://arxiv.org/abs/2401.10020) | - |
| 2023-10 | `Auto-J` | Generative Judge for Evaluating Alignment | [Paper](https://arxiv.org/abs/2310.05470) | [GitHub](https://github.com/GAIR-NLP/auto-j) |
| 2023-06 | `LLM-as-a-Judge` | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | [Paper](https://arxiv.org/abs/2306.05685) | [GitHub](https://github.com/lm-sys/FastChat) |

#### Dense Rewards

| Date | Name | Title | Paper | GitHub |
|:-:|:-:|:-|:-:|:-:|
| 2025-09 | `Tree-GRPO` | Tree Search for LLM Agent Reinforcement Learning | [Paper](https://arxiv.org/abs/2509.21240) | [GitHub](https://github.com/AMAP-ML/Tree-GRPO) |
| 2025-09 | `AttnRL` | Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models | [Paper](https://arxiv.org/pdf/2509.26628) | [GitHub](https://github.com/RyanLiu112/AttnRL) |
| 2025-09 | `TARL` | Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents | [Paper](https://arxiv.org/pdf/2509.14480) | - |
| 2025-09 | `PROF` | Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training | [Paper](https://arxiv.org/abs/2509.03403) | [GitHub](https://github.com/Chenluye99/PROF) |
| 2025-09 | `HICRA` | Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning | [Paper](https://arxiv.org/abs/2509.03646) | - |
| 2025-08 | `KlearReasoner` | Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization | [Paper](https://arxiv.org/abs/2508.07629) | [GitHub](https://github.com/Kwai-Klear/KlearReasoner) |
| 2025-08 | `CAPO` | CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment | [Paper](https://arxiv.org/abs/2508.02298) | [GitHub](https://github.com/andyclsr/CAPO) |
| 2025-08 | `GTPO & GRPO-S` | GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy | [Paper](https://arxiv.org/abs/2508.04349) | - |
| 2025-08 | `VSRM` | Promoting Efficient Reasoning with Verifiable Stepwise Reward | [Paper](https://arxiv.org/abs/2508.10293) | - |
| 2025-08 | `G-RA` | Stabilizing Long-term Multi-turn Reinforcement Learning with Gated Rewards | [Paper](https://arxiv.org/abs/2508.10548) | - |
| 2025-08 | `SSPO` | SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression | [Paper](https://arxiv.org/abs/2508.12604) | - |
| 2025-08 | `AIRL-S` | Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS | [Paper](https://arxiv.org/abs/2508.14313) | - |
| 2025-08 | `TreePO` | TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling | [Paper](https://arxiv.org/abs/2508.17445) | [GitHub](https://github.com/multimodal-art-projection/TreePO) |
| 2025-08 | `MUA-RL` | MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use | [Paper](https://arxiv.org/abs/2508.18669) | - |
| 2025-07 | `SPRO` | Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement | [Paper](https://arxiv.org/abs/2507.01551) | - |
| 2025-07 | `FR3E` | First Return, Entropy-Eliciting Explore | [Paper](https://arxiv.org/abs/2507.07017) | - |
| 2025-07 | `ARPO` | Agentic Reinforced Policy Optimization | [Paper](https://arxiv.org/abs/2507.19849) | [GitHub](https://github.com/RUC-NLPIR/ARPO) |
| 2025-07 | `TP-GRPO` | Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner | [Paper](https://arxiv.org/abs/2507.23317) | [GitHub](https://github.com/cs-holder/tp_grpo) |
| 2025-06 | `TreeRPO` | TreeRPO: Tree Relative Policy Optimization | [Paper](https://arxiv.org/abs/2506.05183) | [GitHub](https://github.com/yangzhch6/TreeRPO) |
| 2025-06 | `TreeRL` | TreeRL: LLM Reinforcement Learning with On-Policy Tree Search | [Paper](https://arxiv.org/abs/2506.11902) | [GitHub](https://github.com/THUDM/TreeRL) |
| 2025-06 | `Entropy Advantage` | Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs | [Paper](https://arxiv.org/abs/2506.14758) | - |
| 2025-06 | `ReasonFlux-PRM` | ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs | [Paper](https://arxiv.org/abs/2506.18896) | [GitHub](https://github.com/Gen-Verse/ReasonFlux) |
| 2025-05 | `S-GRPO` | S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models | [Paper](https://arxiv.org/abs/2505.07686) | - |
| 2025-05 | `GiGPO` | Group-in-Group Policy Optimization for LLM Agent Training | [Paper](https://arxiv.org/abs/2505.10978) | [GitHub](https://github.com/langfengQ/verl-agent) |
| 2025-05 | - | Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment | [Paper](https://arxiv.org/abs/2505.11821) | - |
| 2025-05 | `Tango` | RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning | [Paper](https://arxiv.org/abs/2505.15034) | [GitHub](https://github.com/kaiwenzha/rl-tango) |
| 2025-05 | `StepSearch` | StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization | [Paper](https://arxiv.org/abs/2505.15107) | [GitHub](https://github.com/Zillwang/StepSearch) |
| 2025-05 | - | Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition | [Paper](https://arxiv.org/abs/2505.15922) | - |
| 2025-05 | `Tool-Star` | Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning | [Paper](https://arxiv.org/abs/2505.16410) | [GitHub](https://github.com/dongguanting/Tool-Star) |
| 2025-05 | `SPA-RL` | SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution | [Paper](https://arxiv.org/abs/2505.20732) | [GitHub](https://github.com/WangHanLinHenry/SPA-RL-Agent) |
| 2025-05 | `SPO` | Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models | [Paper](https://arxiv.org/abs/2505.23564) | [GitHub](https://github.com/AIFrameResearch/SPO) |
| 2025-04 | `GenPRM` | GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning | [Paper](https://arxiv.org/abs/2504.00891) | [GitHub](https://github.com/RyanLiu112/GenPRM) |
| 2025-04 | `PURE` | Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning | [Paper](https://arxiv.org/abs/2504.15275) | [GitHub](https://github.com/CJReinforce/PURE) |
| 2025-03 | `MRT` | Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning | [Paper](https://arxiv.org/abs/2503.07572) | [GitHub](https://github.com/CMU-AIRe/MRT) |
| 2025-03 | `SWEET-RL` | SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks | [Paper](https://arxiv.org/abs/2503.15478) | [GitHub](https://github.com/facebookresearch/sweet_rl) |
| 2025-02 | `PRIME` | Process Reinforcement through Implicit Rewards | [Paper](https://arxiv.org/abs/2502.01456) | [GitHub](https://github.com/PRIME-RL/PRIME) |
| 2024-12 | `Implicit PRM` | Free Process Rewards without Process Labels | [Paper](https://arxiv.org/abs/2412.01981) | [GitHub](https://github.com/PRIME-RL/ImplicitPRM) |
| 2024-10 | `VinePPO` | VinePPO: Refining Credit Assignment in RL Training of LLMs | [Paper](https://arxiv.org/abs/2410.01679) | [GitHub](https://github.com/McGill-NLP/VinePPO) |
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08146) | - |\n| 2024-04 | - | From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.12358) | - |\n| 2024-03 | `GELI` | Improving Dialogue Agents by Decomposing One Global Explicit Annotation with Local Implicit Multimodal Feedback | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.11330) | - |\n| 2023-12 | `Math-Shepherd` | Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08935) | - |\n| 2023-05 | `PRM800K` | Let's Verify Step by Step | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.20050) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopenai\u002Fprm800k?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fprm800k) |\n| 2022-11 | - | Solving math word problems with process- and outcome-based feedback | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.14275) | - |
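\n\nThe process-reward entries above score intermediate reasoning steps rather than only the final answer, so a trainer must decide how per-step scores become one trajectory-level signal; PURE, for example, argues for min-form instead of summed credit assignment. A minimal illustrative sketch of both aggregations follows; the function and its inputs are hypothetical, not code from any listed repository.\n\n```python\ndef aggregate_process_rewards(step_rewards, mode='min'):\n    # step_rewards: per-step scores from a process reward model.\n    # 'sum' spreads credit across all steps; 'min' (the min-form\n    # aggregation argued for by PURE) lets one bad step dominate.\n    if mode == 'sum':\n        return sum(step_rewards)\n    return min(step_rewards)\n\nassert aggregate_process_rewards([0.9, 0.7, 0.2]) == 0.2\n```\n\n#### Unsupervised Rewards\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `Vision-Zero` | Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25541) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwangqinsi1\u002FVision-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwangqinsi1\u002FVision-Zero) |\n| 2025-08 | `Co-Reward` | Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.00410) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftmlr-group\u002FCo-Reward?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ftmlr-group\u002FCo-Reward) |\n| 2025-08 | `SQLM` | Self-Questioning Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.03682) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flili-chen\u002Fself-questioning-lm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flili-chen\u002Fself-questioning-lm) |\n| 2025-08 | `R-zero` | R-Zero: Self-Evolving Reasoning LLM from Zero Data | 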
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.05004) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChengsong-Huang\u002FR-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChengsong-Huang\u002FR-Zero) |\n| 2025-08 | `ETTRL` | ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11356) | - |\n| 2025-07 | `RLSF` | Post-Training Large Language Models via Reinforcement Learning from Self-Feedback | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21931) | - |\n| 2025-06 | `RLSC` | Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06395) | - |\n| 2025-06 | `RPT` | Reinforcement Pre-Training | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08007) | - |\n| 2025-06 | `CoVo` | Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08745) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsastpg\u002FCoVo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsastpg\u002FCoVo) |\n| 2025-06 | `SEAL` | Self-Adapting Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10943) | - |\n| 2025-06 | `Spurious Rewards` | Spurious Rewards: Rethinking Training Signals in RLVR | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10947) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fruixin31\u002FSpurious_Rewards?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fruixin31\u002FSpurious_Rewards) |\n| 2025-06 | `No Free Lunch` | No Free Lunch: Rethinking Internal Feedback for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17219) | - |\n| 2025-05 | `Absolute Zero` | Absolute Zero: Reinforced Self-play Reasoning with Zero Data | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.03335) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLeapLabTHU\u002FAbsolute-Zero-Reasoner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLeapLabTHU\u002FAbsolute-Zero-Reasoner) |\n| 
2025-05 | `EM-RL` | The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15134) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshivamag125\u002FEM_PT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fshivamag125\u002FEM_PT) |\n| 2025-05 | `SSR-Zero` | SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16637) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKelaxon\u002FSSR-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKelaxon\u002FSSR-Zero) |\n| 2025-05 | - | Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19439) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinsightLLM\u002Frl-without-gt?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FinsightLLM\u002Frl-without-gt) |\n| 2025-05 | `RLIF` | Learning to Reason without External Rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19590) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsunblaze-ucb\u002FIntuitor?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsunblaze-ucb\u002FIntuitor) |\n| 2025-05 | `SeRL` | SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20347) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwantbook-book\u002FSeRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwantbook-book\u002FSeRL) |\n| 2025-05 | `SRT` | Can Large Reasoning Models Self-Train? 
| [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21444) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftajwarfahim\u002Fsrt?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ftajwarfahim\u002Fsrt) |\n| 2025-05 | `RENT-RL` | Maximizing Confidence Alone Improves Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22660) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsatrams\u002Frent-rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsatrams\u002Frent-rl) |\n| 2025-04 | `EMPO` | Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.05812) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQingyangZhang\u002FEMPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQingyangZhang\u002FEMPO) |\n| 2025-04 | `TRANS-ZERO` | TRANS-ZERO: Self-Play Incentivizes Large Language Models for Multilingual Translation Without Parallel Data | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14669) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNJUNLP\u002Ftrans0?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNJUNLP\u002Ftrans0) |\n| 2025-04 | `TTRL` | TTRL: Test-Time Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16084) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FTTRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FTTRL) |\n| 2025-04 | `One-Shot-RLVR` | Reinforcement Learning for Reasoning in Large Language Models with One Training Example | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20571) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fypwang61\u002FOne-Shot-RLVR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fypwang61\u002FOne-Shot-RLVR) |\n| 2025-02 | `CAGSR` | A Self-Supervised Reinforcement Learning Approach for Fine-Tuning Large Language Models Using Cross-Attention Signals | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10482) | - |\n| 2024-07 | `MINIMO` | Learning Formal Mathematics From Intrinsic Motivation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.00695) | [![GitHub 
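Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgpoesia\u002Fminimo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fgpoesia\u002Fminimo) |\n\nSeveral entries above (e.g., TTRL, SRT, EMPO) replace ground-truth labels with self-consistency signals such as majority voting over sampled answers. A minimal sketch of such a majority-vote pseudo-reward is given below, assuming final answers have already been extracted from the rollouts; it illustrates the idea and is not code from any listed repository.\n\n```python\nfrom collections import Counter\n\ndef majority_vote_rewards(answers):\n    # answers: final answers from N rollouts of one prompt. The most\n    # frequent answer serves as a pseudo-label (TTRL-style); rollouts\n    # agreeing with it get reward 1.0, the rest get 0.0.\n    pseudo_label, _ = Counter(answers).most_common(1)[0]\n    return [1.0 if a == pseudo_label else 0.0 for a in answers]\n\nprint(majority_vote_rewards(['42', '42', '41', '42']))  # [1.0, 1.0, 0.0, 1.0]\n```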
\n\n#### Rewards Shaping\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `CDE` | CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.09675) | - |\n| 2025-09 | `DARLING` | Jointly Reinforcing Diversity and Quality in Language Model Generations | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02534) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Fdarling?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdarling) |\n| 2025-09 | `DRER` | Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06024) | - |\n| 2025-09 | `OBE` | Outcome-based Exploration for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06941) | - |\n| 2025-08 | `Pass@kTraining` | Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10751) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FPassk_Training?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FPassk_Training) |\n| 2025-05 | `PKPO` | Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15201) | - |\n| 2025-05 | `rl-without-gt` | Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19439) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinsightLLM\u002Frl-without-gt?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FinsightLLM\u002Frl-without-gt) |\n| 2025-03 | `CrossDomain-RLVR` | Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23829) | - |\n| 2025-01 | `DeepSeek-R1` | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | 
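[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12948) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1) |\n| 2024-09 | `Qwen2.5-Math` | Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12122) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2.5-Math?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Math) |\n\nMany shaped rewards in this table add auxiliary terms, such as format compliance, on top of a verifiable correctness term, the rule-based recipe popularized by DeepSeek-R1. A minimal sketch under that reading is shown below; the 'Answer:' convention and the 0.1 weight are illustrative assumptions, not any published configuration.\n\n```python\ndef shaped_reward(response, gold_answer, fmt_weight=0.1):\n    # Rule-based shaping in the spirit of DeepSeek-R1: a verifiable\n    # accuracy term plus a small bonus for following the answer format.\n    lines = [l for l in response.splitlines() if l.strip()]\n    fmt_ok = bool(lines) and lines[-1].startswith('Answer:')\n    pred = lines[-1][len('Answer:'):].strip() if fmt_ok else ''\n    accuracy = 1.0 if pred == str(gold_answer) else 0.0\n    return accuracy + fmt_weight * (1.0 if fmt_ok else 0.0)\n\nprint(shaped_reward('Answer: 7', 7))  # 1.1\n```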
\n\n### Policy Optimization\n#### Policy Gradient Objective\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2017-07 | `PPO` | Proximal policy optimization algorithms | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1707.06347) | - |\n| - | `PG` | Policy gradient methods for reinforcement learning with function approximation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F1999\u002Ffile\u002F464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf) | - |\n| - | `REINFORCE` | Simple statistical gradient-following algorithms for connectionist reinforcement learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1007\u002FBF00992696) | - |\n| - | `TRPO` | Trust region policy optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fproceedings.mlr.press\u002Fv37\u002Fschulman15.pdf) | - |
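\n\nThese objectives share one skeleton: reweight the (log-)likelihood of sampled tokens by an advantage estimate, with PPO adding a clipped probability ratio for stability. A minimal sketch of the PPO clipped surrogate on per-token tensors follows, assuming log-probabilities and advantages are already computed; it illustrates the objective rather than any repository's trainer.\n\n```python\nimport torch\n\ndef ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):\n    # Clipped surrogate of PPO (Schulman et al., 2017):\n    # maximize E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)],\n    # where r = exp(logp_new - logp_old) is the policy ratio.\n    ratio = torch.exp(logp_new - logp_old)\n    unclipped = ratio * advantages\n    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages\n    return -torch.min(unclipped, clipped).mean()\n```\n\n#### Critic-based Algorithms\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `VL-DAC` | Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.04280) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcorl-team\u002FVL-DAC?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcorl-team\u002FVL-DAC) |\n| 2025-08 | `VRPO` | VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.03058) | - |\n| 2025-05 | `VerIPO` | VerIPO: Long Reasoning Video-R1 Model with Iterative Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19000) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHITsz-TMG\u002FVerIPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FHITsz-TMG\u002FVerIPO) |\n| 2025-04 | 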
`VAPO` | VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.05118) | - |\n| 2025-03 | `VCPPO` | What’s Behind PPO’s Collapse in Long-CoT? Value Optimization Holds the Secret | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.01491) | - |\n| 2025-03 | `Open-Reasoner-Zero` | Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.24290) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero) |\n| 2025-02 | `PRIME` | Process Reinforcement through Implicit Rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.01456) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FPRIME?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FPRIME) |\n| 2024-12 | `Implicit PRM` | Free Process Rewards without Process Labels | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.01981) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flifan-yuan\u002FImplicitPRM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flifan-yuan\u002FImplicitPRM) |\n| 2023-12 | `Math-Shepherd` | Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08935) | - |\n| 2015-06 | `GAE` | High-dimensional continuous control using generalized advantage estimation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1506.02438) | - |\n| - | `AutoPSV` | AutoPSV: Automated Process-Supervised Verifier 
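| [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Ffile\u002F9246aa822579d9b29a140ecdac36ad60-Paper-Conference.pdf) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frookie-joe\u002FAutoPSV?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Frookie-joe\u002FAutoPSV) |\n\nCritic-based methods learn a value function and derive advantages from it, most commonly through Generalized Advantage Estimation (GAE, listed above). A minimal reference sketch of the GAE backward recursion follows, assuming per-step rewards and critic value estimates are given and the episode terminates with value 0.\n\n```python\ndef gae_advantages(rewards, values, gamma=1.0, lam=0.95):\n    # GAE (Schulman et al., 2015): delta_t = r_t + gamma * V_{t+1} - V_t,\n    # A_t = delta_t + gamma * lam * A_{t+1}, computed backwards in time.\n    advantages, next_value, running = [], 0.0, 0.0\n    for r, v in zip(reversed(rewards), reversed(values)):\n        delta = r + gamma * next_value - v\n        running = delta + gamma * lam * running\n        advantages.append(running)\n        next_value = v\n    return advantages[::-1]\n```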
\n\n#### Critic-Free Algorithms\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `UPGE` | Towards a Unified View of Large Language Model Post-Training | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.04419) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FUnify-Post-Training?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FUnify-Post-Training) |\n| 2025-09 | `SPO` | Single-stream Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.13232) | - |\n| 2025-08 | `LitePPO` | Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.08221v1) | - |\n| 2025-07 | `R1-RE` | R1-RE: Cross-Domain Relation Extraction with RLVR | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04642) | - |\n| 2025-07 | `GSPO` | Group Sequence Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.18071) | - |\n| 2025-06 | `CISPO` | MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.13585) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiniMax-AI\u002FMiniMax-M1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMiniMax-AI\u002FMiniMax-M1) |\n| 2025-05 | `KRPO` | Kalman Filter Enhanced Group Relative Policy Optimization for Language Model Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07527) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbillhhh\u002FKRPO_LLMs_RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fbillhhh\u002FKRPO_LLMs_RL) |\n| 2025-05 | `CPGD` | CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.12504) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FModalMinds\u002FMM-EUREKA?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FModalMinds\u002FMM-EUREKA) |\n| 2025-05 | `NFT` | Bridging Supervised Learning and Reinforcement Learning in Math Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.18116) | - |\n| 2025-05 | `Clip-Cov\u002FKL-Cov` | The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.22617) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FEntropy-Mechanism-of-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FEntropy-Mechanism-of-RL) |\n| 2025-03 | `OpenVLThinker` | OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.17352) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyihedeng9\u002FOpenVLThinker?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyihedeng9\u002FOpenVLThinker) |\n| 2025-03 | `DAPO` | DAPO: An Open-Source LLM Reinforcement Learning System at Scale | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.14476) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBytedTsinghua-SIA\u002FDAPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FBytedTsinghua-SIA\u002FDAPO) |\n| 2025-03 | `Dr. 
GRPO` | Understanding R1-Zero-Like Training: A Critical Perspective | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.20783) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsail-sg\u002Funderstand-r1-zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Funderstand-r1-zero) |\n| 2025-01 | `Kimi k1.5` | Kimi k1.5: Scaling Reinforcement Learning with LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.12599) | - |\n| 2024-02 | `RLOO` | Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.14740) | - |\n| 2024-02 | `GRPO` | DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.03300) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-Math?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-Math) |\n| 2023-10 | `ReMax` | ReMax: A Simple, Effective, and Efficient Method for Aligning Large Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.10505) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliziniu\u002FReMax?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fliziniu\u002FReMax) |\n| - | `REINFORCE` | Simple statistical gradient-following algorithms for connectionist reinforcement learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1007\u002FBF00992696) | - |\n| - | `REINFORCE++` | REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F387487679_REINFORCE_An_Efficient_RLHF_Algorithm_with_Robustnessto_Both_Prompt_and_Reward_Models) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenRLHF\u002FOpenRLHF?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF) |\n| - | `VinePPO` | VinePPO: Unlocking RL Potential for LLM Reasoning through Refined Credit Assignment | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fopenreview.net\u002Fpdf?id=5mJrGtXVwz) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMcGill-NLP\u002FVinePPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMcGill-NLP\u002FVinePPO) |\n| - | `FlashRL` | Fast RL training with Quantized Rollouts | 
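[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Ffengyao.notion.site\u002Fflash-rl) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyaof20\u002FFlash-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyaof20\u002FFlash-RL) |\n\nCritic-free methods such as GRPO, RLOO, and REINFORCE++ drop the value network and instead normalize each reward against the other rollouts sampled for the same prompt. A minimal sketch of the GRPO-style group-relative advantage is shown below; it covers only the advantage computation, not a full trainer, and the epsilon value is an illustrative assumption.\n\n```python\nfrom statistics import mean, pstdev\n\ndef group_relative_advantages(rewards, eps=1e-6):\n    # GRPO-style baseline: standardize each rollout's reward by the mean\n    # and population std of its own group (all rollouts for one prompt).\n    mu, sigma = mean(rewards), pstdev(rewards)\n    return [(r - mu) \u002F (sigma + eps) for r in rewards]\n\nprint(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))\n```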
\n\n#### Off-policy Optimization\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `BRIDGE` | Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.06948) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChanLiang\u002FBRIDGE?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChanLiang\u002FBRIDGE) |\n| 2025-09 | `HPT` | Towards a Unified View of Large Language Model Post-Training | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.04419) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FUnify-Post-Training?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FUnify-Post-Training) |\n| 2025-08 | `DFT` | On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.05629) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyongliang-wu\u002FDFT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyongliang-wu\u002FDFT) |\n| 2025-08 | `RED` | Recall-Extend Dynamics: Enhancing Small Language Models through Controlled Exploration and Refined Offline Integration | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.16677) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmillioniron\u002FOpenRLHF-Millioniron-?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmillioniron\u002FOpenRLHF-Millioniron-) |\n| 2025-07 | `Prefix-RFT` | Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.01679) | - |\n| 2025-07 | `ReMix` | Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.06892) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAnitaLeungxx\u002FReMix-Reincarnated-Mix-policy-Proximal-Policy-Gradient?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAnitaLeungxx\u002FReMix-Reincarnated-Mix-policy-Proximal-Policy-Gradient) |\n| 2025-06 | `ReLIFT` | Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for 
Hardest Questions | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.07527) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTheRoadQaQ\u002FReLIFT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTheRoadQaQ\u002FReLIFT) |\n| 2025-06 | `BREAD` | BREAD: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.17211) | - |\n| 2025-06 | `SRFT` | SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.19767) | - |\n| 2025-05 | `AMPO` | Adaptive Thinking via Mode Policy Optimization for Social Language Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.02156) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMozerWang\u002FAMPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMozerWang\u002FAMPO) |\n| 2025-05 | `UFT` | UFT: Unifying Supervised and Reinforcement Fine-Tuning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.16984) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliumy2010\u002FUFT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fliumy2010\u002FUFT) |\n| 2025-04 | `LUFFY` | Learning to Reason under Off-Policy Guidance | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.14945) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FElliottYan\u002FLUFFY?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FElliottYan\u002FLUFFY) |\n| 2025-03 | `SPO` | Soft Policy Optimization: Online Off-Policy RL for Sequence Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.05453) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIFrameResearch\u002FSPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAIFrameResearch\u002FSPO) |\n| 2025-03 | `TOPR` | Tapered Off-Policy REINFORCE: Stable and Efficient Reinforcement Learning for LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.14286) | - |\n| 2024-05 | `IFT` | Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.11870) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FIntuitive-Fine-Tuning?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FIntuitive-Fine-Tuning) |\n| 2023-05 | `DPO` | Direct Preference Optimization: Your Language Model is Secretly a Reward Model | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.18290) | - |\n| 2015-11 | - | Fixed point quantization of deep convolutional networks | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1511.06393) | - |\n| - | - | Your Efficient RL Framework Secretly Brings You Off-Policy RL Training | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Ffengyao.notion.site\u002Foff-policy-rl) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyaof20\u002Fverl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyaof20\u002Fverl) |\n\n#### Off-policy Optimization (Experience Replay)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `SAPO` | Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.08721) | - |\n| 2025-09 | `SEELE` | Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06923) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChillingDream\u002Fseele?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChillingDream\u002Fseele) |\n| 2025-08 | `Memory-R1` | Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.19828) | - |\n| 2025-07 | `RLEP` | RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.07451) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKwai-Klear\u002FRLEP?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKwai-Klear\u002FRLEP) |\n| 2025-06 | `EFRame` | EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.22200) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F597358816\u002FEFRame?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002F597358816\u002FEFRame) |\n| 2025-05 | `ARPO` | ARPO: End-to-End Policy Optimization for GUI Agents with Experience 
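Replay | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.16282) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FARPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FARPO) |\n| 2025-04 | - | Improving RL Exploration for LLM Reasoning through Retrospective Replay | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.14363) | - |\n\nThe replay-based entries above (e.g., RLEP, EFRame) keep rollouts instead of discarding them after a single update; the core data structure is a bounded buffer that training batches are sampled from. A minimal sketch follows, with the capacity and uniform sampling as illustrative assumptions rather than any paper's exact design.\n\n```python\nimport random\nfrom collections import deque\n\nclass ReplayBuffer:\n    # Bounded rollout store: oldest experience is evicted first, and\n    # training batches mix fresh and replayed trajectories.\n    def __init__(self, capacity=4096):\n        self.buffer = deque(maxlen=capacity)\n\n    def add(self, prompt, response, reward):\n        self.buffer.append((prompt, response, reward))\n\n    def sample(self, batch_size):\n        pool = list(self.buffer)\n        return random.sample(pool, min(batch_size, len(pool)))\n```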
\n\n#### Regularization Objectives\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-10 | `ASPO` | ASPO: Asymmetric Importance Sampling Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.06062) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwizard-III\u002FArcher2.0?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwizard-III\u002FArcher2.0) |\n| 2025-09 | `CE-GPPO` | CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2509.20712) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKwai-Klear\u002FCE-GPPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKwai-Klear\u002FCE-GPPO) |\n| 2025-09 | `CDE` | CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.09675) | - |\n| 2025-09 | `DPH RL` | The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.07430) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fseamoke\u002FDPH-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fseamoke\u002FDPH-RL) |\n| 2025-09 | `empgseed-seed` | Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.09265) | - |\n| 2025-07 | `Archer` | Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.15778) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwizard-III\u002FArcherCodeR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwizard-III\u002FArcherCodeR) |\n| 2025-06 | `Bingo` | Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and 
Significance-based Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08125) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FBingo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FBingo) |\n| 2025-06 | `HighEntropy RL` | Beyond the 80\u002F20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.01939) | - |\n| 2025-06 | `Entropy RL` | Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14758) | - |\n| 2025-06 | `ALP RL` | Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05256) | - |\n| 2025-05 | `DisCO` | DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12366) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOptimization-AI\u002FDisCO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOptimization-AI\u002FDisCO) |\n| 2025-05 | `Skywork OR1` | Skywork Open Reasoner 1 Technical Report | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.22312) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-OR1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-OR1) |\n| 2025-05 | `Entropy Mechanism` | The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.22617) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FEntropy-Mechanism-of-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FEntropy-Mechanism-of-RL) |\n| 2025-05 | `ProRL` | ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.24864) | - |\n| 2025-05 | `Short RL` | Efficient RL Training for Reasoning Models via Length-Aware Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.1228) | [![GitHub 
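Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flblankl\u002FShort-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flblankl\u002FShort-RL) |\n| 2025-03 | `DAPO` | DAPO: An Open-Source LLM Reinforcement Learning System at Scale | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.14476) | - |\n| 2025-03 | `L1` | L1: Controlling how long a reasoning model thinks with reinforcement learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.04697) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcmu-l3\u002Fl1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcmu-l3\u002Fl1) |\n\nA recurring concern in this table (e.g., the Entropy Mechanism and ProRL entries) is keeping policy entropy from collapsing during RL, typically by adding an entropy bonus or a divergence penalty to the policy loss. A minimal sketch of an entropy-regularized loss follows; the coefficient is an illustrative assumption, not a recommended setting.\n\n```python\nimport torch\n\ndef entropy_regularized_loss(policy_loss, token_logits, ent_coef=0.001):\n    # Subtracting a scaled entropy term rewards more diffuse token\n    # distributions, countering the entropy collapse often seen in RLVR.\n    probs = torch.softmax(token_logits, dim=-1)\n    logp = torch.log_softmax(token_logits, dim=-1)\n    entropy = -(probs * logp).sum(dim=-1).mean()\n    return policy_loss - ent_coef * entropy\n```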
\n\n### Sampling Strategy\n#### Dynamic and Structured Sampling\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-10 | `EEPO` | EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.05837) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChanLiang\u002FEEPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChanLiang\u002FEEPO) |\n| 2025-09 | `AttnRL` | Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.26628) | - |\n| 2025-09 | `DACE` | Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.00125) | - |\n| 2025-09 | `Parallel-R1` | Parallel-R1: Towards Parallel Thinking via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.07980) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhengkid\u002FParallel-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzhengkid\u002FParallel-R1) |\n| 2025-08 | `G^2RPO-A` | G^2RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.13023) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FT-Lab-CUHKSZ\u002FG2RPO-A?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FT-Lab-CUHKSZ\u002FG2RPO-A) |\n| 2025-08 | `RuscaRL` | Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16949) | - |\n| 2025-08 | `TreePO` | TreePO: Bridging the Gap of Policy 
Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.17445) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmultimodal-art-projection\u002FTreePO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmultimodal-art-projection\u002FTreePO) |\n| 2025-07 | `ARPO` | Agentic Reinforced Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.19849) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUC-NLPIR\u002FARPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUC-NLPIR\u002FARPO) |\n| 2025-06 | `TreeRPO` | TreeRPO: Tree Relative Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05183) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangzhch6\u002FTreeRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyangzhch6\u002FTreeRPO) |\n| 2025-06 | `E2H` | Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06632) | - |\n| 2025-06 | `TreeRL` | TreeRL: LLM Reinforcement Learning with On-Policy Tree Search | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11902) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FTreeRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FTreeRL) |\n| 2025-05 | `ToTRL` | ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12717) | - |\n| 2025-03 | `DARS` | DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14269) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvaibhavagg303\u002FDARS-Agent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fvaibhavagg303\u002FDARS-Agent) |\n| 2025-03 | `DAPO` | DAPO: An Open-Source LLM Reinforcement Learning System at Scale | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14476) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBytedTsinghua-SIA\u002FDAPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FBytedTsinghua-SIA\u002FDAPO) |\n| 2025-02 | `PRIME` | Process Reinforcement 
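through Implicit Rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FPRIME?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FPRIME) |\n| - | `POLARIS` | POLARIS: A POst-training recipe for scaling reinforcement Learning on Advanced ReasonIng modelS | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fhkunlp.github.io\u002Fblog\u002F2025\u002FPolaris\u002F) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChenxinAn-fdu\u002FPOLARIS?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChenxinAn-fdu\u002FPOLARIS) |\n\nA representative trick here is DAPO-style dynamic sampling: prompts whose rollout group is uniformly correct or uniformly wrong produce zero group-relative advantage, so they are filtered out and resampled. A minimal sketch of that filter under binary rewards follows; it illustrates the idea, not the DAPO system itself.\n\n```python\ndef keep_for_update(group_rewards):\n    # A prompt is informative only when its rollout group is mixed:\n    # all-correct or all-wrong groups give zero normalized advantage.\n    return 0 < sum(group_rewards) < len(group_rewards)\n\ngroups = {'p1': [1, 1, 1, 1], 'p2': [1, 0, 0, 1], 'p3': [0, 0, 0, 0]}\nprint([p for p, rs in groups.items() if keep_for_update(rs)])  # ['p2']\n```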
\n\n#### Sampling Hyper-Parameters\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `GFPO` | Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.09726) | - |\n| 2025-06 | `AceReason-Nemotron 1.1` | AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13284) | - |\n| 2025-06 | `T-PPO` | Truncated Proximal Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.15050) | - |\n| 2025-06 | `Confucius3-Math` | Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.18330) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fnetease-youdao\u002FConfucius3-Math?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fnetease-youdao\u002FConfucius3-Math) |\n| 2025-05 | `E3-RL4LLMs` | Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.18573) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLiaoMengqi\u002FE3-RL4LLMs?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLiaoMengqi\u002FE3-RL4LLMs) |\n| 2025-05 | `AceReason-Nemotron` | AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16400) | - |\n| 2025-05 | `ProRL` | ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24864) | - |\n| 2025-03 
| - | Output Length Effect on DeepSeek-R1's Safety in Forced Thinking | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01923) | - |\n| 2025-03 | `DAPO` | DAPO: An Open-Source LLM Reinforcement Learning System at Scale | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14476) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBytedTsinghua-SIA\u002FDAPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FBytedTsinghua-SIA\u002FDAPO) |\n| 2025-02 | `PRIME` | Process Reinforcement through Implicit Rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FPRIME?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FPRIME) |\n| 2025-02 | - | Training Language Models to Reason Efficiently | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04463) | - |\n| - | `DeepScaleR` | DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fpretty-radio-b75.notion.site\u002FDeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frllm-org\u002Frllm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Frllm-org\u002Frllm) |\n| - | `POLARIS` | POLARIS: A POst-training recipe for scaling reinforcement Learning on Advanced ReasonIng modelS | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fhonorable-payment-890.notion.site\u002FPOLARIS-A-POst-training-recipe-for-scaling-reinforcement-Learning-on-Advanced-ReasonIng-modelS-1dfa954ff7c38094923ec7772bf447a1) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChenxinAn-fdu\u002FPOLARIS?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChenxinAn-fdu\u002FPOLARIS) |\n\n### Training Resource\n#### Static Corpus (Code)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-05 | `rStar-Coder` | rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21297) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FrStar?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FrStar) |\n| 2025-04 | `Z1` | Z1: Efficient Test-time Scaling with Code | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00810) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fefficientscaling\u002FZ1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fefficientscaling\u002FZ1) |\n| 2025-04 | `OpenCodeReasoning` | OpenCodeReasoning: Advancing Data Distillation for Competitive Coding | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.01943) | - |\n| 2025-04 | `LeetCodeDataset` | LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14655) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fnewfacade\u002FLeetCodeDataset?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fnewfacade\u002FLeetCodeDataset) |\n| 2025-03 | `KodCode` | KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.02951) | - |\n| 2025-01 | `SWE-Fixer` | SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.05040) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FSWE-Fixer?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FInternLM\u002FSWE-Fixer) |\n| 2024-12 | `SWE-Gym` | Training Software Engineering Agents and Verifiers with SWE-Gym | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.21139) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSWE-Gym\u002FSWE-Gym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSWE-Gym\u002FSWE-Gym) |\n| - | `Code-R1` | Code-R1: Reproducing R1 for Code with Reliable Rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fgithub.com\u002Fganler\u002Fcode-r1) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fganler\u002Fcode-r1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fganler\u002Fcode-r1) |\n| - | `codeforces-cots` | CodeForces CoTs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopen-r1\u002Fcodeforces-cots) | - |\n| - | `DeepCoder` | DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fwww.together.ai\u002Fblog\u002Fdeepcoder) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fagentica-project\u002Frllm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fagentica-project\u002Frllm) |\n\n#### Static Corpus (STEM)\n\n| Date | Name | Title | Paper | Github 
|\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `SSMR-Bench` | Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.04059) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLinzwcs\u002FAutoMusicTheoryQA?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLinzwcs\u002FAutoMusicTheoryQA) |\n| 2025-09 | `Loong` | Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03059) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcamel-ai\u002Floong?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcamel-ai\u002Floong) |\n| 2025-07 | `MegaScience` | MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.16812) | - |\n| 2025-06 | `ReasonMed` | ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.09513) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuSun-Work\u002FReasonMed?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FYuSun-Work\u002FReasonMed) |\n| 2025-05 | `ChemCoTDataset` | Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21318) | - |\n| 2025-02 | `NaturalReasoning` | NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13124) | - |\n| 2025-01 | `SCP-116K` | SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.15587) | - |\n\n#### Static Corpus (Math)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-07 | `MiroMind-M1-RL-62K` | MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.14683) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiroMindAI\u002FMiroMind-M1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMiroMindAI\u002FMiroMind-M1) |\n| 2025-04 | `DeepMath` | DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset 
for Advancing Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11456) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzwhe99\u002FDeepMath?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzwhe99\u002FDeepMath) |\n| 2025-04 | `OpenMathReasoning` | AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16891) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FNeMo-Skills?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Skills) |\n| 2025-03 | `STILL-3-RL` | An Empirical Study on Eliciting and Improving R1-like Reasoning Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.04548) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FSlow_Thinking_with_LLMs?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FSlow_Thinking_with_LLMs) |\n| 2025-03 | `Light-R1` | Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10460) | - |\n| 2025-03 | `DAPO` | DAPO: An Open-Source LLM Reinforcement Learning System at Scale | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14476) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBytedTsinghua-SIA\u002FDAPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FBytedTsinghua-SIA\u002FDAPO) |\n| 2025-03 | `OpenReasoningZero` | Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24290) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero) |\n| 2025-02 | `PRIME` | Process Reinforcement through Implicit Rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FPRIME?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FPRIME) |\n| 2025-02 | `LIMO` | LIMO: Less is More for Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03387) | [![GitHub
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGAIR-NLP\u002FLIMO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FLIMO) |\n| 2025-02 | `LIMR` | LIMR: Less is More for RL Scaling | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11886) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGAIR-NLP\u002FLIMR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FLIMR) |\n| 2025-02 | `Big-MATH` | Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17387) | - |\n| - | `NuminaMath 1.5` | NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](http:\u002F\u002Ffaculty.bicmr.pku.edu.cn\u002F~dongbin\u002FPublications\u002Fnumina_dataset.pdf) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fproject-numina\u002Faimo-progress-prize?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fproject-numina\u002Faimo-progress-prize) |\n| - | `OpenR1-Math` | Open R1: A fully open reproduction of DeepSeek-R1 | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fopen-r1) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuggingface\u002Fopen-r1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fopen-r1) |\n| - | `DeepScaleR` | DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fpretty-radio-b75.notion.site\u002FDeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2) | - |\n\n#### Static Corpus (Agent)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `ASearcher` | Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.07976) | - |\n| 2025-07 | `WebShaper` | WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.15061) | - |\n| 2025-05 | `ZeroSearch` | ZeroSearch: Incentivize the Search Capability of LLMs without Searching | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.04588) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FZeroSearch?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FZeroSearch) |\n| 2025-04 | `ToolRL` | 
ToolRL: Reward is All Tool Learning Needs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.13958) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fqiancheng0\u002FToolRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fqiancheng0\u002FToolRL) |\n| 2025-03 | `Search-R1` | Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09516) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPeterGriffinJin\u002FSearch-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPeterGriffinJin\u002FSearch-R1) |\n| 2025-03 | `ToRL` | ToRL: Scaling Tool-Integrated RL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23383) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGAIR-NLP\u002FToRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FToRL) |\n| 2025-03 | `DeepRetrieval` | DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00223) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpat-jj\u002FDeepRetrieval?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fpat-jj\u002FDeepRetrieval) |\n| - | `MiroThinker` | MiroVerse V0.1: A Reproducible, Full-Trajectory, Ever-Growing Deep Research Dataset | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmiromind-ai\u002FMiroVerse-v0.1) | - |\n\n\n#### Static Corpus (Mix)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `Graph-R1` | Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problem | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.17387) | - |\n| 2025-06 | `RewardAnything` | RewardAnything: Generalizable Principle-Following Reward Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.03637) | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fzhuohaoyu.github.io\u002FRewardAnything\u002F) |\n| 2025-06 | `guru-RL-92k` | Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14965) | - |\n| 2025-05 | `Llama-Nemotron-PT` | Llama-Nemotron: Efficient Reasoning Models | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00949) | - |\n| 2025-05 | `SkyWork OR1` | Skywork Open Reasoner 1 Technical Report | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22312) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-OR1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-OR1) |\n| 2025-03 | `OpenVLThinker` | OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.17352) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyihedeng9\u002FOpenVLThinker?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyihedeng9\u002FOpenVLThinker) |\n| - | `AM-DS-R1-0528-Distilled` | AM-DeepSeek-R1-0528-Distilled | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fgithub.com\u002Fa-m-team\u002Fa-m-models) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fa-m-team\u002Fa-m-models?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fa-m-team\u002Fa-m-models) |\n| - | `dolphin-r1` | Dolphin R1 Dataset | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQuixiAI\u002Fdolphin-r1) | - |\n| - | `SYNTHETIC-1\u002F2` | SYNTHETIC-1 Release: Two Million Collaboratively Generated Reasoning Traces from Deepseek-R1 | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fwww.primeintellect.ai\u002Fblog\u002Fsynthetic-1-release) | - |\n\n#### Dynamic Environment (Rule-based)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-06 | `ProtoReasoning` | ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.15211) | - |\n| 2025-05 | `SynLogic` | SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19641) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiniMax-AI\u002FSynLogic?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMiniMax-AI\u002FSynLogic) |\n| 2025-05 | `Reasoning Gym` | REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24760) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopen-thought\u002Freasoning-gym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fopen-thought\u002Freasoning-gym) |\n| 2025-05 | `Enigmata` | Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19914) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBytedTsinghua-SIA\u002FEnigmata?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FBytedTsinghua-SIA\u002FEnigmata) |\n| 2025-02 | `AutoLogi` | AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16906) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F8188zq\u002FAutoLogi?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002F8188zq\u002FAutoLogi) |\n| 2025-02 | `Logic-RL` | Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14768) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUnakar\u002FLogic-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FUnakar\u002FLogic-RL) |\n\n#### Dynamic Environment (Code-based)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-06 | `AgentCPM-GUI` | AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01391) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenBMB\u002FAgentCPM-GUI?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FAgentCPM-GUI) |\n| 2025-06 | `MedAgentGym` | MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04405) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwshi83\u002FMedAgentGym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwshi83\u002FMedAgentGym) |\n| 2025-05 | `MLE-Dojo` | MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07782) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMLE-Dojo\u002FMLE-Dojo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMLE-Dojo\u002FMLE-Dojo) |\n| 2025-05 | `SWE-rebench` | SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software 
Engineering Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20411) | - |\n| 2025-05 | `ZeroGUI` | ZeroGUI: Automating Online GUI Learning at Zero Human Cost | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23762) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FZeroGUI?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FZeroGUI) |\n| 2025-04 | `R2E-Gym` | R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07164) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FR2E-Gym\u002FR2E-Gym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FR2E-Gym\u002FR2E-Gym) |\n| 2025-03 | `ReSearch` | ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.19470) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAgent-RL\u002FReCall?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAgent-RL\u002FReCall) |\n| 2025-02 | `MLGym` | MLGym: A New Framework and Benchmark for Advancing AI Research Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14499) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002FMLGym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FMLGym) |\n| 2024-07 | `AppWorld` | AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.18901) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FStonyBrookNLP\u002Fappworld?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FStonyBrookNLP\u002Fappworld) |\n\n#### Dynamic Environment (Game-based)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `PuzzleJAX` | PuzzleJAX: A Benchmark for Reasoning and Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16821) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsmearle\u002Fscript-doctor?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsmearle\u002Fscript-doctor) |\n| 2025-06 | `Play to Generalize` | Play to Generalize: Learning to Reason Through Game Play | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08011) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyunfeixie233\u002FViGaL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyunfeixie233\u002FViGaL) |\n| 2025-06 | `Optimus-3` | Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10357) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJiuTian-VL\u002FOptimus-3?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FJiuTian-VL\u002FOptimus-3) |\n| 2025-05 | `lmgame-Bench` | lmgame-Bench: How Good are LLMs at Playing Games? | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15146) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flmgame-org\u002FGamingAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flmgame-org\u002FGamingAgent) |\n| 2025-05 | `G1` | G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.13426) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fchenllliang\u002FG1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fchenllliang\u002FG1) |\n| 2025-05 | `Code2Logic` | Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.13886) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftongjingqi\u002Fcode2logic?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ftongjingqi\u002Fcode2logic) |\n| 2025-05 | `KORGym` | KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14552) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmultimodal-art-projection\u002FKORGym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmultimodal-art-projection\u002FKORGym) |\n| 2025-04 | `Cross-env-coop` | Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.12714) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKJha02\u002FcrossEnvCooperation?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKJha02\u002FcrossEnvCooperation) |\n| 2022-03 | `ScienceWorld` | ScienceWorld: Is your Agent Smarter than a 5th 
Grader? | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.07540) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002FScienceWorld?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fallenai\u002FScienceWorld) |\n| 2020-10 | `ALFWorld` | ALFWorld: Aligning Text and Embodied Environments for Interactive Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.03768) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falfworld\u002Falfworld?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Falfworld\u002Falfworld) |\n\n#### Dynamic Environment (Model-based)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-06 | `SwS` | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08989) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMasterVito\u002FSwS?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMasterVito\u002FSwS) |\n| 2025-06 | `SPIRAL` | SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.24119) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fspiral-rl\u002Fspiral?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fspiral-rl\u002Fspiral) |\n| 2025-05 | `Absolute Zero` | Absolute Zero: Reinforced Self-play Reasoning with Zero Data | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.03335) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLeapLabTHU\u002FAbsolute-Zero-Reasoner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLeapLabTHU\u002FAbsolute-Zero-Reasoner) |\n| 2025-04 | `TextArena` | TextArena | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11442) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLeonGuertler\u002FTextArena?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLeonGuertler\u002FTextArena) |\n| 2025-03 | `SWEET-RL` | SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.15478) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Fsweet_rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsweet_rl) |\n| - | 
`Genie 3` | Genie 3: A new frontier for world models | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fdeepmind.google\u002Fdiscover\u002Fblog\u002Fgenie-3-a-new-frontier-for-world-models\u002F) | - |\n\n#### Dynamic Environment (Ensemble-based)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `InternBootcamp` | InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.08636) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternBootcamp?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternBootcamp) |\n| - | `SYNTHETIC-2` | SYNTHETIC-2 Release: Four Million Collaboratively Generated Reasoning Traces | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fwww.primeintellect.ai\u002Fblog\u002Fsynthetic-2-release#synthetic-2-dataset) | - |\n\n#### RL Infrastructure (Primary)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-06 | `ROLL` | Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06122) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falibaba\u002FROLL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL) |\n| 2025-05 | `AReaL` | AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24298) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinclusionAI\u002FAReaL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FAReaL) |\n| 2024-09 | `veRL` | HybridFlow: A Flexible and Efficient RLHF Framework | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.19256v2) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvolcengine\u002Fverl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) |\n| 2024-05 | `OpenRLHF` | OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.11143) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenRLHF\u002FOpenRLHF?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF) |\n| - | `TRL` | Transformer Reinforcement Learning | - | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuggingface\u002Ftrl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl) |\n| - | `NeMo-RL` | Nemo RL: A Scalable and Efficient Post-Training Library | - | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA-NeMo\u002FRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FRL) |\n| - | `slime` | slime: An SGLang-Native Post-Training Framework for RL Scaling | - | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002Fslime?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime) |\n| - | `RLinf` | RLinf: Reinforcement Learning Infrastructure for Agentic AI | - | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRLinf\u002FRLinf?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRLinf\u002FRLinf) |\n\n#### RL Infrastructure (Secondary)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `RL-Factory` | RLFactory: A Plug-and-Play Reinforcement Learning Post-Training Framework for LLM Multi-Turn Tool-Use | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06980) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSimple-Efficient\u002FRL-Factory?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSimple-Efficient\u002FRL-Factory) |\n| 2025-09 | `verl-tool` | VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01055) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTIGER-AI-Lab\u002Fverl-tool?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTIGER-AI-Lab\u002Fverl-tool) |\n| 2025-09 | `dLLM-RL` | Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06949) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGen-Verse\u002FdLLM-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FdLLM-RL) |\n| 2025-08 | `agent-lightning` | Agent Lightning: Train ANY AI Agents with Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.03680) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fagent-lightning?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fagent-lightning) |\n| 2025-05 | `verl-agent` | Group-in-Group Policy Optimization for LLM Agent Training | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10978) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FlangfengQ\u002Fverl-agent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FlangfengQ\u002Fverl-agent) |\n| 2025-04 | `VLM-R1` | VLM-R1: A stable and generalizable R1-style Large Vision-Language Model | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07615) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fom-ai-lab\u002FVLM-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fom-ai-lab\u002FVLM-R1) |\n| - | `rllm` | rLLM: A Framework for Post-Training Language Agents | - | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fagentica-project\u002Frllm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fagentica-project\u002Frllm) |\n| - | `EasyR1` | EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework | - | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhiyouga\u002FEasyR1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FEasyR1) |\n| - | `verifiers` | Verifiers: Reinforcement Learning with LLMs in Verifiable Environments | - | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwillccbb\u002Fverifiers?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwillccbb\u002Fverifiers) |\n| - | `prime-rl` | PRIME-RL: Decentralized RL Training at Scale | - | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPrimeIntellect-ai\u002Fprime-rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPrimeIntellect-ai\u002Fprime-rl) |\n| - | `MARTI` | A Framework for LLM-based Multi-Agent Reinforced Training and Inference | - | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FMARTI?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FMARTI) |\n\n### Applications\n#### Coding Agent\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | - | Reinforcement Learning for Machine Learning Engineering Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01684) | - |\n| 2025-09 | - | Advancing SLM Tool-Use Capability using Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.04518) | - |\n| 2025-09 | `SimpleTIR` | SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.02479) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fltzheng\u002FSimpleTIR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fltzheng\u002FSimpleTIR) |\n| 2025-09 | - | The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.09677) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flong-horizon-execution\u002Fmeasuring-execution?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flong-horizon-execution\u002Fmeasuring-execution) |\n| 2025-08 | `GLM-4.5` | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.06471) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzai-org\u002FGLM-4.5?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-4.5) |\n| 2025-08 | `FormaRL` | FormaRL: Enhancing Autoformalization with no Labeled Data | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18914) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUNLP-MT\u002FFormaRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTHUNLP-MT\u002FFormaRL) |\n| 2025-08 | `RLTR` | Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.19598) | - |\n| 2025-07 | `ARPO` | Agentic Reinforced Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.19849) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdongguanting\u002FARPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdongguanting\u002FARPO) |\n| 2025-07 | `Kimi K2` | Kimi K2: Open Agentic Intelligence | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.20534) | - |\n| 2025-07 | `AutoTIR` | AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21836) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fweiyifan1023\u002FAutoTIR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fweiyifan1023\u002FAutoTIR) |\n| 2025-06 | `CoRT` | CoRT: Code-integrated Reasoning within Thinking | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.09820) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChengpengLi1003\u002FCoRT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChengpengLi1003\u002FCoRT) |\n| 2025-05 | `EvoScale` | Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.23604) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsatori-reasoning\u002FSatori-SWE?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsatori-reasoning\u002FSatori-SWE) |\n| 2025-03 | `ToRL` | ToRL: Scaling Tool-Integrated RL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23383) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGAIR-NLP\u002FToRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FToRL) |\n| 2025-02 | `SWE-RL` | SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.18449) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Fswe-rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fswe-rl) |\n| - | `Qwen3-Coder` | Qwen3-Coder: Agentic Coding in the World. | - | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3-Coder?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-Coder) |\n\n#### Search Agent\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `SSRL` | SSRL: Self-Search Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10874) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FSSRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FSSRL) |\n| 2025-07 | `WebSailor` | WebSailor: Navigating Super-human Reasoning for Web Agent | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.02592) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FWebAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FWebAgent) |\n| 2025-07 | `WebShaper` | WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.15061) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FWebAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FWebAgent) |\n| 2025-05 | `ZeroSearch` | ZeroSearch: Incentivize the Search Capability of LLMs without Searching | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.04588) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FZeroSearch?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FZeroSearch) |\n| 2025-05 | `SEM` | SEM: Reinforcement Learning for Search-Efficient Large Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07903) | - |\n| 2025-05 | `S3` | s3: You Don't Need That Much Data to Train a Search Agent via RL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14146) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpat-jj\u002Fs3?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fpat-jj\u002Fs3) |\n| 2025-05 | `StepSearch` | StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15107) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZillwang\u002FStepSearch?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FZillwang\u002FStepSearch) |\n| 2025-05 | `R1-Searcher++` | R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17005) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FR1-Searcher-plus?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FR1-Searcher-plus) |\n| 2025-04 | `ReZero` | ReZero: Enhancing LLM search ability by trying one-more-time | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11001) | - |\n| 2025-03 | `DeepRetrieval` | DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00223) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpat-jj\u002FDeepRetrieval?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fpat-jj\u002FDeepRetrieval) |\n| 2025-03 | `Search-R1` | Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09516) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPeterGriffinJin\u002FSearch-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPeterGriffinJin\u002FSearch-R1) |\n| 2025-03 | `R1-Searcher` | R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.05592) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FR1-Searcher?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FR1-Searcher) |
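\n\nMost search-agent entries above share one recipe: the policy interleaves free-form reasoning with explicit search calls, and training uses a sparse, verifiable reward on the final answer. The sketch below is a minimal schematic of that loop, not the implementation of any listed paper; \`policy.generate\`, \`search_engine\`, and the \u003Csearch>\u002F\u003Canswer> tag format are illustrative assumptions.\n\n\`\`\`python\n# Schematic search-agent RL rollout (Search-R1 \u002F R1-Searcher style); all names are hypothetical.\nimport re\n\ndef rollout(policy, search_engine, question, max_turns=4):\n    # The model alternates reasoning text with \u003Csearch>query\u003C\u002Fsearch> tool calls.\n    context = question\n    for _ in range(max_turns):\n        completion = policy.generate(context)  # assumed text-in, text-out interface\n        context += completion\n        m = re.search(r\"\u003Csearch>(.*?)\u003C\u002Fsearch>\", completion, re.S)\n        if m is None:\n            break  # no tool call: the model has committed to a final answer\n        docs = search_engine(m.group(1).strip())  # retrieved evidence goes back into the context\n        context += \"\u003Cinformation>\" + docs + \"\u003C\u002Finformation>\"\n    return context\n\ndef exact_match(pred, gold):\n    return pred.strip().lower() == gold.strip().lower()\n\ndef outcome_reward(trajectory, gold_answer):\n    # Sparse verifiable reward: 1.0 only if the final \u003Canswer> block matches the gold answer.\n    m = re.search(r\"\u003Canswer>(.*?)\u003C\u002Fanswer>\", trajectory, re.S)\n    return float(m is not None and exact_match(m.group(1), gold_answer))\n\`\`\`\n\nEntries such as ZeroSearch and SSRL keep a loop of this shape but swap the live engine for simulated or model-internal retrieval, which changes \`search_engine\` rather than the reward.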
\n\n#### Browser-Use Agent\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-05 | `WebAgent-R1` | WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16421) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fweizhepei\u002FWebAgent-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fweizhepei\u002FWebAgent-R1) |\n| 2025-05 | `WebDancer` | WebDancer: Towards Autonomous Information Seeking Agency | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22648) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FWebAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FWebAgent) |\n| 2025-04 | `DeepResearcher` | DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.03160) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGAIR-NLP\u002FDeepResearcher?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FDeepResearcher) |\n| 2024-11 | `WebRL` | WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02337) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FWebRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FWebRL) |\n| 2021-12 | `WebGPT` | WebGPT: Browser-assisted question-answering with human feedback | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.09332) | - |\n\n#### DeepResearch Agent\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `SFR-DeepResearch` | SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.06283) | - |\n| 2025-09 | `DeepDive` | DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.10446) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FDeepDive?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FDeepDive) |\n| 2025-08 | `Webwatcher` | Webwatcher: Breaking new frontiers of vision-language deep research agent | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.05748) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FWebAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FWebAgent) |\n| 2025-08 | `ASearcher` | Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.07976) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinclusionAI\u002FASearcher?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FASearcher) |\n| 2025-08 | `Atom-searcher` | Atom-searcher: Enhancing agentic deep research via fine-grained atomic thought reward | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.12800) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fantgroup\u002FResearch-Venus?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fantgroup\u002FResearch-Venus) |\n| 2025-08 | `MedResearcher-R1` | MedResearcher-R1: Expert-level medical deep researcher via a knowledge-informed trajectory synthesis framework | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.14880) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAQ-MedAI\u002FMedResearcher-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAQ-MedAI\u002FMedResearcher-R1) |\n| 2025-06 | `Jan-nano` | Jan-nano Technical Report | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.22760) | - |\n| 2025-04 | `WebThinker` | WebThinker: Empowering Large Reasoning Models with Deep Research Capability | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.21776) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsunnynexus\u002FWebThinker?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsunnynexus\u002FWebThinker) |\n| - | `Kimi-Researcher` | Kimi-Researcher: End-to-End RL Training for Emerging Agentic Capabilities | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fmoonshotai.github.io\u002FKimi-Researcher\u002F) | - |\n| - | `Mirothinker` | Mirothinker: An open-source agentic model series trained for deep research and complex, long-horizon problem solving | 
[![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fmiromind.ai\u002Fblog\u002Fmiromind-open-deep-research) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiroMindAI\u002FMiroThinker?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMiroMindAI\u002FMiroThinker) |\n\n#### GUI&Computer Agent\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `UI-TARS 2` | UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02544) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FUI-TARS-desktop?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FUI-TARS-desktop) |\n| 2025-08 | `GUI-RC` | Test-Time Reinforcement Learning for GUI Grounding via Region Consistency | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.05615) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzju-real\u002Fgui-rcpo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzju-real\u002Fgui-rcpo) |\n| 2025-08 | `Os-r1` | OS-R1: Agentic Operating System Kernel Tuning with Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12551) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLHY-24\u002FOS-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLHY-24\u002FOS-R1) |\n| 2025-08 | `ComputerRL` | ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.14040) | - |\n| 2025-08 | `Mobile-Agent-v3` | Mobile-Agent-v3: Fundamental Agents for GUI Automation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.15144) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FMobileAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FMobileAgent) |\n| 2025-08 | `SWIRL` | SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.20018) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLqf-HFNJU\u002FSWIRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLqf-HFNJU\u002FSWIRL) |\n| 2025-08 | `InquireMobile` | InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.19679) | - |\n| 2025-07 | `MobileGUI-RL` | MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.05720) | - |\n| 2025-06 | `GUI-Critic-R1` | Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04614) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FMobileAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FMobileAgent) |\n| 2025-06 | `GUI-Reflection` | GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08012) | - |\n| 2025-06 | `Mobile-R1` | Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.20332) | - |\n| 2025-05 | `UIShift` | UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12493) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUbiquitousLearning\u002FUIShift?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FUbiquitousLearning\u002FUIShift) |\n| 2025-05 | `GUI-G1` | GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15810) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuqi-Zhou\u002FGUI-G1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FYuqi-Zhou\u002FGUI-G1) |\n| 2025-05 | `ARPO` | ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16282) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FARPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FARPO) |\n| 2025-05 | `ZeroGUI` | ZeroGUI: Automating Online GUI Learning at Zero Human Cost | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23762) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FZeroGUI?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FZeroGUI) |\n| 2025-04 | `GUI-R1` | GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10458) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fritzz-ai\u002FGUI-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fritzz-ai\u002FGUI-R1) |\n| 2025-03 | `UI-R1` | UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.21620) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flll6gg\u002FUI-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flll6gg\u002FUI-R1) |\n| 2025-01 | `UI-TARS` | UI-TARS: Pioneering Automated GUI Interaction with Native Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12326) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002Fui-tars?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Fui-tars) |
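\n\nSeveral GUI entries above (UI-R1, GUI-R1, GUI-G1) train with R1-style rule-based rewards for grounding: the model emits a click location, and the verifier only checks whether it lands inside the annotated target region. A minimal sketch of such a verifier follows; the \u003Canswer>(x, y)\u003C\u002Fanswer> output format is an illustrative assumption, not any paper's exact spec.\n\n\`\`\`python\n# Minimal rule-based grounding reward in the spirit of UI-R1 \u002F GUI-R1; the format is assumed.\nimport re\n\ndef click_reward(completion, bbox):\n    # bbox: ground-truth target region as (x1, y1, x2, y2) pixel coordinates.\n    m = re.search(r\"\u003Canswer>.*?([0-9]+)[^0-9]+([0-9]+).*?\u003C\u002Fanswer>\", completion, re.S)\n    if m is None:\n        return 0.0  # unparseable output earns nothing, which also shapes the format\n    x, y = int(m.group(1)), int(m.group(2))\n    x1, y1, x2, y2 = bbox\n    return 1.0 if x1 \u003C= x \u003C= x2 and y1 \u003C= y \u003C= y2 else 0.0\n\`\`\`\n\nThe sampling and advantage estimation around this check are standard GRPO\u002FPPO machinery; the papers differ mainly in how the binary signal is densified (e.g., region consistency in GUI-RC).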
\n\n#### Recommendation Agent\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-07 | `Shop-R1` | Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.17842) | - |\n| 2025-03 | `Rec-R1` | Rec-R1: Bridging LLMs and Recommendation Systems via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24289) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flinjc16\u002FRec-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flinjc16\u002FRec-R1) |\n\n#### Agent (Others)\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-07 | `OpenTable-R1` | OpenTable-R1: A Reinforcement Learning Augmented Tool Agent for Open-Domain Table Question Answering | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.03018) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTabibitoQZP\u002FOpenTableR1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTabibitoQZP\u002FOpenTableR1) |\n| 2025-07 | `LaViPlan` | LaViPlan: Language-Guided Visual Path Planning with RLVR | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.12911) | - |\n| 2025-06 | `Drive-R1` | Drive-R1: Bridging Reasoning and Planning in 
VLMs for Autonomous Driving with Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.18234) | - |\n| - | `EPO` | EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.22576) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWujiangXu\u002FEPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FWujiangXu\u002FEPO) |\n\n#### Code Generation\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `Proof2Silicon` | Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.06239) | - |\n| 2025-09 | `AR$^2$` | AR$^2$: Adversarial Reinforcement Learning for Abstract Reasoning in Large Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.03537) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhhhuang\u002FARAR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhhhuang\u002FARAR) |\n| 2025-09 | `Dream-Coder` | Dream-Coder 7B: An Open Diffusion Language Model for Code | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.01142) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDreamLM\u002FDream-Coder?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FDreamLM\u002FDream-Coder) |\n| 2025-08 | `MSRL` | Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.13587) | - |\n| 2025-07 | `CogniSQL-R1-Zero` | CogniSQL-R1-Zero: Lightweight Reinforced Reasoning for Efficient SQL Generation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.06013) | - |\n| 2025-07 | `Leanabell-Prover-V2` | Leanabell-Prover-V2: Verifier-integrated Reasoning for Formal Theorem Proving via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.08649) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLeanabell-LM\u002FLeanabell-Prover-V2?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLeanabell-LM\u002FLeanabell-Prover-V2) |\n| 2025-07 | `StepFun-Prover` | StepFun-Prover Preview: Let's Think and Verify Step by Step | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.20199) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fstepfun-ai\u002FStepFun-Prover-Preview?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStepFun-Prover-Preview) |\n| 2025-06 | `MedAgentGym` | MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.04405) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwshi83\u002FMedAgentGym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwshi83\u002FMedAgentGym) |\n| 2025-05 | `Fortune` | Fortune: Formula-Driven Reinforcement Learning for Symbolic Table Reasoning in Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23667) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Ffortune?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ffortune) |\n| 2025-05 | `VeriReason` | VeriReason: Reinforcement Learning with Testbench Feedback for Reasoning-Enhanced Verilog Generation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.11849) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNellyW8\u002FVeriReason?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNellyW8\u002FVeriReason) |\n| 2025-05 | `ReEX-SQL` | ReEx-SQL: Reasoning with Execution-Aware Reinforcement Learning for Text-to-SQL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12768) | - |\n| 2025-05 | `AceReason-Nemotron` | AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.16400) | - |\n| 2025-05 | `SkyWork OR1` | Skywork Open Reasoner 1 Technical Report | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.22312) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-OR1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-OR1) |\n| 2025-05 | `CodeV-R1` | CodeV-R1: Reasoning-Enhanced Verilog Generation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.24183) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fiprc-dip\u002FCodeV-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fiprc-dip\u002FCodeV-R1) |\n| 2025-05 | `AReaL` | 
AREAL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.24298) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinclusionAI\u002FAReaL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FAReaL) |\n| 2025-04 | `SQL-R1` | SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.08600) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDataArcTech\u002FSQL-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FDataArcTech\u002FSQL-R1) |\n| 2025-04 | `Kimina-Prover` | Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.11354) | - |\n| 2025-04 | `DeepSeek-Prover-V2` | DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.21801) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-Prover-V2?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-Prover-V2) |\n| 2025-03 | `Reasoning-SQL` | Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.23157) | - |\n| - | `code-r1` | Code-R1: Reproducing R1 for Code with Reliable Rewards | - | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fganler\u002Fcode-r1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fganler\u002Fcode-r1) |\n| - | `Open-R1` | Open-R1: a fully open reproduction of DeepSeek-R1 | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fopen-r1) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuggingface\u002Fopen-r1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fopen-r1) |\n| - | `DeepCoder` | Deepcoder: A fully open-source 14b coder at o3-mini level | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fpretty-radio-b75.notion.site\u002FDeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fagentica-project\u002Frllm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fagentica-project\u002Frllm) |\n\n#### Software Engineering\n\n| Date | Name | Title | Paper | Github 
|\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `UTRL` | Learning to Generate Unit Test via Adversarial Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.21107) | - |\n| 2025-07 | `RePaCA` | RePaCA: Leveraging Reasoning Large Language Models for Static Automated Patch Correctness Assessment | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.22580) | - |\n| 2025-07 | `Repair-R1` | Repair-R1: Better Test Before Repair | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.22853) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTomsawyerhu\u002FAPR-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTomsawyerhu\u002FAPR-RL) |\n| 2025-06 | `CURE` | Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.03136) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGen-Verse\u002FCURE?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FCURE) |\n| 2025-05 | `REAL` | Training Language Models to Generate Quality Code with Program Analysis Feedback | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.22704) | - |\n| 2025-05 | `Afterburner` | Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.23387) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FElfsong\u002FAfterburner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FElfsong\u002FAfterburner) |\n| 2024-09 | `RepoGenReflex` | RepoGenReflex: Enhancing Repository-Level Code Completion with Verbal Reinforcement and Retrieval-Augmented Generation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.13122) | - |\n| 2024-07 | `RLCoder` | RLCoder: Reinforcement Learning for Repository-Level Code Completion | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.19487) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDeepSoftwareAnalytics\u002FRLCoder?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FDeepSoftwareAnalytics\u002FRLCoder) |\n\n#### Multimodal Understanding\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `Vision-Zero` | Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25541) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwangqinsi1\u002FVision-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwangqinsi1\u002FVision-Zero) |\n| 2025-09 | `ReAd-R` | AdsQA: Towards Advertisement Video Understanding | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.08621) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FAdsQA?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAdsQA) |\n| 2025-09 | `Keye` | Kwai Keye-VL 1.5 Technical Report | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01563) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKwai-Keye\u002FKeye?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKwai-Keye\u002FKeye) |\n| 2025-08 | `Sifthinker` | Sifthinker: Spatially-aware image focus for visual reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.06259) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhangquanchen\u002FSIFThinker?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzhangquanchen\u002FSIFThinker) |\n| 2025-07 | `Long-RL` | Scaling rl to long videos | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.07966) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FLong-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FLong-RL) |\n| 2025-06 | `RefSpatial` | Roborefer: Towards spatial referring with reasoning in vision-language models for robotics | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04308) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZhoues\u002FRoboRefer?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FZhoues\u002FRoboRefer) |\n| 2025-06 | `Ego-R1` | Ego-R1: Chain-of-tool-thought for ultra-long egocentric video reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13654) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fegolife-ai\u002FEgo-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fegolife-ai\u002FEgo-R1) |\n| 2025-05 | `VerIPO` | VerIPO: Long Reasoning Video-R1 Model with Iterative Policy Optimization | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19000) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHITsz-TMG\u002FVerIPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FHITsz-TMG\u002FVerIPO) |\n| 2025-05 | `Openthinkimg` | Openthinkimg: Learning to think with images via visual tool reinforcement learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.08617) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhaochen0110\u002FOpenThinkIMG?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzhaochen0110\u002FOpenThinkIMG) |\n| 2025-05 | `Visual Planning` | Visual Planning: Let's think only with images | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11409) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyix8\u002FVisualPlanning?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyix8\u002FVisualPlanning) |\n| 2025-05 | `VideoRFT` | Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12434) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQiWang98\u002FVideoRFT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQiWang98\u002FVideoRFT) |\n| 2025-05 | `Deepeyes` | Deepeyes: Incentivizing \"thinking with images\" via reinforcement learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14362) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVisual-Agent\u002FDeepEyes?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FVisual-Agent\u002FDeepEyes) |\n| 2025-05 | `Visionary-R1` | Visionary-R1: Mitigating shortcuts in visual reasoning with reinforcement learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14677) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmaifoundations\u002FVisionary-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmaifoundations\u002FVisionary-R1) |\n| 2025-05 | `CoF` | Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15436) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxtong-zhang\u002FChain-of-Focus?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fxtong-zhang\u002FChain-of-Focus) |\n| 2025-05 | `GRIT` | GRIT: Teaching mllms to think with images | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15879) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Feric-ai-lab\u002FGRIT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Feric-ai-lab\u002FGRIT) |\n| 2025-05 | `Pixel Reasoner` | Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15966) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTIGER-AI-Lab\u002FPixel-Reasoner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTIGER-AI-Lab\u002FPixel-Reasoner) |\n| 2025-05 | - | Don’t look only once: Towards multimodal interactive reasoning with selective visual revisitation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.18842) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjun297\u002Fv1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fjun297\u002Fv1) |\n| 2025-05 | `Ground-R1` | Ground-R1: Incentivizing grounded visual reasoning via reinforcement learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20272) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzzzhhzzz\u002FGround-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzzzhhzzz\u002FGround-R1) |\n| 2025-05 | `TACO` | TACO: Think-answer consistency for optimized long-chain reasoning and efficient data learning via reinforcement learning in lvlms | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20777) | - |\n| 2025-05 | `Qwen-LA` | Qwen look again: Guiding vision-language reasoning models to re-attention visual information | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23558) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLiar406\u002FLook_Again?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLiar406\u002FLook_Again) |\n| 2025-05 | `TW-GRPO` | Reinforcing video reasoning with focused thinking | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24718) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flongmalongma\u002FTW-GRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flongmalongma\u002FTW-GRPO) |\n| 2025-05 | `Spatial-MLLM` | Spatial-MLLM: Boosting mllm capabilities in visual-based spatial intelligence | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.23747) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdiankun-wu\u002FSpatial-MLLM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdiankun-wu\u002FSpatial-MLLM) |\n| 2025-04 | `R1-Zero-VSI` | Improved visual-spatial reasoning via r1-zero-like training | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00883) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhijie-group\u002FR1-Zero-VSI?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzhijie-group\u002FR1-Zero-VSI) |\n| 2025-04 | `Spacer` | Spacer: Reinforcing mllms in video spatial reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.01805) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOuyangKun10\u002FSpaceR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOuyangKun10\u002FSpaceR) |\n| 2025-04 | `Videochat-R1` | Videochat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.06958) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FVideoChat-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FVideoChat-R1) |\n| 2025-04 | `VLM-R1` | VLM-R1: A stable and generalizable r1-style large vision-language model | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07615) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fom-ai-lab\u002FVLM-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fom-ai-lab\u002FVLM-R1) |\n| 2025-03 | `OpenVLThinker` | OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.17352) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyihedeng9\u002FOpenVLThinker?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyihedeng9\u002FOpenVLThinker) |\n| 2025-03 | `Visual-RFT` | Visual-RFT: Visual reinforcement fine-tuning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01785) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLiuziyu77\u002FVisual-RFT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT) |\n| 2025-03 | `Vision-R1` | Vision-R1: Incentivizing reasoning capability in multimodal large language models | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.06749) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOsilly\u002FVision-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOsilly\u002FVision-R1) |\n| 2025-03 | `VisRL` | VisRL: Intention-Driven Visual Perception via Reinforced Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07523) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhangquanchen\u002FVisRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzhangquanchen\u002FVisRL) |\n| 2025-03 | `Metaspatial` | Metaspatial: Reinforcing 3d spatial reasoning in vlms for the metaverse | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18470) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPzySeere\u002FMetaSpatial?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPzySeere\u002FMetaSpatial) |\n| 2025-03 | `Video-R1` | Video-R1: Reinforcing video reasoning in mllms | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.21776) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftulerfeng\u002FVideo-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ftulerfeng\u002FVideo-R1) |
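\n\nAcross this table, the R1-style recipes (Vision-R1, Video-R1, VLM-R1, Visual-RFT) typically combine a rule-based accuracy reward with a small format reward that enforces a \u003Cthink>...\u003C\u002Fthink>\u003Canswer>...\u003C\u002Fanswer> template. A minimal sketch of that composite check; the tag template and weights below are common defaults assumed for illustration, not any single paper's specification.\n\n\`\`\`python\n# Composite rule-based reward typical of R1-style multimodal training; weights are assumptions.\nimport re\n\nTHINK_ANSWER = re.compile(r\"\u003Cthink>.*?\u003C\u002Fthink>.*?\u003Canswer>(.*?)\u003C\u002Fanswer>\", re.S)\n\ndef r1_style_reward(completion, gold, w_format=0.1, w_acc=1.0):\n    m = THINK_ANSWER.search(completion)\n    if m is None:\n        return 0.0  # template violated: no credit\n    accuracy = float(m.group(1).strip().lower() == gold.strip().lower())\n    return w_format + w_acc * accuracy  # small format bonus, main signal from the verified answer\n\`\`\`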
\n\n#### Multimodal Generation\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `IGPO` | Inpainting-Guided Policy Optimization for Diffusion Large Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.10396) | - |\n| 2025-08 | `Qwen-Image` | Qwen-Image Technical Report | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02324) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen-Image?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-Image) |\n| 2025-08 | `TempFlow-GRPO` | TempFlow-GRPO: When timing matters for grpo in flow models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.04324) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FShredded-Pork\u002FTempFlow-GRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FShredded-Pork\u002FTempFlow-GRPO) |\n| 2025-07 | `MixGRPO` | MixGRPO: Unlocking flow-based grpo efficiency with mixed ode-sde | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21802) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencent-Hunyuan\u002FMixGRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTencent-Hunyuan\u002FMixGRPO) |\n| 2025-06 | `FocusDiff` | Focusdiff: Advancing fine-grained text-image alignment for autoregressive visual generation through rl | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05501) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwendell0218\u002FFocusDiff?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwendell0218\u002FFocusDiff) |\n| 2025-06 | `SUDER` | Reinforcing multimodal understanding and generation with dual self-rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07963) | - |\n| 2025-05 | `T2I-R1` | T2I-R1: Reinforcing image generation with collaborative semantic-level and token-level cot | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00703) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCaraJ7\u002FT2I-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FCaraJ7\u002FT2I-R1) |\n| 2025-05 | `Flow-GRPO` | Flow-GRPO: Training flow matching models via online rl | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05470) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyifan123\u002Fflow_grpo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo) |\n| 2025-05 | `DanceGRPO` | DanceGRPO: Unleashing grpo on visual generation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07818) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FXueZeyue\u002FDanceGRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FXueZeyue\u002FDanceGRPO) |\n| 2025-05 | `GoT-R1` | GoT-R1: Unleashing reasoning capability of mllm for visual generation with reinforcement learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17022) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgogoduan\u002FGoT-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fgogoduan\u002FGoT-R1) |\n| 2025-05 | `ULM-R1` | 
Co-Reinforcement learning for unified multimodal understanding and generation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17534) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmm-vl\u002FULM-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmm-vl\u002FULM-R1) |\n| 2025-05 | `RePrompt` | Reprompt: Reasoning-augmented reprompting for text-to-image generation via reinforcement learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17540) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDKI_LLM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDKI_LLM) |\n| 2025-05 | `InfLVG` | InfLVG: Reinforce inference-time consistent long video generation with grpo | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17574) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMAPLE-AIGC\u002FInfLVG?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMAPLE-AIGC\u002FInfLVG) |\n| 2025-05 | `Reasongen-R1` | Reasongen-R1: Cot for autoregressive image generation models through sft and rl | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24875) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFranklin-Zhang0\u002FReasonGen-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FFranklin-Zhang0\u002FReasonGen-R1) |\n| 2025-04 | `PhysAR` | Reasoning physical video generation with diffusion timestep tokens via reinforcement learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15932) | - |
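\n\nMany entries above adapt GRPO to visual generators (Flow-GRPO, DanceGRPO, MixGRPO, TempFlow-GRPO). The shared core is critic-free: sample a group of G outputs per prompt, score each with a reward model or verifier, and normalize rewards within the group. A minimal numpy sketch of that group-relative advantage (the surrounding clipped policy-gradient update is omitted):\n\n\`\`\`python\n# Group-relative advantage at the heart of GRPO; a sketch, not any listed paper's implementation.\nimport numpy as np\n\ndef group_relative_advantages(rewards, eps=1e-6):\n    # rewards: shape (G,) scores for G rollouts sampled from the same prompt.\n    r = np.asarray(rewards, dtype=np.float64)\n    return (r - r.mean()) \u002F (r.std() + eps)  # A_i = (r_i - mean) \u002F (std + eps)\n\n# Example: four rollouts scored by a binary verifier.\nprint(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # approximately [1, -1, -1, 1]\n\`\`\`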
\n\n#### Robotics Tasks\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `SimpleVLA-RL` | SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.09674) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FSimpleVLA-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FSimpleVLA-RL) |\n| 2025-06 | `TGRPO` | TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08440) | - |\n| 2025-05 | `ReinboT` | ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07395) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCOST-97\u002FreinboT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FCOST-97\u002FreinboT) |\n| 2025-05 | `RIPT-VLA` | Interactive Post-Training for Vision-Language-Action Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.17016) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAriostgx\u002Fript-vla?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAriostgx\u002Fript-vla) |\n| 2025-05 | `VLA-RL` | VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.18719) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGuanxingLu\u002Fvlarl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGuanxingLu\u002Fvlarl) |\n| 2025-05 | `RFTF` | RFTF: Reinforcement Fine-tuning for Embodied Agents with Temporal Feedback | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19767) | - |\n| 2025-05 | `VLA Generalization` | What can rl bring to vla generalization? an empirical study | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19789) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgen-robot\u002FRL4VLA?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fgen-robot\u002FRL4VLA) |\n| 2025-02 | `ConRFT` | ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05450) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcccedric\u002Fconrft?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcccedric\u002Fconrft) |\n| 2024-11 | `GRAPE` | GRAPE: Generalizing Robot Policy via Preference Alignment | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.19309) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Faiming-lab\u002Fgrape?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Faiming-lab\u002Fgrape) |\n| - | `RLinf` | RLinf: Reinforcement Learning Infrastructure for Agentic AI | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Frlinf.readthedocs.io\u002Fen\u002Flatest\u002F) | - |\n| - | `EPO` | EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.22576) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWujiangXu\u002FEPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FWujiangXu\u002FEPO) |\n\n#### Multi-Agent Systems\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-10 | `AgentFlow` | In-the-Flow Agentic System Optimization for Effective Planning and Tool Use | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2510.05592) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flupantech\u002FAgentFlow?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flupantech\u002FAgentFlow) |\n| 2025-09 | `SoftRankPO` | Learning to Deliberate: Meta-policy Collaboration for Agentic LLMs with Multi-agent Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03817) | - |\n| 2025-09 | `BFS-Prover-V2` | Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06493) | - |\n| 2025-08 | `MAGRPO` | LLM Collaboration With Multi-Agent Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.04652) | - |\n| 2025-06 | `AlphaEvolve` | AlphaEvolve: A coding agent for scientific and algorithmic discovery | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.13131) | - |\n| 2025-06 | `JoyAgents-R1` | JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.19846) | - |\n| 2025-03 | `ReMA` | ReMA: Learning to Meta-think for LLMs with Multi-agent Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.09501) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fziyuwan\u002FReMA-public?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fziyuwan\u002FReMA-public) |\n| 2025-02 | `CTRL` | Teaching Language Models to Critique via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.03492) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHKUNLP\u002Fcritic-rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FHKUNLP\u002Fcritic-rl) |\n| 2025-02 | `MAPoRL` | MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.18439) 
| [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fchanwoo-park-official\u002FMAPoRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fchanwoo-park-official\u002FMAPoRL) |\n| 2023-11 | `LLaMAC` | Controlling large language model-based agents for large-scale decision-making: An actor-critic approach | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.13884) | - |\n\n#### Scientific Tasks\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `Baichuan-M2` | Baichuan-M2: Scaling Medical Capability with Large Verifier System | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02208) | - |\n| 2025-08 | `CX-Mind` | CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.03733) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWenjieLisjtu\u002FCX-Mind?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FWenjieLisjtu\u002FCX-Mind) |\n| 2025-08 | `MORE-CLEAR` | MORE-CLEAR: Multimodal Offline Reinforcement learning for Clinical notes Leveraged Enhanced State Representation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.07681) | - |\n| 2025-08 | `ARMed` | Breaking Reward Collapse: Adaptive Reinforcement for Open-ended Medical Reasoning with Enhanced Semantic Discrimination | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12957) | - |\n| 2025-08 | `ProMed` | ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.13514) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhxxding\u002FProMed?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhxxding\u002FProMed) |\n| 2025-08 | `OwkinZero` | OwkinZero: Accelerating Biological Discovery with AI | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16315) | - |\n| 2025-08 | `MolReasoner` | MolReasoner: Toward Effective and Interpretable Reasoning for Molecular LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.02066) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F545487677\u002FMolReasoner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002F545487677\u002FMolReasoner) |\n| 2025-08 | `MedGR$^2$` | MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.20549) | - |\n| 2025-07 | `MedGround-R1` | MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.02994) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbio-mlhui\u002FMedGround-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fbio-mlhui\u002FMedGround-R1) |\n| 2025-07 | `MedGemma` | MedGemma Technical Report | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.05201) | - |\n| 2025-06 | `MMedAgent-RL` | MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.00555v2) | - |\n| 2025-06 | `Cell-o1` | Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02911) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fncbi-nlp\u002Fcell-o1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fncbi-nlp\u002Fcell-o1) |\n| 2025-06 | `MedAgentGym` | MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04405v1) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwshi83\u002FMedAgentGym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwshi83\u002FMedAgentGym) |\n| 2025-06 | `Med-U1` | Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.12307) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMonncyann\u002FMed-U1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMonncyann\u002FMed-U1) |\n| 2025-06 | `MedVIE` | Efficient Medical VIE via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13363) | - |\n| 2025-06 | `LA-CDM` | Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13474) | - |\n| 2025-06 | `ether0` | Training a Scientific Reasoning Model for Chemistry | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17238) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFuture-House\u002Fether0?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FFuture-House\u002Fether0) |\n| 2025-06 | `Gazal-R1` | Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.21594) | - |\n| 2025-05 | `DRG-Sapphire` | Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21908) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhanyin88\u002FDRG-Sapphire?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhanyin88\u002FDRG-Sapphire) |\n| 2025-05 | `BioReason` | BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23579) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbowang-lab\u002FBioReason?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fbowang-lab\u002FBioReason) |\n| 2025-05 | `EHRMIND` | Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24105) | - |\n| 2025-04 | `Open-Medical-R1` | Open-Medical-R1: How to Choose Data for RLVR Training at Medicine Domain | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.13950) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQsingle\u002Fopen-medical-r1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQsingle\u002Fopen-medical-r1) |\n| 2025-04 | `ChestX-Reasoner` | ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20930v1) | - |\n| 2025-04 | `BoxMed-RL` | Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.18453) | - |\n| 2025-03 | `PPME` | Improving Interactive Diagnostic Ability of a Large Language Model Agent Through Clinical Experience Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.16463) | - |\n| 2025-03 | `DOLA` | 
Autonomous Radiotherapy Treatment Planning Using DOLA: A Privacy-Preserving, LLM-Based Optimization Agent | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.17553) | - |\n| 2025-02 | `Baichuan-M1` | Baichuan-M1: Pushing the Medical Capability of Large Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12671) | - |\n| 2025-02 | `MedVLM-R1` | MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19634) | - |\n| 2025-02 | `Med-RLVR` | Med-RLVR: Emerging Medical Reasoning from a 3B base model via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19655) | - |\n| 2025-01 | `MedXpertQA` | MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18362) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FMedXpertQA?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FMedXpertQA) |\n| 2024-12 | `HuatuoGPT-o1` | HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18925) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFreedomIntelligence\u002FHuatuoGPT-o1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FHuatuoGPT-o1) |\n| - | `Pro-1` | Pro-1 | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fmichaelhla.com\u002Fblog\u002Fpro1.html) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmichaelhla\u002Fpro-1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmichaelhla\u002Fpro-1) |\n| - | `rbio` | rbio1 - training scientific reasoning LLMs with biological world models as soft verifiers | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2025.08.18.670981v3) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fczi-ai\u002Frbio?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fczi-ai\u002Frbio) |\n| - | `EPO` | EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.22576) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWujiangXu\u002FEPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FWujiangXu\u002FEPO) |\n\n## 🌟 Acknowledgment\n\nThis survey is extended and refined from the original **Awesome RL Reasoning Recipes** repo. We are deeply grateful to all contributors for their efforts, and we sincerely thank everyone for their interest in **Awesome RL Reasoning Recipes**. The contents of the previous repository are available [here](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-for-LRMs\u002Ftree\u002FTripleR).\n\n\n## ✨ Star History\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTsinghuaC3I_Awesome-RL-for-LRMs_readme_fbe8f31f32d2.png)](https:\u002F\u002Fwww.star-history.com\u002F#TsinghuaC3I\u002FAwesome-RL-for-LRMs&Date)\n","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTsinghuaC3I_Awesome-RL-for-LRMs_readme_e10d984f3e5a.png\" style=\"width: 70%;\"\u002F>\n\n## 大型推理模型的强化学习综述\n\n[![Awesome](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAwesome-0066CC?style=for-the-badge&logo=awesome-lists&logoColor=white)](https:\u002F\u002Fgithub.com\u002Fsindresorhus\u002Fawesome) [![Survey](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.08827)  [![Github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAwesome--RL--for--LRMs-000000?style=for-the-badge&logo=github&logoColor=white)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-Reasoning-Recipes)  [![HF Papers](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHF--Paper-%23FFD14D?style=for-the-badge&logo=huggingface&logoColor=black)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2509.08827)  [![Twitter](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTwitter-%23000000.svg?style=for-the-badge&logo=x&logoColor=white)](https:\u002F\u002Fx.com\u002FOkhayIea\u002Fstatus\u002F1965989894163235111)\n\n\u003C\u002Fdiv>\n\n> 我们欢迎各位就尚未提及的相关工作提出议题，我们将在下一次更新中予以回应！\n\n\n## 🎉 新闻\n\n- **[2025-11-05]** 🔥 很高兴发布关于**智能体记忆**的论文列表，涵盖上下文管理和经验学习方面的突破性进展，这些进展为自我改进的人工智能智能体提供了强大支持。请查看：[GitHub](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-Memory-for-Agents)\n- **[2025-10]** 🎉 荣幸受邀在[BAAI](https:\u002F\u002Fevent.baai.ac.cn\u002Factivities\u002F961)、[Qingke Talk](https:\u002F\u002Fqingkeai.online\u002Farchives\u002F0h3Cm8Bi)以及腾讯Wiztalk上发表演讲！以下是演讲幻灯片：[Slides](Survey@RL4LRM-v1.pdf)。\n- **[2025-09-18]** 🎉 我们已更新了综述中按类别结构整理的完整论文列表！\n- **[2025-09-12]** 🎉 我们的综述在🤗 [Hugging Face Daily Papers](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2509.08827)上被评为**今日最佳论文#1**！\n- **[2025-09-11]** 🔥 很高兴发布我们的**大型推理模型强化学习综述**！我们很快将用新的分类结构更新完整的论文列表。请查看：[论文](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2509.08827)。\n- **[2025-08-15]** 🔥 推出**SSRL**：一种无需依赖外部搜索引擎的智能体搜索强化学习研究。请查看：[GitHub](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FSSRL)和[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10874)。\n- **[2025-05-27]** 🔥 推出**MARTI**：一个基于大语言模型的多智能体强化训练与推理框架。请查看：[GitHub](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FMARTI)。\n- **[2025-04-23]** 🔥 推出**TTRL**：一种开源解决方案，用于在没有真实标签的数据（尤其是测试数据）上进行在线强化学习。请查看：[GitHub](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FTTRL)和[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16084)。\n- **[2025-03-20]** 🔥 我们很高兴推出关于推理模型强化学习的论文和项目合集！\n\n\n## 🎈 引用\n\n如果您觉得本综述有所帮助，请引用我们的工作：\n\n```bibtex\n@article{zhang2025survey,\n  title={A 
survey of reinforcement learning for large reasoning models},\n  author={Zhang, Kaiyan and Zuo, Yuxin and He, Bingxiang and Sun, Youbang and Liu, Runze and Jiang, Che and Fan, Yuchen and Tian, Kai and Jia, Guoli and Li, Pengfei and others},\n  journal={arXiv preprint arXiv:2509.08827},\n  year={2025}\n}\n```\n\n## 📖 目录\n- [大型推理模型的强化学习综述](#a-survey-of-reinforcement-learning-for-large-reasoning-models)\n- [🎉 新闻](#-news)\n- [🎈 引用](#-citation)\n- [📖 目录](#-contents)\n- [🗺️ 概述](#️-overview)\n- [📄 论文列表](#-paper-list)\n  - [前沿模型](#frontier-models)\n  - [奖励设计](#reward-design)\n    - [生成式奖励](#generative-rewards)\n    - [密集奖励](#dense-rewards)\n    - [无监督奖励](#unsupervised-rewards)\n    - [奖励塑造](#rewards-shaping)\n  - [策略优化](#policy-optimization)\n    - [策略梯度目标](#policy-gradient-objective)\n    - [基于评论家的算法](#critic-based-algorithms)\n    - [无评论家算法](#critic-free-algorithms)\n    - [离策略优化](#off-policy-optimization)\n    - [离策略优化（经验回放）](#off-policy-optimization-exp-replay)\n    - [正则化目标](#regularization-objectives)\n  - [采样策略](#sampling-strategy)\n    - [动态与结构化采样](#dynamic-and-structured-sampling)\n    - [采样超参数](#sampling-hyper-parameters)\n  - [训练资源](#training-resource)\n    - [静态语料库（代码）](#static-corpus-code)\n    - [静态语料库（STEM）](#static-corpus-stem)\n    - [静态语料库（数学）](#static-corpus-math)\n    - [静态语料库（智能体）](#static-corpus-agent)\n    - [静态语料库（混合）](#static-corpus-mix)\n    - [动态环境（基于规则）](#dynamic-environment-rule-based)\n    - [动态环境（基于代码）](#dynamic-environment-code-based)\n    - [动态环境（基于游戏）](#dynamic-environment-game-based)\n    - [动态环境（基于模型）](#dynamic-environment-model-based)\n    - [动态环境（基于集成）](#dynamic-environment-ensemble-based)\n    - [强化学习基础设施（主要）](#rl-infrastructure-primary)\n    - [强化学习基础设施（次要）](#rl-infrastructure-secondary)\n  - [应用](#applications)\n    - [编程智能体](#coding-agent)\n    - [搜索智能体](#search-agent)\n    - [浏览器使用智能体](#browser-use-agent)\n    - [深度研究智能体](#deepresearch-agent)\n    - [GUI&计算机智能体](#guicomputer-agent)\n    - [推荐智能体](#recommendation-agent)\n    - [其他智能体](#agent-others)\n    - [代码生成](#code-generation)\n    - [软件工程](#software-engineering)\n    - [多模态理解](#multimodal-understanding)\n    - [多模态生成](#multimodal-generation)\n    - [机器人任务](#robotics-tasks)\n    - [多智能体系统](#multi-agent-systems)\n    - [科学任务](#scientific-tasks)\n- [🌟 致谢](#-acknowledgment)\n- [✨ 星标历史](#-star-history)\n\n\n## 🗺️ 概述\n\n本综述全面探讨了**大型推理模型的强化学习**。\n\n\u003Cp align=\"center\">\n   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTsinghuaC3I_Awesome-RL-for-LRMs_readme_ad4973300f6a.png\" alt=\"大型推理模型强化学习综述概览\" style=\"width: 100%;\">\n\u003C\u002Fp>\n\n我们将综述分为五个主要部分：\n\n1. \u003Cu>基础组件：\u003C\u002Fu> 奖励设计、策略优化和采样策略\n2. \u003Cu>基础问题：\u003C\u002Fu> 大型推理模型强化学习中的关键争论与挑战\n3. \u003Cu>训练资源：\u003C\u002Fu> 静态语料库、动态环境和基础设施\n4. \u003Cu>应用：\u003C\u002Fu> 不同领域的实际应用\n5. 
\u003Cu>未来方向：\u003C\u002Fu> 新兴的研究机遇与挑战\n\n## 📄 论文列表\n\n### 前沿模型\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `Intern-S1` | Intern-S1：科学多模态基础模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.15763v1) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FIntern-S1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FInternLM\u002FIntern-S1) |\n| 2025-08 | `GLM-4.5` | GLM-4.5：智能体、推理与编码（ARC）基础模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.06471) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzai-org\u002FGLM-4.5?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-4.5) |\n| 2025-08 | `gpt-oss` | gpt-oss-120b & gpt-oss-20b 模型卡片 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10925) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopenai\u002Fgpt-oss?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgpt-oss) |\n| 2025-08 | `InternVL3.5` | InternVL3.5：在通用性、推理能力和效率方面推进开源多模态模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18265) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) |\n| 2025-07 | `Kimi K2` | Kimi K2：开放的智能体式人工智能 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20534) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMoonshotAI\u002FKimi-K2?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-K2) |\n| 2025-07 | `Step 3` | Step-3：规模大但价格亲民——面向低成本解码的模型与系统协同设计 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.19427) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fstepfun-ai\u002FStep3?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStep3) |\n| 2025-07 | `GLM-4.1V-Thinking` | GLM-4.5V 和 GLM-4.1V-Thinking：迈向基于可扩展强化学习的多功能多模态推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzai-org\u002FGLM-V?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-V) |\n| 2025-07 | `Skywork-R1V3` | Skywork-R1V3 技术报告 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06167) | [![GitHub 
星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-R1V?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-R1V) |\n| 2025-07 | `GLM-4.5V` | GLM-4.5V 和 GLM-4.1V-Thinking：迈向基于可扩展强化学习的多功能多模态推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzai-org\u002FGLM-V?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-V) |\n| 2025-06 | `Magistral` | Magistral | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10910) | - |\n| 2025-06 | `Minimax-M1` | MiniMax-M1：利用闪电注意力高效扩展推理时计算资源 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13585) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiniMax-AI\u002FMiniMax-M1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMiniMax-AI\u002FMiniMax-M1) |\n| 2025-05 | `MiMo` | MiMo：释放语言模型的推理潜力——从预训练到后训练 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07608) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FXiaomiMiMo\u002FMiMo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FXiaomiMiMo\u002FMiMo) |\n| 2025-05 | `Qwen3` | Qwen3 技术报告 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.09388) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3) |\n| 2025-05 | `Llama-Nemotron-Ultra` | Llama-Nemotron：高效推理模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00949) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FMegatron-LM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) |\n| 2025-05 | `INTELLECT-2` | INTELLECT-2：通过全球去中心化强化学习训练的推理模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07291) | - |\n| 2025-05 | `Hunyuan-TurboS` | Hunyuan-TurboS：通过 Mamba-Transformer 协同与自适应思维链推进大型语言模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15431) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencent\u002FHunyuan-TurboS?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTencent\u002FHunyuan-TurboS) |\n| 2025-05 | `Skywork OR-1` | Skywork 开放推理器 1 技术报告 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22312) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-OR1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-OR1) |\n| 2025-04 | `Phi-4 Reasoning` | Phi-4-reasoning 技术报告 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.21318) | - |\n| 2025-04 | `Skywork-R1V2` | Skywork R1V2：用于推理的多模态混合强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16656) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-R1V?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-R1V) |\n| 2025-04 | `InternVL3` | InternVL3：探索开源多模态模型的高级训练与推理时策略 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10479) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) |\n| 2025-03 | `ORZ` | Open-Reasoner-Zero：在基础模型上扩展强化学习的开源方法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24290) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero) |\n| 2025-01 | `DeepSeek-R1` | DeepSeek-R1：通过强化学习激励大型语言模型的推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12948) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1) |\n| - | `QwQ` | QwQ-32B：拥抱强化学习的力量 | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwq-32b\u002F) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwQ?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwQ) |\n| - | `Seed-OSS` | Seed-OSS 开源模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002Fseed-oss) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByteDance-Seed\u002Fseed-oss?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002Fseed-oss) |\n| - | `ERNIE-4.5-Thinking` | ERNIE 4.5 技术报告 | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fernie.baidu.com\u002Fblog\u002Fpublication\u002FERNIE_Technical_Report.pdf) | - 
|\n\n### 奖励设计\n#### 生成式奖励\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `CAPO` | CAPO：通过可验证的生成式信用分配提升大模型推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02298) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fandyclsr\u002FCAPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fandyclsr\u002FCAPO) |\n| 2025-08 | `CompassVerifier` | CompassVerifier：用于大模型评估与结果奖励的统一且鲁棒的验证器 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.03686) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopen-compass\u002FCompassVerifier?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FCompassVerifier) |\n| 2025-08 | `Cooper` | Cooper：在大语言模型强化学习中协同优化策略与奖励模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.05613) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzju-real\u002Fcooper?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzju-real\u002Fcooper) |\n| 2025-08 | `ReviewRL` | ReviewRL：迈向基于强化学习的自动化科学评审 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10308) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FMARTI?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FMARTI) |\n| 2025-08 | `Rubicon` | 基于评分标准锚点的强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12790) | - |\n| 2025-08 | `RuscaRL` | 打破探索瓶颈：面向通用大模型推理的评分标准辅助强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16949) | - |\n| 2025-07 | `OMNI-THINKER` | OMNI-THINKER：通过混合奖励的多任务强化学习提升大模型跨领域泛化能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.14783) | - |\n| 2025-07 | `URPO` | URPO：面向大语言模型的统一奖励与策略优化框架 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.17515) | - |\n| 2025-07 | `RaR` | 评分标准即奖励：超越可验证领域的强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.17746) | - |\n| 2025-07 | `RLCF` | 对齐语言模型时，检查清单比奖励模型更有效 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.18624) | - |\n| 2025-07 | `PCL` | 针对语言模型的完成后学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20252) | - |\n| 2025-07 | `K2` | KIMI K2：开放的代理智能 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20534) | - |\n| 2025-07 | `LIBRA` | LIBRA：通过学会思考来评估和改进奖励模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21645) | - |\n| 2025-07 | `TP-GRPO` | 善于学习者会反思自己的思维：生成式PRM使大型推理模型成为更高效的数学学习者 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23317) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcs-holder\u002Ftp_grpo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcs-holder\u002Ftp_grpo) |\n| 2025-06 | `RewardAnything` | RewardAnything：可泛化的遵循原则的奖励模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.03637) | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fzhuohaoyu.github.io\u002FRewardAnything\u002F) |\n| 2025-06 | `Writing-Zero` | Writing-Zero：弥合不可验证任务与可验证奖励之间的鸿沟 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.00103) | - |\n| 2025-06 | `Critique-GRPO` | Critique-GRPO：利用自然语言和数值反馈推进大模型推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.03106) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhangxy-2019\u002Fcritique-GRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzhangxy-2019\u002Fcritique-GRPO) |\n| 2025-06 | `PAG` | PAG：以策略作为生成式验证器的多轮强化大模型自我修正 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10406) | - |\n| 2025-06 | `GRAM` | GRAM：用于奖励泛化的生成式基础奖励模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14175) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNiuTrans\u002FGRAM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNiuTrans\u002FGRAM) |\n| 2025-06 | `ProxyReward` | 从通用到定向奖励：在开放式长上下文生成任务中超越GPT-4 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.16024) | - |\n| 2025-06 | `QA-LIGN` | QA-LIGN：通过宪法式分解问答对大模型进行对齐 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08123) | - |\n| 2025-05 | `RM-R1` | RM-R1：将奖励建模视为推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.02387) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRM-R1-UIUC\u002FRM-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRM-R1-UIUC\u002FRM-R1) |\n| 2025-05 | 
`J1` | J1：通过强化学习激励“法官型”大模型思考 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10320) | - |\n| 2025-05 | `TinyV` | TinyV：减少验证中的假阴性有助于提升大模型推理的强化学习效果 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14625) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fuw-nsl\u002FTinyV?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fuw-nsl\u002FTinyV) |\n| 2025-05 | `General-Reasoner` | General-Reasoner：推动大模型在所有领域的推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14652) | - |\n| 2025-05 | `RRM` | 奖励推理模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14674) | - |\n| 2025-05 | `RL Tango` | RL Tango：为语言推理同时强化生成器与验证器 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15034) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkaiwenzha\u002Frl-tango?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fkaiwenzha\u002Frl-tango) |\n| 2025-05 | `Think-RM` | Think-RM：在生成式奖励模型中实现长程（long-horizon）推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16265) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIlgeeHong\u002FThink-RM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FIlgeeHong\u002FThink-RM) |\n| 2025-04 | `JudgeLRM` | JudgeLRM：将大型推理模型用作评判者 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00050) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNuoJohnChen\u002FJudgeLRM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNuoJohnChen\u002FJudgeLRM) |\n| 2025-04 | `GenPRM` | GenPRM：通过生成式推理扩展过程奖励模型的测试时计算规模 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00891) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRyanLiu112\u002FGenPRM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRyanLiu112\u002FGenPRM) |\n| 2025-04 | `DeepSeek-GRM` | 通用奖励建模的推理时扩展 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.02495) | - |\n| 2025-04 | `AIR` | AIR：偏好数据集中标注、指令与响应对的系统性分析 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.03612) | - |\n| 2025-04 | `Pairwise-RL` | 用于RLHF的统一成对框架：连接生成式奖励建模与策略优化 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.04950) | - |\n| 2025-04 | `xVerify` | xVerify：高效推理模型评估用答案验证器 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10481) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIAAR-Shanghai\u002FxVerify?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FIAAR-Shanghai\u002FxVerify) |\n| 2025-04 | `Seed-Thinking-v1.5` | Seed1.5-Thinking：借助强化学习推进卓越推理模型的发展 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.13914) | - |\n| 2025-04 | `ThinkPRM` | 会思考的过程奖励模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16828) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmukhal\u002Fthinkprm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmukhal\u002Fthinkprm) |\n| 2025-03 | - | 跨越奖励之桥：通过可验证奖励在不同领域扩展强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23829) | - |\n| 2025-02 | - | 数学推理的自奖励修正 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19613) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRLHFlow\u002FSelf-rewarding-reasoning-LLM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FSelf-rewarding-reasoning-LLM) |\n| 2024-10 | `GenRM` | 生成式奖励模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12832) | - |\n| 2024-08 | `CLoud` | 大声批评型奖励模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.11791) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzankner\u002FCLoud?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzankner\u002FCLoud) |\n| 2024-08 | `Generative Verifier` | 生成式验证器：将奖励建模视为下一个词预测 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.15240) | - |\n| 2024-01 | `Self-Rewarding LM` | 自奖励语言模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10020) | - |\n| 2023-10 | `Auto-J` | 用于评估对齐情况的生成式评判者 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.05470) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGAIR-NLP\u002Fauto-j?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002Fauto-j) |\n| 2023-06 | `LLM-as-a-Judge` | 
使用 MT-Bench 和 Chatbot Arena 评判“法官型”大模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05685) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flm-sys\u002FFastChat?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat) |\n\n#### 密集奖励\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `Tree-GRPO` | 面向LLM智能体强化学习的树搜索 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.21240) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAMAP-ML\u002FTree-GRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAMAP-ML\u002FTree-GRPO) |\n| 2025-09 | `AttnRL` | 注意力作为指南针：推理模型中基于过程监督的强化学习的高效探索 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.26628) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRyanLiu112\u002FAttnRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRyanLiu112\u002FAttnRL) |\n| 2025-09 | `TARL` | 面向交互式多模态工具使用智能体的过程监督强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.14480) | - |\n| 2025-09 | `PROF` | 不止于正确性：通过强化学习训练协调过程与结果奖励 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03403) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChenluye99\u002FPROF?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChenluye99\u002FPROF) |\n| 2025-09 | `HICRA` | 通过强化学习在LLM中涌现层次化推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03646) | - |\n| 2025-08 | `KlearReasoner` | Klear-Reasoner：通过梯度保持剪裁策略优化提升推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.07629) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKwai-Klear\u002FKlearReasoner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKwai-Klear\u002FKlearReasoner) |\n| 2025-08 | `CAPO` | CAPO：通过可验证的生成式信用分配增强LLM推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02298) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fandyclsr\u002FCAPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fandyclsr\u002FCAPO) |\n| 2025-08 | `GTPO & GRPO-S` | GTPO和GRPO-S：结合策略熵的标记级和序列级奖励塑造 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.04349) | - |\n| 2025-08 | `VSRM` | 通过可验证的逐步奖励促进高效推理 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10293) | - |\n| 2025-08 | `G-RA` | 基于门控奖励稳定长期多轮强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10548) | - |\n| 2025-08 | `SSPO` | SSPO：用于过程监督和推理压缩的自追踪逐步偏好优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12604) | - |\n| 2025-08 | `AIRL-S` | 你的RL奖励函数就是最好的PRM用于搜索：统一RL与基于搜索的TTS | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.14313) | - |\n| 2025-08 | `TreePO` | TreePO：利用启发式树状建模弥合策略优化、有效性与推理效率之间的差距 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.17445) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmultimodal-art-projection\u002FTreePO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmultimodal-art-projection\u002FTreePO) |\n| 2025-08 | `MUA-RL` | MUA-RL：面向代理式工具使用的多轮用户交互智能体强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18669) | - |\n| 2025-07 | `SPRO` | 通过重新定义逐步优势进行自我引导的过程奖励优化，以加强过程强化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01551) | - |\n| 2025-07 | `FR3E` | 首次返回，激发熵的探索 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.07017) | - |\n| 2025-07 | `ARPO` | 代理式强化策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.19849) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUC-NLPIR\u002FARPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUC-NLPIR\u002FARPO) |\n| 2025-07 | `TP-GRPO` | 好的学习者会思考自己的思维：生成式PRM使大型推理模型成为更高效的数学学习者 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23317) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcs-holder\u002Ftp_grpo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcs-holder\u002Ftp_grpo) |\n| 2025-06 | `TreeRPO` | TreeRPO：树相对策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05183) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangzhch6\u002FTreeRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyangzhch6\u002FTreeRPO) |\n| 2025-06 | `TreeRL` | TreeRL：基于在线策略树搜索的LLM强化学习 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11902) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FTreeRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FTreeRL) |\n| 2025-06 | `Entropy Advantage` | 带有探索的推理：LLM强化学习的熵视角 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14758) | - |\n| 2025-06 | `ReasonFlux-PRM` | ReasonFlux-PRM：面向LLM长链式思维推理的轨迹感知PRM | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.18896) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGen-Verse\u002FReasonFlux?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FReasonFlux) |\n| 2025-05 | `S-GRPO` | S-GRPO：通过强化学习实现推理模型的早期退出 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07686) | - |\n| 2025-05 | `GiGPO` | 面向LLM智能体训练的组内组策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10978) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FlangfengQ\u002Fverl-agent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FlangfengQ\u002Fverl-agent) |\n| 2025-05 | - | 通过轮次级信用分配强化LLM智能体的多轮推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11821) | - |\n| 2025-05 | `Tango` | RL探戈：联合强化生成器与验证器进行语言推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15034) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkaiwenzha\u002Frl-tango?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fkaiwenzha\u002Frl-tango) |\n| 2025-05 | `StepSearch` | StepSearch：通过逐步近端策略优化激发LLM的搜索能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15107) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZillwang\u002FStepSearch?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FZillwang\u002FStepSearch) |\n| 2025-05 | - | 通过大型语言模型奖励分解，使对话智能体与全局反馈对齐 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15922) | - |\n| 2025-05 | `Tool-Star` | Tool-Star：通过强化学习赋能LLM驱动的多工具推理者 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16410) | [![GitHub 
Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdongguanting\u002FTool-Star?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdongguanting\u002FTool-Star) |\n| 2025-05 | `SPA-RL` | SPA-RL：通过逐步进展归因强化LLM智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20732) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWangHanLinHenry\u002FSPA-RL-Agent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FWangHanLinHenry\u002FSPA-RL-Agent) |\n| 2025-05 | `SPO` | 段落策略优化：大型语言模型RL中的有效段级信用分配 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23564) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIFrameResearch\u002FSPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAIFrameResearch\u002FSPO) |\n| 2025-04 | `GenPRM` | GenPRM：通过生成式推理扩展过程奖励模型的测试时计算 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00891) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRyanLiu112\u002FGenPRM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRyanLiu112\u002FGenPRM) |\n| 2025-04 | `PURE` | 停止求和：最小形式的信用分配就是过程奖励模型进行推理所需要的全部 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15275) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCJReinforce\u002FPURE?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FCJReinforce\u002FPURE) |\n| 2025-03 | `MRT` | 通过元强化微调优化测试时计算 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07572) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCMU-AIRe\u002FMRT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FCMU-AIRe\u002FMRT) |\n| 2025-03 | `SWEET-RL` | SWEET-RL：在协作推理任务上训练多轮LLM智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.15478) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Fsweet_rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsweet_rl) |\n| 2025-02 | `PRIME` | 通过隐式奖励进行过程强化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FPRIME?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FPRIME) |\n| 2024-12 | `Implicit PRM` | 无需过程标签即可获得免费的过程奖励 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.01981) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FImplicitPRM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FImplicitPRM) |\n| 2024-10 | `VinePPO` | VinePPO：改进LLM RL训练中的信用分配 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01679) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMcGill-NLP\u002FVinePPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMcGill-NLP\u002FVinePPO) |\n| 2024-10 | `PAV` | 奖励进展：扩展用于LLM推理的自动化过程验证器 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08146) | - |\n| 2024-04 | - | 从$r$到$Q^*$：你的语言模型其实是一个Q函数 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.12358) | - |\n| 2024-03 | `GELI` | 通过将一次全局显式标注分解为局部隐式多模态反馈来改进对话智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.11330) | - |\n| 2023-12 | `Math-Shepherd` | Math-Shepherd：无需人工标注即可逐步验证并强化LLM | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08935) | - |\n| 2023-05 | `PRM800K` | 让我们一步步验证吧 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.20050) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopenai\u002Fprm800k?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fprm800k) |\n| 2022-11 | - | 利用过程与结果反馈解决数学应用题 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.14275) | - |\n\n#### 无监督奖励\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `Vision-Zero` | Vision-Zero：通过策略性游戏化自我博弈实现可扩展的多模态语言模型自我改进 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25541) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwangqinsi1\u002FVision-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwangqinsi1\u002FVision-Zero) |\n| 2025-08 | `Co-Reward` | Co-Reward：基于对比一致性的自监督强化学习用于大语言模型推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.00410) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftmlr-group\u002FCo-Reward?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ftmlr-group\u002FCo-Reward) |\n| 2025-08 | `SQLM` | 自提问语言模型 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.03682) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flili-chen\u002Fself-questioning-lm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flili-chen\u002Fself-questioning-lm) |\n| 2025-08 | `R-zero` | R-Zero：从零数据开始自我演化的推理型大语言模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.05004) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChengsong-Huang\u002FR-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChengsong-Huang\u002FR-Zero) |\n| 2025-08 | `ETTRL` | ETTRL：通过熵机制在大语言模型测试时强化学习中平衡探索与利用 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11356) | - |\n| 2025-07 | `RLSF` | 基于自我反馈的强化学习对大语言模型进行后训练 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21931) | - |\n| 2025-06 | `RLSC` | 只需置信度：语言模型的小样本强化学习微调 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06395) | - |\n| 2025-06 | `RPT` | 强化预训练 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08007) | - |\n| 2025-06 | `CoVo` | 一致性路径通向真理：用于大语言模型推理的自奖励强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08745) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsastpg\u002FCoVo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsastpg\u002FCoVo) |\n| 2025-06 | `SEAL` | 自适应语言模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10943) | - |\n| 2025-06 | `Spurious Rewards` | 虚假奖励：重新思考 RLVR 中的训练信号 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10947) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fruixin31\u002FSpurious_Rewards?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fruixin31\u002FSpurious_Rewards) |\n| 2025-06 | `No Free Lunch` | 没有免费的午餐：重新思考大语言模型推理的内部反馈 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17219) | - |\n| 2025-05 | `Absolute Zero` | Absolute Zero：零数据下的强化自我博弈推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.03335) | [![GitHub 
星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLeapLabTHU\u002FAbsolute-Zero-Reasoner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLeapLabTHU\u002FAbsolute-Zero-Reasoner) |\n| 2025-05 | `EM-RL` | 熵最小化在大语言模型推理中的不合理有效性 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15134) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshivamag125\u002FEM_PT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fshivamag125\u002FEM_PT) |\n| 2025-05 | `SSR-Zero` | SSR-Zero：用于机器翻译的简单自奖励强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16637) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKelaxon\u002FSSR-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKelaxon\u002FSSR-Zero) |\n| 2025-05 | - | 来自格式和长度的代理信号：无需真实答案的强化学习解决数学问题 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19439) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinsightLLM\u002Frl-without-gt?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FinsightLLM\u002Frl-without-gt) |\n| 2025-05 | `RLIF` | 学习在没有外部奖励的情况下进行推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19590) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsunblaze-ucb\u002FIntuitor?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsunblaze-ucb\u002FIntuitor) |\n| 2025-05 | `SeRL` | SeRL：针对有限数据的大语言模型自我博弈强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20347) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwantbook-book\u002FSeRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwantbook-book\u002FSeRL) |\n| 2025-05 | `SRT` | 大型推理模型能否自我训练？ | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21444) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftajwarfahim\u002Fsrt?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ftajwarfahim\u002Fsrt) |\n| 2025-05 | `RENT-RL` | 单纯最大化置信度即可提升推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22660) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsatrams\u002Frent-rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsatrams\u002Frent-rl) |\n| 2025-04 | `EMPO` | 正确的问题本身就是答案的一半：完全无监督的大语言模型推理激励机制 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.05812) | [![GitHub 
星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQingyangZhang\u002FEMPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQingyangZhang\u002FEMPO) |\n| 2025-04 | `TRANS-ZERO` | TRANS-ZERO：自我博弈激励大语言模型进行无平行语料的多语言翻译 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14669) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNJUNLP\u002Ftrans0?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNJUNLP\u002Ftrans0) |\n| 2025-04 | `TTRL` | TTRL：测试时强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16084) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FTTRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FTTRL) |\n| 2025-04 | `One-Shot-RLVR` | 使用单个训练样例对大语言模型进行推理强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20571) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fypwang61\u002FOne-Shot-RLVR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fypwang61\u002FOne-Shot-RLVR) |\n| 2025-02 | `CAGSR` | 一种使用交叉注意力信号对大语言模型进行微调的自监督强化学习方法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10482) | - |\n| 2024-07 | `MINIMO` | 从内在动机出发学习形式数学 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.00695) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgpoesia\u002Fminimo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fgpoesia\u002Fminimo) |\n\n#### 奖励塑造\n\n| 日期 | 名称 | 标题 | 论文 | GitHub |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `CDE` | CDE：用于大型语言模型高效强化学习的好奇心驱动探索 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.09675) | - |\n| 2025-09 | `DARLING` | 联合强化语言模型生成中的多样性和质量 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02534) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Fdarling?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdarling) |\n| 2025-09 | `DRER` | 通过 RL 增强的思维链重新思考大型语言模型中的推理质量 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06024) | - |\n| 2025-09 | `OBE` | 面向 LLM 推理的基于结果的探索 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06941) | - |\n| 2025-08 | `Pass@kTraining` | 用于自适应平衡大型推理模型探索与利用的 Pass@k 训练 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10751) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FPassk_Training?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FPassk_Training) |\n| 2025-05 | `PKPO` | Pass@K 策略优化：解决更难的强化学习问题 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15201) | - |\n| 2025-05 | `rl-without-gt` | 来自格式和长度的替代信号：无需真实答案的数学问题求解强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19439) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinsightLLM\u002Frl-without-gt?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FinsightLLM\u002Frl-without-gt) |\n| 2025-03 | `CrossDomain-RLVR` | 跨越奖励之桥：通过可验证奖励拓展跨不同领域的强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23829) | - |\n| 2025-01 | `DeepSeek-R1` | DeepSeek-R1：通过强化学习激励 LLM 的推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12948) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1) |\n| 2024-09 | `Qwen2.5-Math` | Qwen2.5-Math 技术报告：通过自我改进迈向数学专家模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12122) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2.5-Math?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Math) |\n\n\n\n### 策略优化\n#### 策略梯度目标\n\n| 日期 | 名称 | 标题 | 论文 | GitHub |\n|:-:|:-:|:-|:-:|:-:|\n| 2017-07 | `PPO` | 近端策略优化算法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1707.06347) | - |\n| - | `PG` | 带函数近似的强化学习策略梯度方法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F1999\u002Ffile\u002F464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf) | - |\n| - | `REINFORCE` | 用于联结主义强化学习的简单统计梯度跟随算法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1007\u002FBF00992696) | - |\n| - | `TRPO` | 信任区域策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fproceedings.mlr.press\u002Fv37\u002Fschulman15.pdf) | - |\n\n#### 基于批评家的算法\n\n| 日期 | 名称 | 标题 | 论文 | GitHub |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `VL-DAC` | 在合成世界中利用强化学习提升视觉-语言模型训练，以实现现实世界的成功 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.04280) | [![GitHub 
星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcorl-team\u002FVL-DAC?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcorl-team\u002FVL-DAC) |\n| 2025-08 | `VRPO` | VRPO：在噪声监督下重新思考价值建模以进行稳健的强化学习训练 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.03058) | - |\n| 2025-05 | `VerIPO` | VerIPO：具有迭代策略优化的长推理视频-R1模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19000) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHITsz-TMG\u002FVerIPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FHITsz-TMG\u002FVerIPO) |\n| 2025-04 | `VAPO` | VAPO：用于高级推理任务的高效且可靠的强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.05118) | - |\n| 2025-03 | `VCPPO` | PPO在长CoT中崩溃的背后是什么？价值优化才是关键 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.01491) | - |\n| 2025-03 | `Open-Reasoner-Zero` | Open-Reasoner-Zero：一种在基础模型上扩展强化学习的开源方法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.24290) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero) |\n| 2025-02 | `PRIME` | 通过隐式奖励进行过程强化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.01456) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FPRIME?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FPRIME) |\n| 2024-12 | `Implicit PRM` | 无需过程标签即可获得免费的过程奖励 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.01981) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flifan-yuan\u002FImplicitPRM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flifan-yuan\u002FImplicitPRM) |\n| 2023-12 | `Math-Shepherd` | Math-Shepherd：无需人工标注即可逐步验证并强化LLM | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08935) | - |\n| 2015-06 | `GAE` | 使用广义优势估计进行高维连续控制 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1506.02438) | - |\n| - | `AutoPSV` | AutoPSV：自动化过程监督验证器 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Ffile\u002F9246aa822579d9b29a140ecdac36ad60-Paper-Conference.pdf) | [![GitHub 
星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frookie-joe\u002FAutoPSV?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Frookie-joe\u002FAutoPSV) |\n\n#### 无批评家算法\n\n| 日期 | 名称 | 标题 | 论文 | GitHub |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `UPGE` | 面向大型语言模型后训练的统一视角 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.04419) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FUnify-Post-Training?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FUnify-Post-Training) |\n| 2025-09 | `SPO` | 单流策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.13232) | - |\n| 2025-08 | `LitePPO` | 第一部分：技巧还是陷阱？深入探讨用于LLM推理的强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.08221v1) | - |\n| 2025-07 | `R1-RE` | R1-RE：基于RLVR的跨领域关系抽取 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04642) | - |\n| 2025-07 | `GSPO` | 群组序列策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.18071) | - |\n| 2025-06 | `CISPO` | MiniMax-M1：通过闪电注意力高效扩展推理时计算 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.13585) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiniMax-AI\u002FMiniMax-M1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMiniMax-AI\u002FMiniMax-M1) |\n| 2025-05 | `KRPO` | 基于卡尔曼滤波增强的群组相对策略优化，用于语言模型推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07527) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbillhhh\u002FKRPO_LLMs_RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fbillhhh\u002FKRPO_LLMs_RL) |\n| 2025-05 | `CPGD` | CPGD：迈向稳定的基于规则的语言模型强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.12504) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FModalMinds\u002FMM-EUREKA?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FModalMinds\u002FMM-EUREKA) |\n| 2025-05 | `NFT` | 在数学推理中弥合监督学习与强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.18116) | - |\n| 2025-05 | `Clip-Cov\u002FKL-Cov` | 强化学习在推理型语言模型中的熵机制 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.22617) | [![GitHub 
Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FEntropy-Mechanism-of-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FEntropy-Mechanism-of-RL) |\n| 2025-03 | `OpenVLThinker` | OpenVLThinker：通过迭代SFT-RL循环实现复杂的视觉-语言推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.17352) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyihedeng9\u002FOpenVLThinker?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyihedeng9\u002FOpenVLThinker) |\n| 2025-03 | `DAPO` | DAPO：大规模开源LLM强化学习系统 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.14476) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBytedTsinghua-SIA\u002FDAPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FBytedTsinghua-SIA\u002FDAPO) |\n| 2025-03 | `Dr. GRPO` | 理解类似R1-Zero的训练：一个批判性视角 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.20783) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsail-sg\u002Funderstand-r1-zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Funderstand-r1-zero) |\n| 2025-01 | `Kimi k1.5` | Kimi k1.5：利用LLM扩展强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.12599) | - |\n| 2024-02 | `RLOO` | 回归基础：重新审视针对LLM人类反馈学习的REINFORCE风格优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.14740) | - |\n| 2024-02 | `GRPO` | DeepSeekMath：突破开放语言模型的数学推理极限 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.03300) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-Math?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-Math) |\n| 2023-10 | `ReMax` | ReMax：一种简单、有效且高效的大型语言模型对齐方法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.10505) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliziniu\u002FReMax?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fliziniu\u002FReMax) |\n| - | `REINFORCE` | 用于联结主义强化学习的简单统计梯度跟随算法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1007\u002FBF00992696) | - |\n| - | `REINFORCE++` | REINFORCE++：一种对提示和奖励模型均具有鲁棒性的高效RLHF算法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F387487679_REINFORCE_An_Efficient_RLHF_Algorithm_with_Robustnessto_Both_Prompt_and_Reward_Models) | [![GitHub 
Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenRLHF\u002FOpenRLHF?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF) |\n| - | `VinePPO` | VINEPPO：通过精细化信用分配释放LLM推理的RL潜力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fopenreview.net\u002Fpdf?id=5mJrGtXVwz) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMcGill-NLP\u002FVinePPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMcGill-NLP\u002FVinePPO) |\n| - | `FlashRL` | 基于量化 rollout 的快速RL训练 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Ffengyao.notion.site\u002Fflash-rl) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyaof20\u002FFlash-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyaof20\u002FFlash-RL) |\n\n#### 离策略优化\n\n| 日期 | 名称 | 标题 | 论文 | GitHub |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `BRIDGE` | 超越两阶段训练：LLM 推理中的协作式 SFT 和 RL | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.06948) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChanLiang\u002FBRIDGE?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChanLiang\u002FBRIDGE) |\n| 2025-09 | `HPT` | 通往大语言模型后训练统一视角之路 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.04419) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FUnify-Post-Training?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FUnify-Post-Training) |\n| 2025-08 | `DFT` | 关于 SFT 的泛化：基于奖励校正的强化学习视角 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.05629) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyongliang-wu\u002FDFT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyongliang-wu\u002FDFT) |\n| 2025-08 | `RED` | 回忆-扩展动力学：通过可控探索与精细化离线整合提升小型语言模型性能 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.16677) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmillioniron\u002FOpenRLHF-Millioniron-?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmillioniron\u002FOpenRLHF-Millioniron-) |\n| 2025-07 | `Prefix‑RFT` | 使用前缀采样融合监督与强化微调 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.01679) | - |\n| 2025-07 | `ReMix` | 挤干湿海绵：面向大语言模型的高效离策略强化微调 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.06892) | [![GitHub 
星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAnitaLeungxx\u002FReMix-Reincarnated-Mix-policy-Proximal-Policy-Gradient?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAnitaLeungxx\u002FReMix-Reincarnated-Mix-policy-Proximal-Policy-Gradient) |\n| 2025-06 | `ReLIFT` | 学习强化学习无法做到的事：针对最难问题的交错式在线微调 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.07527) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTheRoadQaQ\u002FReLIFT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTheRoadQaQ\u002FReLIFT) |\n| 2025-06 | `BREAD` | BREAD：基于专家锚点的分支式 rollout，连接 SFT 和 RL 以实现推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.17211) | - |\n| 2025-06 | `SRFT` | SRFT：一种结合监督与强化微调的单阶段推理方法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.19767) | - |\n| 2025-05 | `AMPO` | 通过模式策略优化实现社交语言代理的自适应思维 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.02156) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMozerWang\u002FAMPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMozerWang\u002FAMPO) |\n| 2025-05 | `UFT` | UFT：统一监督与强化微调 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.16984) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliumy2010\u002FUFT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fliumy2010\u002FUFT) |\n| 2025-04 | `LUFFY` | 在离策略指导下学习推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.14945) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FElliottYan\u002FLUFFY?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FElliottYan\u002FLUFFY) |\n| 2025-03 | `SPO` | 软策略优化：面向序列模型的在线离策略 RL | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.05453) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIFrameResearch\u002FSPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAIFrameResearch\u002FSPO) |\n| 2025-03 | `TOPR` | TAPERED OFF-POLICY REINFORCE：稳定高效的 LLM 强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.14286) | - |\n| 2024-05 | `IFT` | 直观微调：迈向将对齐简化为单一过程 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.11870) | [![GitHub 
星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FIntuitive-Fine-Tuning?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FIntuitive-Fine-Tuning) |\n| 2023-05 | `DPO` | 直接偏好优化：你的语言模型其实是一个奖励模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.18290) | - |\n| 2015-11 | - | 深度卷积网络的定点量化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1511.06393) | - |\n| - | - | 你的高效 RL 框架其实暗中为你带来了离策略 RL 训练 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Ffengyao.notion.site\u002Foff-policy-rl) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyaof20\u002Fverl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyaof20\u002Fverl) |\n\n#### 离策略优化（经验回放）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `SAPO` | 分享即关怀：通过集体强化学习经验共享实现高效的语言模型后训练 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.08721) | - |\n| 2025-09 | `SEELE` | 停留在最佳状态：基于能力自适应提示支架的响应式推理进化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06923) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChillingDream\u002Fseele?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChillingDream\u002Fseele) |\n| 2025-08 | `Memory-R1` | Memory-R1：通过强化学习增强大型语言模型智能体的记忆管理与利用能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.19828) | - |\n| 2025-07 | `RLEP` | RLEP：面向LLM推理的带经验回放的强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.07451) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKwai-Klear\u002FRLEP?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKwai-Klear\u002FRLEP) |\n| 2025-06 | `EFRame` | EFRame：基于探索-过滤-重放强化学习框架的深度推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.22200) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F597358816\u002FEFRame?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002F597358816\u002FEFRame) |\n| 2025-05 | `ARPO` | ARPO：带有经验回放的GUI智能体端到端策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.16282) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FARPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FARPO) |\n| 2025-04 | - | 通过回顾性回放提升LLM推理的强化学习探索 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.14363) | - |\n\n#### 正则化目标\n\n| 日期 | 名称 | 标题 | 论文 | GitHub |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-10 | `ASPO` | ASPO：非对称重要性采样策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.06062) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwizard-III\u002FArcher2.0?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwizard-III\u002FArcher2.0) |\n| 2025-09 | `CE-GPPO` | CE-GPPO：强化学习中基于梯度保留剪裁的协调熵策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2509.20712) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKwai-Klear\u002FCE-GPPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKwai-Klear\u002FCE-GPPO) |\n| 2025-09 | `CDE` | CDE：用于大语言模型高效强化学习的好奇心驱动探索 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.09675) | - |\n| 2025-09 | `DPH RL` | 散度的选择：可验证奖励下缓解强化学习多样性崩溃被忽视的关键 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.07430) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fseamoke\u002FDPH-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fseamoke\u002FDPH-RL) |\n| 2025-09 | `empgseed-seed` | 利用不确定性：面向长时序 LLM 代理的熵调节策略梯度 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.09265) | - |\n| 2025-07 | `Archer` | 稳定知识，促进推理：RLVR 的双令牌约束 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.15778) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwizard-III\u002FArcherCodeR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwizard-III\u002FArcherCodeR) |\n| 2025-06 | `Bingo` | Bingo：通过动态且基于重要性的强化学习提升 LLM 的高效推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08125) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FBingo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FBingo) |\n| 2025-06 | `HighEntropy RL` | 超越 80\u002F20 法则：高熵少数令牌驱动 LLM 推理的有效强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.01939) | - |\n| 2025-06 | `Entropy RL` | 带探索的推理：LLM 强化学习的熵视角 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14758) | - |\n| 2025-06 | `ALP RL` | 恰到好处的思考：自适应长度惩罚强化学习实现高效推理 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05256) | - |\n| 2025-05 | `DisCO` | DisCO：利用判别式约束优化强化大型推理模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12366) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOptimization-AI\u002FDisCO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOptimization-AI\u002FDisCO) |\n| 2025-05 | `Skywork OR1` | Skywork Open Reasoner 1 技术报告 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.22312) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-OR1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-OR1) |\n| 2025-05 | `Entropy Mechanism` | 推理语言模型强化学习的熵机制 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.22617) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FEntropy-Mechanism-of-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FEntropy-Mechanism-of-RL) |\n| 2025-05 | `ProRL` | ProRL：延长强化学习时间扩大大语言模型的推理边界 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.24864) | - |\n| 2025-05 | `Short RL` | 通过长度感知优化实现推理模型的高效强化学习训练 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.1228) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flblankl\u002FShort-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flblankl\u002FShort-RL) |\n| 2025-03 | `DAPO` | DAPO：大规模开源 LLM 强化学习系统 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.14476) | - |\n| 2025-03 | `L1` | L1：利用强化学习控制推理模型的思考时长 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.04697) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcmu-l3\u002Fl1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcmu-l3\u002Fl1) |\n\n\n\n### 采样策略\n#### 动态与结构化采样\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-10 | `EEPO` | EEPO：通过先采样后遗忘增强探索的策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.05837) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChanLiang\u002FEEPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChanLiang\u002FEEPO) |\n| 2025-09 | `AttnRL` | 注意力作为指南：推理模型中基于过程监督的强化学习高效探索 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.26628) | - |\n| 2025-09 | `DACE` | 知道何时探索：难度感知置信度作为大语言模型强化学习的指导 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.00125) | - |\n| 2025-09 | `Parallel-R1` | Parallel-R1：通过强化学习实现并行思维 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.07980) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhengkid\u002FParallel-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzhengkid\u002FParallel-R1) |\n| 2025-08 | `G^2RPO-A` | G^2RPO-A：自适应引导下的引导式群体相对策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.13023) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FT-Lab-CUHKSZ\u002FG2RPO-A?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FT-Lab-CUHKSZ\u002FG2RPO-A) |\n| 2025-08 | `RuscaRL` | 打破探索瓶颈：用于通用大语言模型推理的评分标准支撑强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16949) | - |\n| 2025-08 | `TreePO` | TreePO：利用启发式树状建模弥合策略优化与推理效率之间的差距 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.17445) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmultimodal-art-projection\u002FTreePO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmultimodal-art-projection\u002FTreePO) |\n| 2025-07 | `ARPO` | 代理式强化策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.19849) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUC-NLPIR\u002FARPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUC-NLPIR\u002FARPO) |\n| 2025-06 | `TreeRPO` | TreeRPO：树状相对策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05183) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangzhch6\u002FTreeRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyangzhch6\u002FTreeRPO) |\n| 2025-06 | `E2H` | 由易到难的任务课程式强化学习提升大语言模型推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06632) | - |\n| 2025-06 | `TreeRL` | TreeRL：基于在线策略树搜索的大语言模型强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11902) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FTreeRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FTreeRL) 
|\n| 2025-05 | `ToTRL` | ToTRL：通过解谜解锁大语言模型思维树推理潜力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12717) | - |\n| 2025-03 | `DARS` | DARS：通过动态动作重采样和自适应树遍历提升编码智能体性能 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14269) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvaibhavagg303\u002FDARS-Agent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fvaibhavagg303\u002FDARS-Agent) |\n| 2025-03 | `DAPO` | DAPO：大规模开源大语言模型强化学习系统 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14476) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBytedTsinghua-SIA\u002FDAPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FBytedTsinghua-SIA\u002FDAPO) |\n| 2025-02 | `PRIME` | 通过隐式奖励进行过程强化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FPRIME?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FPRIME) |\n| - | `POLARIS` | POLARIS：一种针对先进推理模型的强化学习规模化后训练方案 | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fhkunlp.github.io\u002Fblog\u002F2025\u002FPolaris\u002F) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChenxinAn-fdu\u002FPOLARIS?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChenxinAn-fdu\u002FPOLARIS) |\n\n#### 采样超参数\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `GFPO` | 多采样少思考：用于简洁推理的分组过滤策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.09726) | - |\n| 2025-06 | `AceReason-Nemotron 1.1` | AceReason-Nemotron 1.1：通过SFT与RL协同推进数学与代码推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13284) | - |\n| 2025-06 | `T-PPO` | 截断近端策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.15050) | - |\n| 2025-06 | `Confucius3-Math` | Confucius3-Math：面向中国中小学数学学习的轻量级高性能推理LLM | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.18330) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fnetease-youdao\u002FConfucius3-Math?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fnetease-youdao\u002FConfucius3-Math) |\n| 2025-05 | `E3-RL4LLMs` | 提升LLM强化学习中的效率与探索能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.18573) | [![GitHub 
Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLiaoMengqi\u002FE3-RL4LLMs?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLiaoMengqi\u002FE3-RL4LLMs) |\n| 2025-05 | `AceReason-Nemotron` | AceReason-Nemotron：通过强化学习推进数学和代码推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16400) | - |\n| 2025-05 | `ProRL` | ProRL：延长型强化学习拓展大语言模型的推理边界 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24864) | - |\n| 2025-03 | - | 输出长度对DeepSeek-R1强制思维安全性的影响 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01923) | - |\n| 2025-03 | `DAPO` | DAPO：大规模开源LLM强化学习系统 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14476) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBytedTsinghua-SIA\u002FDAPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FBytedTsinghua-SIA\u002FDAPO) |\n| 2025-02 | `PRIME` | 通过隐式奖励进行过程强化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FPRIME?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FPRIME) |\n| 2025-02 | - | 训练语言模型以高效推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04463) | - |\n| - | `DeepScaleR` | DeepScaleR：通过扩展强化学习，以1.5B模型超越O1-Preview | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fpretty-radio-b75.notion.site\u002FDeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frllm-org\u002Frllm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Frllm-org\u002Frllm) |\n| - | `POLARIS` | POLARIS：针对高级推理模型的训练后强化学习扩展方案 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fhonorable-payment-890.notion.site\u002FPOLARIS-A-POst-training-recipe-for-scaling-reinforcement-Learning-on-Advanced-ReasonIng-modelS-1dfa954ff7c38094923ec7772bf447a1) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChenxinAn-fdu\u002FPOLARIS?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChenxinAn-fdu\u002FPOLARIS) |\n\n\n\n### 训练资源\n#### 静态语料库（代码）\n\n| 日期 | 名称 | 标题 | 论文 | GitHub |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-05 | `rStar-Coder` | rStar-Coder：利用大规模验证数据集扩展竞赛编程推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21297) | [![GitHub 
Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FrStar?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FrStar) |\n| 2025-04 | `Z1` | Z1：基于代码的高效测试时扩展 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00810) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fefficientscaling\u002FZ1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fefficientscaling\u002FZ1) |\n| 2025-04 | `OpenCodeReasoning` | OpenCodeReasoning：推进竞赛编程的数据蒸馏技术 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.01943) | - |\n| 2025-04 | `LeetCodeDataset` | LeetCodeDataset：用于代码大模型稳健评估与高效训练的时间序列数据集 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14655) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fnewfacade\u002FLeetCodeDataset?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fnewfacade\u002FLeetCodeDataset) |\n| 2025-03 | `KodCode` | KodCode：多样化、高挑战性且可验证的合成编程数据集 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.02951) | - |\n| 2025-01 | `SWE-Fixer` | SWE-Fixer：训练开源大模型以高效解决 GitHub 问题 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.05040) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FSWE-Fixer?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FInternLM\u002FSWE-Fixer) |\n| 2024-12 | `SWE-Gym` | 使用 SWE-Gym 训练软件工程智能体和验证器 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.21139) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSWE-Gym\u002FSWE-Gym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSWE-Gym\u002FSWE-Gym) |\n| - | `Code-R1` | Code-R1：通过可靠奖励重现代码领域的 R1 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fgithub.com\u002Fganler\u002Fcode-r1) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fganler\u002Fcode-r1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fganler\u002Fcode-r1) |\n| - | `codeforces-cots` | CodeForces CoTs | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopen-r1\u002Fcodeforces-cots) | - |\n| - | `DeepCoder` | DeepCoder：O3-mini 级别的全开源 14B 编码器 | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fwww.together.ai\u002Fblog\u002Fdeepcoder) | [![GitHub 
Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fagentica-project\u002Frllm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fagentica-project\u002Frllm) |\n\n#### 静态语料库（STEM）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `SSMR-Bench` | 为评估与强化学习合成乐谱问题 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.04059) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLinzwcs\u002FAutoMusicTheoryQA?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLinzwcs\u002FAutoMusicTheoryQA) |\n| 2025-09 | `Loong` | Loong：通过验证器大规模合成长链式思维 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03059) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcamel-ai\u002Floong?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcamel-ai\u002Floong) |\n| 2025-07 | `MegaScience` | MegaScience：推动科学推理后训练数据集的前沿 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.16812) | - |\n| 2025-06 | `ReasonMed` | ReasonMed：一个由多智能体生成的 37 万条数据集，用于推进医学推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.09513) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuSun-Work\u002FReasonMed?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FYuSun-Work\u002FReasonMed) |\n| 2025-05 | `ChemCoTDataset` | 超越化学问答：用模块化化学操作评估大模型的化学推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21318) | - |\n| 2025-02 | `NaturalReasoning` | NaturalReasoning：在真实场景中使用 280 万个高挑战性问题进行推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13124) | - |\n| 2025-01 | `SCP-116K` | SCP-116K：高质量的问题-解答数据集，以及高等教育科学领域自动化提取的通用流程 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.15587) | - |\n\n#### 静态语料库（数学）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-07 | `MiroMind-M1-RL-62K` | MiroMind-M1：基于上下文感知多阶段策略优化的开源数学推理进展 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.14683) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiroMindAI\u002FMiroMind-M1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMiroMindAI\u002FMiroMind-M1) |\n| 2025-04 | `DeepMath` | DeepMath-103K：用于推进推理的大规模、具有挑战性、去污染且可验证的数学数据集 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11456) | [![GitHub 
星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzwhe99\u002FDeepMath?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzwhe99\u002FDeepMath) |\n| 2025-04 | `OpenMathReasoning` | AIMO-2 冠军方案：使用 OpenMathReasoning 数据集构建最先进的数学推理模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16891) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FNeMo-Skills?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Skills) |\n| 2025-03 | `STILL-3-RL` | 关于激发和改进 R1 类推理模型的实证研究 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.04548) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FSlow_Thinking_with_LLMs?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FSlow_Thinking_with_LLMs) |\n| 2025-03 | `Light-R1` | Light-R1：从零开始及更进一步的长 COT 课程化 SFT、DPO 和 RL | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10460) | - |\n| 2025-03 | `DAPO` | DAPO：大规模开源 LLM 强化学习系统 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14476) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBytedTsinghua-SIA\u002FDAPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FBytedTsinghua-SIA\u002FDAPO) |\n| 2025-03 | `OpenReasoningZero` | Open-Reasoner-Zero：在基础模型上扩展强化学习的开源方法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24290) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero) |\n| 2025-02 | `PRIME` | 通过隐式奖励进行过程强化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FPRIME?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FPRIME) |\n| 2025-02 | `LIMO` | Limo：少即是多，用于推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03387) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGAIR-NLP\u002FLIMO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FLIMO) |\n| 2025-02 | `LIMR` | Limr：少即是多，用于 RL 扩展 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11886) | [![GitHub 
星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGAIR-NLP\u002FLIMR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FLIMR) |\n| 2025-02 | `Big-MATH` | Big-Math：用于语言模型中强化学习的大规模高质量数学数据集 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17387) | - |\n| - | `NuminaMath 1.5` | NuminaMath：AI4Maths 领域内最大的公开数据集，包含 86 万对竞赛数学题目与解答 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](http:\u002F\u002Ffaculty.bicmr.pku.edu.cn\u002F~dongbin\u002FPublications\u002Fnumina_dataset.pdf) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fproject-numina\u002Faimo-progress-prize?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fproject-numina\u002Faimo-progress-prize) |\n| - | `OpenR1-Math` | Open R1：DeepSeek-R1 的完全开源复现 | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fopen-r1) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuggingface\u002Fopen-r1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fopen-r1) |\n| - | `DeepScaleR` | DeepScaleR：通过 RL 扩展超越 O1 预览版的 15 亿参数模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fpretty-radio-b75.notion.site\u002FDeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2) | - |
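\n\n以上数学语料的共同用法是配合可验证奖励（RLVR）训练：从模型输出中程序化地抽取最终答案，再与参考答案比对计分。下面给出该奖励的一个极简 Python 示意（假设答案以 `<answer>` 标签给出并采用精确字符串匹配，标签与函数名均为示例约定；DeepScaleR 等实际方案通常还会做数学等价性归一化）：\n\n```python\nimport re\n\ndef extract_answer(text):\n    # 取最后一个 <answer>...<\u002Fanswer> 标签中的内容（标签格式为示例假设）\n    matches = re.findall('<answer>(.*?)<\u002Fanswer>', text, re.S)\n    return matches[-1].strip() if matches else None\n\ndef math_reward(completion, reference):\n    # 可验证奖励：预测答案与参考答案精确一致记 1.0，否则记 0.0\n    pred = extract_answer(completion)\n    return 1.0 if pred is not None and pred == reference.strip() else 0.0\n\nprint(math_reward('推理过程……<answer>42<\u002Fanswer>', '42'))  # 输出 1.0\n```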
\n\n#### 静态语料库（智能体）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `ASearcher` | 超越十轮：利用大规模异步强化学习解锁长程智能体搜索 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.07976) | - |\n| 2025-07 | `WebShaper` | WebShaper：通过信息搜索形式化实现智能体式数据合成 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.15061) | - |\n| 2025-05 | `ZeroSearch` | ZeroSearch：无需实际搜索即可激励大模型的搜索能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.04588) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FZeroSearch?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FZeroSearch) |\n| 2025-04 | `ToolRL` | ToolRL：奖励是工具学习所需的一切 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.13958) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fqiancheng0\u002FToolRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fqiancheng0\u002FToolRL) |\n| 2025-03 | `Search-R1` | Search-R1：利用强化学习训练大模型进行推理并调用搜索引擎 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09516) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPeterGriffinJin\u002FSearch-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPeterGriffinJin\u002FSearch-R1) |\n| 2025-03 | `ToRL` | ToRL：扩展工具集成的强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23383) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGAIR-NLP\u002FToRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FToRL) |\n| - | `MiroVerse` | MiroVerse V0.1：一个可复现、全轨迹、不断增长的深度研究数据集 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmiromind-ai\u002FMiroVerse-v0.1) | - |\n| 2025-03 | `DeepRetrieval` | DeepRetrieval：利用强化学习通过大语言模型“黑掉”真实搜索引擎和检索器 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00223) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpat-jj\u002FDeepRetrieval?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fpat-jj\u002FDeepRetrieval) |\n\n\n#### 静态语料库（混合）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `Graph-R1` | Graph-R1：用 NP-Hard 图问题释放大模型的推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.17387) | - |\n| 2025-06 | `RewardAnything` | RewardAnything：可泛化的遵循原则的奖励模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.03637) | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fzhuohaoyu.github.io\u002FRewardAnything\u002F) |\n| 2025-06 | `guru-RL-92k` | 从跨领域视角重新审视用于大模型推理的强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14965) | - |\n| 2025-05 | `Llama-Nemotron-PT` | Llama-Nemotron：高效推理模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00949) | - |\n| 2025-05 | `SkyWork OR1` | Skywork 开放推理器 1 技术报告 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22312) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-OR1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-OR1) |\n| 2025-03 | `OpenVLThinker` | OpenVLThinker：通过迭代 SFT-RL 循环实现复杂的视觉-语言推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.17352) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyihedeng9\u002FOpenVLThinker?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyihedeng9\u002FOpenVLThinker) |\n| - | `AM-DS-R1-0528-Distilled` | AM-DeepSeek-R1-0528-蒸馏版 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fgithub.com\u002Fa-m-team\u002Fa-m-models) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fa-m-team\u002Fa-m-models?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fa-m-team\u002Fa-m-models) |\n| - | `dolphin-r1` | 海豚 R1 数据集 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FQuixiAI\u002Fdolphin-r1) | - |\n| - | `SYNTHETIC-1\u002F2` | SYNTHETIC-1 发布：来自 Deepseek-R1 的两百万条协作生成的推理轨迹 | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fwww.primeintellect.ai\u002Fblog\u002Fsynthetic-1-release) | - |\n\n#### 动态环境（基于规则）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-06 | `ProtoReasoning` | ProtoReasoning：以原型为基础实现大模型的可泛化推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.15211) | - |\n| 2025-05 | `SynLogic` | SynLogic：大规模合成可验证推理数据，用于学习逻辑推理及其他任务 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19641) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiniMax-AI\u002FSynLogic?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMiniMax-AI\u002FSynLogic) |\n| 2025-05 | `Reasoning Gym` | REASONING GYM：具有可验证奖励的强化学习推理环境 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24760) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopen-thought\u002Freasoning-gym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fopen-thought\u002Freasoning-gym) |\n| 2025-05 | `Enigmata` | Enigmata：利用合成可验证谜题扩展大语言模型的逻辑推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19914) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBytedTsinghua-SIA\u002FEnigmata?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FBytedTsinghua-SIA\u002FEnigmata) |\n| 2025-02 | `AutoLogi` | AutoLogi：自动构建逻辑谜题，用于评估大语言模型的推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16906) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F8188zq\u002FAutoLogi?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002F8188zq\u002FAutoLogi) |\n| 2025-02 | `Logic-RL` | Logic-RL：基于规则的强化学习释放大模型的推理潜能 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14768) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUnakar\u002FLogic-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FUnakar\u002FLogic-RL) |\n\n#### 动态环境（代码类）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-06 | `AgentCPM-GUI` | 
AgentCPM-GUI：通过强化微调构建移动端可用的智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01391) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenBMB\u002FAgentCPM-GUI?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FAgentCPM-GUI) |\n| 2025-06 | `MedAgentGym` | MedAgentGym：大规模训练基于代码的医学推理大模型智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04405) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwshi83\u002FMedAgentGym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwshi83\u002FMedAgentGym) |\n| 2025-05 | `MLE-Dojo` | MLE-Dojo：赋能机器学习工程领域大模型智能体的交互式环境 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07782) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMLE-Dojo\u002FMLE-Dojo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMLE-Dojo\u002FMLE-Dojo) |\n| 2025-05 | `SWE-rebench` | SWE-rebench：软件工程智能体的任务收集与去污评估自动化流水线 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20411) | - |\n| 2025-05 | `ZeroGUI` | ZeroGUI：零人力成本的在线 GUI 自动化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23762) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FZeroGUI?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FZeroGUI) |\n| 2025-04 | `R2E-Gym` | R2E-Gym：程序化环境生成与混合验证器，助力开放权重软件工程智能体规模化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07164) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FR2E-Gym\u002FR2E-Gym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FR2E-Gym\u002FR2E-Gym) |\n| 2025-03 | `ReSearch` | ReSearch：通过强化学习让大模型学会搜索式推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.19470) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAgent-RL\u002FReCall?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAgent-RL\u002FReCall) |\n| 2025-02 | `MLGym` | MLGym：推进 AI 研究智能体的新框架与基准 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14499) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002FMLGym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FMLGym) |\n| 2024-07 | `AppWorld` | AppWorld：用于评测交互式编程智能体的可控应用与人物世界 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.18901) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FStonyBrookNLP\u002Fappworld?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FStonyBrookNLP\u002Fappworld) |\n\n#### 动态环境（游戏类）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `PuzzleJAX` | PuzzleJAX：推理与学习的基准测试 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16821) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsmearle\u002Fscript-doctor?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsmearle\u002Fscript-doctor) |\n| 2025-06 | `Play to Generalize` | Play to Generalize：通过游戏玩乐学习推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08011) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyunfeixie233\u002FViGaL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyunfeixie233\u002FViGaL) |\n| 2025-06 | `Optimus-3` | Optimus-3：迈向具有可扩展任务专家的通用多模态 Minecraft 智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10357) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJiuTian-VL\u002FOptimus-3?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FJiuTian-VL\u002FOptimus-3) |\n| 2025-05 | `lmgame-Bench` | lmgame-Bench：大语言模型在玩游戏方面表现如何？ | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15146) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flmgame-org\u002FGamingAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flmgame-org\u002FGamingAgent) |\n| 2025-05 | `G1` | G1：通过强化学习自举视觉-语言模型的感知与推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.13426) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fchenllliang\u002FG1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fchenllliang\u002FG1) |\n| 2025-05 | `Code2Logic` | Code2Logic：基于游戏代码的数据合成以增强 VLM 的通用推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.13886) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftongjingqi\u002Fcode2logic?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ftongjingqi\u002Fcode2logic) |\n| 2025-05 | `KORGym` | KORGym：用于 LLM 推理评估的动态游戏平台 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14552) | [![GitHub 
星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmultimodal-art-projection\u002FKORGym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmultimodal-art-projection\u002FKORGym) |\n| 2025-04 | `Cross-env-coop` | 跨环境合作实现零样本多智能体协作 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.12714) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKJha02\u002FcrossEnvCooperation?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKJha02\u002FcrossEnvCooperation) |\n| 2022-03 | `ScienceWorld` | ScienceWorld：你的智能体比五年级学生更聪明吗？ | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.07540) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002FScienceWorld?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fallenai\u002FScienceWorld) |\n| 2020-10 | `ALFWorld` | ALFWorld：对齐文本与具身环境以实现交互式学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.03768) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falfworld\u002Falfworld?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Falfworld\u002Falfworld) |\n\n#### 动态环境（基于模型）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-06 | `SwS` | SwS：强化学习中基于自我意识的弱点驱动问题合成，用于 LLM 推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08989) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMasterVito\u002FSwS?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMasterVito\u002FSwS) |\n| 2025-06 | `SPIRAL` | SPIRAL：零和博弈中的自我博弈通过多智能体多回合强化学习激励推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.24119) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fspiral-rl\u002Fspiral?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fspiral-rl\u002Fspiral) |\n| 2025-05 | `Absolute Zero` | Absolute Zero：无数据条件下的强化自我博弈推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.03335) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLeapLabTHU\u002FAbsolute-Zero-Reasoner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLeapLabTHU\u002FAbsolute-Zero-Reasoner) |\n| 2025-04 | `TextArena` | TextArena | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11442) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLeonGuertler\u002FTextArena?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLeonGuertler\u002FTextArena) |\n| 2025-03 | `SWEET-RL` | SWEET-RL：训练多回合 LLM 智能体进行协作式推理任务 
| [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.15478) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Fsweet_rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsweet_rl) |\n| - | `Genie 3` | Genie 3：世界模型的新前沿 | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fdeepmind.google\u002Fdiscover\u002Fblog\u002Fgenie-3-a-new-frontier-for-world-models\u002F) | - |\n\n#### 动态环境（基于集成）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `InternBootcamp` | InternBootcamp 技术报告：通过可验证的任务扩展提升大模型推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.08636) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternBootcamp?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternBootcamp) |\n| - | `SYNTHETIC-2` | SYNTHETIC-2 发布：四百万条协作生成的推理轨迹 | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fwww.primeintellect.ai\u002Fblog\u002Fsynthetic-2-release#synthetic-2-dataset) | - |\n\n#### 强化学习基础设施（主要）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-06 | `ROLL` | 大规模学习的强化学习优化：高效且易用的扩展库 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06122) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falibaba\u002FROLL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL) |\n| 2025-05 | `AReaL` | AReaL：面向语言推理的大规模异步强化学习系统 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24298) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinclusionAI\u002FAReaL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FAReaL) |\n| 2024-09 | `veRL` | HybridFlow：灵活高效的 RLHF 框架 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.19256v2) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvolcengine\u002Fverl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) |\n| 2024-05 | `OpenRLHF` | OpenRLHF：易用、可扩展且高性能的 RLHF 框架 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.11143) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenRLHF\u002FOpenRLHF?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF) |\n| - | `TRL` | Transformer 强化学习 | - | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuggingface\u002Ftrl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl) |\n| - | `NeMo-RL` | 
Nemo RL：可扩展且高效的后训练库 | - | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA-NeMo\u002FRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FRL) |\n| - | `slime` | slime：基于 SGLang 的原生后训练框架，用于强化学习扩展 | - | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002Fslime?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime) |\n| - | `RLinf` | RLinf：面向智能体 AI 的强化学习基础设施 | - | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRLinf\u002FRLinf?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRLinf\u002FRLinf) |\n\n#### 强化学习基础设施（次要）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `RL-Factory` | RLFactory：用于大语言模型多轮工具使用的即插即用强化学习后训练框架 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06980) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSimple-Efficient\u002FRL-Factory?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSimple-Efficient\u002FRL-Factory) |\n| 2025-09 | `verl-tool` | VerlTool：迈向具有工具使用的整体智能体强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01055) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTIGER-AI-Lab\u002Fverl-tool?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTIGER-AI-Lab\u002Fverl-tool) |\n| 2025-09 | `dLLM-RL` | 革新扩散型大语言模型的强化学习框架 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06949) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGen-Verse\u002FdLLM-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FdLLM-RL) |\n| 2025-08 | `agent-lightning` | Agent Lightning：使用强化学习训练任何AI智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.03680) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fagent-lightning?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fagent-lightning) |\n| 2025-05 | `verl-agent` | 用于大语言模型智能体训练的组内策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10978) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FlangfengQ\u002Fverl-agent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FlangfengQ\u002Fverl-agent) |\n| 2025-04 | `VLM-R1` | VLM-R1：稳定且可推广的R1风格大型视觉-语言模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07615) | [![GitHub 
星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fom-ai-lab\u002FVLM-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fom-ai-lab\u002FVLM-R1) |\n| - | `rllm` | rLLM：用于语言智能体后训练的框架 | - | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fagentica-project\u002Frllm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fagentica-project\u002Frllm) |\n| - | `EasyR1` | EasyR1：高效、可扩展的多模态强化学习训练框架 | - | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhiyouga\u002FEasyR1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FEasyR1) |\n| - | `verifiers` | Verifiers：在可验证环境中使用大语言模型进行强化学习 | - | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwillccbb\u002Fverifiers?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwillccbb\u002Fverifiers) |\n| - | `prime-rl` | PRIME-RL：大规模去中心化强化学习训练 | - | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPrimeIntellect-ai\u002Fprime-rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPrimeIntellect-ai\u002Fprime-rl) |\n| - | `MARTI` | 基于大语言模型的多智能体强化训练与推理框架 | - | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FMARTI?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FMARTI) |
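\n\n上述框架的训练入口大同小异：给定 prompt 数据集和一个奖励函数，即可启动策略优化。以表中的 TRL 为例，下面是基于其公开 GRPOTrainer 接口的最小调用示意（模型、数据集与占位奖励均为示例选择，并非某篇论文的官方配置）：\n\n```python\nfrom datasets import load_dataset\nfrom trl import GRPOConfig, GRPOTrainer\n\n# 占位奖励：补全越短分越高；实际训练中应替换为可验证奖励（见下文的代码奖励示例）\ndef reward_len(completions, **kwargs):\n    return [-float(len(c)) for c in completions]\n\ndataset = load_dataset('trl-lib\u002Ftldr', split='train')\ntrainer = GRPOTrainer(\n    model='Qwen\u002FQwen2-0.5B-Instruct',\n    reward_funcs=reward_len,\n    args=GRPOConfig(output_dir='grpo-demo'),\n    train_dataset=dataset,\n)\ntrainer.train()\n```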
\n\n\n\n### 应用\n#### 编码智能体\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | - | 面向机器学习工程智能体的强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01684) | - |\n| 2025-09 | - | 利用强化学习提升 SLM 工具使用能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.04518) | - |\n| 2025-09 | `SimpleTIR` | SimpleTIR：面向多轮工具集成推理的端到端强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.02479) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fltzheng\u002FSimpleTIR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fltzheng\u002FSimpleTIR) |\n| 2025-09 | - | 收益递减的错觉：LLM 中的长程执行测量 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.09677) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flong-horizon-execution\u002Fmeasuring-execution?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flong-horizon-execution\u002Fmeasuring-execution) |\n| 2025-08 | `GLM-4.5` | GLM-4.5：智能体、推理与编码（ARC）基础模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.06471) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzai-org\u002FGLM-4.5?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-4.5) |\n| 2025-08 | `FormaRL` | FormaRL：无需标注数据的自动形式化增强 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18914) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUNLP-MT\u002FFormaRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTHUNLP-MT\u002FFormaRL) |\n| 2025-08 | `RLTR` | 无需良好答案也能鼓励良好过程：面向 LLM 智能体规划的强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.19598) | - |\n| 2025-07 | `ARPO` | 智能体式强化策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.19849) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdongguanting\u002FARPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdongguanting\u002FARPO) |\n| 2025-07 | `Kimi K2` | Kimi K2：开放智能体智能 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.20534) | - |\n| 2025-07 | `AutoTIR` | AutoTIR：通过强化学习实现自主工具集成推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21836) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fweiyifan1023\u002FAutoTIR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fweiyifan1023\u002FAutoTIR) |\n| 2025-06 | `CoRT` | CoRT：思维中的代码集成推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.09820) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChengpengLi1003\u002FCoRT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChengpengLi1003\u002FCoRT) |\n| 2025-05 | `EvoScale` | Satori-SWE：面向样本高效软件工程的进化式测试时缩放 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.23604) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsatori-reasoning\u002FSatori-SWE?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsatori-reasoning\u002FSatori-SWE) |\n| 2025-03 | `ToRL` | ToRL：工具集成强化学习的规模化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23383) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGAIR-NLP\u002FToRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FToRL) |\n| 2025-02 | `SWE-RL` | SWE-RL：基于开源软件进化的强化学习推进 LLM 推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.18449) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Fswe-rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fswe-rl) |\n| - | `Qwen3-Coder` | Qwen3-Coder：世界中的智能体编程 | - | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3-Coder?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-Coder) |
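\n\n上表的编码智能体与代码推理工作普遍共享同一个可验证奖励原语：候选代码能否通过单元测试。下面是一个极简示意（把“候选代码 + 断言”拼成程序放入子进程执行，全部通过记 1.0；仅为说明思路，未做沙箱与资源隔离，`code_reward` 为假设的函数名）：\n\n```python\nimport subprocess\nimport sys\n\ndef code_reward(completion, tests, timeout=5.0):\n    # 可验证奖励：候选代码与断言式单元测试拼成一个程序，在子进程中执行\n    program = completion + chr(10) + tests\n    try:\n        result = subprocess.run([sys.executable, '-c', program], capture_output=True, timeout=timeout)\n    except subprocess.TimeoutExpired:\n        return 0.0  # 超时视为未通过\n    return 1.0 if result.returncode == 0 else 0.0\n\nprint(code_reward('def add(a, b):' + chr(10) + '    return a + b', 'assert add(1, 2) == 3'))  # 输出 1.0\n```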
\n\n#### 搜索智能体\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `SSRL` | SSRL：自搜索强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10874) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FSSRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FSSRL) |\n| 2025-07 | `WebSailor` | WebSailor：面向网络智能体的超人类推理导航 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.02592) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FWebAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FWebAgent) |\n| 2025-07 | `WebShaper` | WebShaper：通过信息搜索形式化实现智能体式数据合成 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.15061) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FWebAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FWebAgent) |\n| 2025-05 | `ZeroSearch` | ZeroSearch：无需搜索即可激励大语言模型的搜索能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.04588) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FZeroSearch?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FZeroSearch) |\n| 2025-05 | `SEM` | SEM：用于搜索高效大型语言模型的强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07903) | - |\n| 2025-05 | `S3` | s3：通过强化学习训练搜索智能体并不需要那么多数据 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14146) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpat-jj\u002Fs3?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fpat-jj\u002Fs3) |\n| 2025-05 | `StepSearch` | StepSearch：通过分步近端策略优化激发大语言模型的搜索能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15107) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZillwang\u002FStepSearch?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FZillwang\u002FStepSearch) |\n| 2025-05 | `R1-Searcher++` | R1-Searcher++：通过强化学习激励大语言模型的动态知识获取能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17005) | [![GitHub 
星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FR1-Searcher-plus?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FR1-Searcher-plus) |\n| 2025-04 | `ReZero` | ReZero：通过再试一次来提升大语言模型的搜索能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11001) | - |\n| 2025-03 | `DeepRetrieval` | DeepRetrieval：利用强化学习，通过大语言模型攻破真实的搜索引擎和检索系统 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00223) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpat-jj\u002FDeepRetrieval?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fpat-jj\u002FDeepRetrieval) |\n| 2025-03 | `Search-R1` | Search-R1：通过强化学习训练大语言模型进行推理并利用搜索引擎 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09516) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPeterGriffinJin\u002FSearch-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPeterGriffinJin\u002FSearch-R1) |\n| 2025-03 | `R1-Searcher` | R1-Searcher：通过强化学习激励大语言模型的搜索能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.05592) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FR1-Searcher?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FR1-Searcher) |\n\n\n#### 浏览器使用型智能体\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-05 | `WebAgent-R1` | WebAgent-R1：通过端到端多轮强化学习训练网络智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16421) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fweizhepei\u002FWebAgent-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fweizhepei\u002FWebAgent-R1) |\n| 2025-05 | `WebDancer` | WebDancer：迈向自主信息搜索智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22648) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FWebAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FWebAgent) |\n| 2025-04 | `DeepResearcher` | DeepResearcher：在真实环境中通过强化学习扩展深度研究 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.03160) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGAIR-NLP\u002FDeepResearcher?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FDeepResearcher) |\n| 2024-11 | `Web-RL` | WebRL：通过自我进化在线课程强化学习训练大语言模型网络智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02337) | 
[![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FWebRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FWebRL) |\n| 2021-12 | `WebGPT` | WebGPT：借助浏览器与人类反馈的问答系统 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.09332) | - |\n\n#### 深度研究型智能体\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `SFR-DeepResearch` | SFR-DeepResearch：面向自主推理单智能体的有效强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.06283) | - |\n| 2025-09 | `DeepDive` | DeepDive：利用知识图谱与多轮强化学习推进深度搜索智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.10446) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FDeepDive?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FDeepDive) |\n| 2025-08 | `WebWatcher` | WebWatcher：突破视觉-语言深度研究智能体的新边界 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.05748) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FWebAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FWebAgent) |\n| 2025-08 | `ASearcher` | 超越十轮：通过大规模异步强化学习解锁长程智能体式搜索 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.07976) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinclusionAI\u002FASearcher?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FASearcher) |\n| 2025-08 | `Atom-Searcher` | Atom-Searcher：通过细粒度原子思维奖励提升智能体式深度研究能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.12800) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fantgroup\u002FResearch-Venus?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fantgroup\u002FResearch-Venus) |\n| 2025-08 | `MedResearcher-R1` | MedResearcher-R1：基于知识驱动轨迹合成框架的专家级医学深度研究者 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.14880) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAQ-MedAI\u002FMedResearcher-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAQ-MedAI\u002FMedResearcher-R1) |\n| 2025-06 | `Jan-nano` | Jan-nano 技术报告 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.22760) | - |\n| 2025-04 | `WebThinker` | WebThinker：以深度研究能力赋能大型推理模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.21776) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsunnynexus\u002FWebThinker?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsunnynexus\u002FWebThinker) |\n| - | `Kimi-Researcher` | Kimi-Researcher：用于新兴智能体能力的端到端强化学习训练 | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fmoonshotai.github.io\u002FKimi-Researcher\u002F) | - |\n| - | `MiroThinker` | MiroThinker：一个开源的智能体模型系列，专为深度研究及复杂长程问题解决而训练 | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fmiromind.ai\u002Fblog\u002Fmiromind-open-deep-research) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiroMindAI\u002FMiroThinker?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMiroMindAI\u002FMiroThinker) |
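\n\n从上文的 Search-R1 到这里的深度研究型智能体，多轮 rollout 的基本回路是一致的：模型发出查询、环境回填检索结果、循环往复直至给出最终答案。下面是该回路的极简示意（`<search>`、`<information>`、`<answer>` 的标签约定参考 Search-R1 一类工作；`llm` 与 `search` 为假设的可调用对象）：\n\n```python\nimport re\n\ndef run_search_agent(llm, search, question, max_turns=10):\n    # 多轮搜索智能体 rollout：模型用 <search> 标签请求检索，环境以 <information> 回填\n    context = question\n    for _ in range(max_turns):\n        output = llm(context)\n        context += output\n        answer = re.search('<answer>(.*?)<\u002Fanswer>', output, re.S)\n        if answer:\n            return answer.group(1).strip()  # 命中最终答案，结束回合\n        query = re.search('<search>(.*?)<\u002Fsearch>', output, re.S)\n        if query:\n            context += '<information>' + search(query.group(1).strip()) + '<\u002Finformation>'\n    return None  # 超出最大轮数仍未作答\n```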
\n\n#### GUI 与计算机智能体\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `UI-TARS 2` | UI-TARS-2 技术报告：通过多轮强化学习推进 GUI 智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02544) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FUI-TARS-desktop?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FUI-TARS-desktop) |\n| 2025-08 | `GUI-RC` | 基于区域一致性的 GUI 定位测试时强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.05615) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzju-real\u002Fgui-rcpo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzju-real\u002Fgui-rcpo) |\n| 2025-08 | `OS-R1` | OS-R1：利用强化学习进行智能体式操作系统内核调优 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12551) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLHY-24\u002FOS-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLHY-24\u002FOS-R1) |\n| 2025-08 | `ComputerRL` | ComputerRL：面向计算机使用智能体的端到端在线强化学习规模化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.14040) | - |\n| 2025-08 | `Mobile-Agent-v3` | Mobile-Agent-v3：用于 GUI 自动化的基础智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.15144) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FMobileAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FMobileAgent) |\n| 2025-08 | `SWIRL` | SWIRL：移动 GUI 控制中交错强化学习的分阶段工作流 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.20018) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLqf-HFNJU\u002FSWIRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLqf-HFNJU\u002FSWIRL) |\n| 2025-08 | `InquireMobile` | InquireMobile：通过强化微调教导基于 VLM 的移动智能体请求人类协助 | 
\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.19679) | - |\n| 2025-07 | `MobileGUI-RL` | MobileGUI-RL：通过在线环境中的强化学习推进移动 GUI 智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.05720) | - |\n| 2025-06 | `GUI-Critic-R1` | 三思而后行：用于 GUI 自动化事前错误诊断的 GUI-Critic-R1 模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04614) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FMobileAgent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FMobileAgent) |\n| 2025-06 | `GUI-Reflection` | GUI-Reflection：以自我反思行为赋能多模态 GUI 模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08012) | - |\n| 2025-06 | `Mobile-R1` | Mobile-R1：通过任务级奖励实现基于 VLM 的移动智能体的交互式强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.20332) | - |\n| 2025-05 | `UIShift` | UIShift：通过自监督强化学习提升基于 VLM 的 GUI 智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12493) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUbiquitousLearning\u002FUIShift?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FUbiquitousLearning\u002FUIShift) |\n| 2025-05 | `GUI-G1` | GUI-G1：理解 GUI 智能体中类似 R1-Zero 的视觉定位训练 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15810) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuqi-Zhou\u002FGUI-G1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FYuqi-Zhou\u002FGUI-G1) |\n| 2025-05 | `ARPO` | ARPO：具有经验回放功能的 GUI 智能体端到端策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16282) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FARPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FARPO) |\n| 2025-05 | `ZeroGUI` | ZeroGUI：以零人力成本实现在线 GUI 学习自动化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23762) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FZeroGUI?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FZeroGUI) |\n| 2025-04 | `GUI-R1` | GUI-R1：一种适用于 GUI 智能体的通用 R1 式视觉-语言行动模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10458) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io
\u002Fgithub\u002Fstars\u002Fritzz-ai\u002FGUI-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fritzz-ai\u002FGUI-R1) |\n| 2025-03 | `UI-R1` | UI-R1：通过强化学习提升 GUI 智能体的高效动作预测 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.21620) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flll6gg\u002FUI-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flll6gg\u002FUI-R1) |\n| 2025-01 | `UI-TARS` | UI-TARS：以原生智能体开创 GUI 自动化交互 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12326) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002Fui-tars?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Fui-tars) |\n\n#### 推荐智能体\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-07 | `Shop-R1` | Shop-R1：通过强化学习奖励大模型模拟在线购物中的人类行为 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.17842) | - |\n| 2025-03 | `Rec-R1` | Rec-R1：通过强化学习连接大模型与推荐系统 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24289) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flinjc16\u002FRec-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flinjc16\u002FRec-R1) |\n\n#### 智能体（其他）\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-07 | `OpenTable-R1` | OpenTable-R1：用于开放域表格问答的强化学习增强工具智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.03018) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTabibitoQZP\u002FOpenTableR1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTabibitoQZP\u002FOpenTableR1) |\n| 2025-07 | `LaViPlan` | LaViPlan：基于 RLVR 的语言引导视觉路径规划 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.12911) | - |\n| 2025-06 | `Drive-R1` | Drive-R1：通过强化学习在 VLM 中连接推理与规划以实现自动驾驶 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.18234) | - |\n| - | `EPO` | EPO：面向大模型智能体强化学习的熵正则化策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.22576) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWujiangXu\u002FEPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FWujiangXu\u002FEPO) |\n\n#### 代码生成\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `Proof2Silicon` | Proof2Silicon：通过强化学习进行提示修复，用于已验证的代码与硬件生成 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.06239) | - |\n| 2025-09 | `AR$^2$` | AR$^2$：面向大型语言模型抽象推理的对抗性强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.03537) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhhhuang\u002FARAR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhhhuang\u002FARAR) |\n| 2025-09 | `Dream-Coder` | Dream-Coder 7B：开源的代码扩散语言模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.01142) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDreamLM\u002FDream-Coder?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FDreamLM\u002FDream-Coder) |\n| 2025-08 | `MSRL` | 打破SFT平台期：用于图表到代码生成的多模态结构化强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.13587) | - |\n| 2025-07 | `CogniSQL-R1-Zero` | CogniSQL-R1-Zero：用于高效SQL生成的轻量级强化推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.06013) | - |\n| 2025-07 | `Leanabell-Prover-V2` | Leanabell-Prover-V2：通过强化学习实现的验证器集成推理，用于形式化定理证明 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.08649) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLeanabell-LM\u002FLeanabell-Prover-V2?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLeanabell-LM\u002FLeanabell-Prover-V2) |\n| 2025-07 | `StepFun-Prover` | StepFun-Prover预览版：让我们一步步思考并验证 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.20199) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fstepfun-ai\u002FStepFun-Prover-Preview?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStepFun-Prover-Preview) |\n| 2025-06 | `MedAgentGym` | MedAgentGym：大规模训练基于代码的医学推理LLM智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.04405) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwshi83\u002FMedAgentGym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwshi83\u002FMedAgentGym) |\n| 2025-05 | `Fortune` | Fortune：公式驱动的强化学习，用于语言模型中的符号表推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23667) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Ffortune?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ffortune) |\n| 2025-05 | `VeriReason` | VeriReason：利用测试平台反馈的强化学习，用于增强推理能力的Verilog生成 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.11849) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNellyW8\u002FVeriReason?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNellyW8\u002FVeriReason) |\n| 2025-05 | `ReEx-SQL` | ReEx-SQL：通过执行感知的强化学习实现文本到SQL的推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12768) | - |\n| 2025-05 | `AceReason-Nemotron` | AceReason-Nemotron：通过强化学习推进数学与代码推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.16400) | - |\n| 2025-05 | `Skywork-OR1` | Skywork 开放推理器 1 技术报告 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.22312) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-OR1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-OR1) |\n| 2025-05 | `CodeV-R1` | CodeV-R1：增强推理能力的Verilog生成 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.24183) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fiprc-dip\u002FCodeV-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fiprc-dip\u002FCodeV-R1) |\n| 2025-05 | `AReaL` | AReaL：面向语言推理的大规模异步强化学习系统 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.24298) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinclusionAI\u002FAReaL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FAReaL) |\n| 2025-04 | `SQL-R1` | SQL-R1：通过强化学习训练自然语言到SQL的推理模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.08600) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDataArcTech\u002FSQL-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FDataArcTech\u002FSQL-R1) |\n| 2025-04 | `Kimina-Prover` | Kimina-Prover预览版：迈向使用强化学习的大型形式化推理模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.11354) | - |\n| 2025-04 | `DeepSeek-Prover-V2` | DeepSeek-Prover-V2：通过强化学习进行子目标分解，推进形式数学推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.21801) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-Prover-V2?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-Prover-V2) |\n| 2025-03 | `Reasoning-SQL` | Reasoning-SQL：采用SQL定制的部分奖励的强化学习，用于增强文本到SQL的推理能力 | [![论文](https:\u002F\u002Fimg.shields.io
\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.23157) | - |\n| - | `code-r1` | Code-R1：以可靠奖励在代码任务上复现 R1 | - | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fganler\u002Fcode-r1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fganler\u002Fcode-r1) |\n| - | `Open-R1` | Open-R1：DeepSeek-R1 的完全开源复现 | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fopen-r1) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuggingface\u002Fopen-r1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fopen-r1) |\n| - | `DeepCoder` | DeepCoder：一个完全开源、达到 o3-mini 级别的 14B 代码模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fpretty-radio-b75.notion.site\u002FDeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fagentica-project\u002Frllm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fagentica-project\u002Frllm) |\n\n#### 软件工程\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `UTRL` | 通过对抗性强化学习学会生成单元测试 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.21107) | - |\n| 2025-07 | `RePaCA` | RePaCA：利用推理型大语言模型进行静态自动化补丁正确性评估 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.22580) | - |\n| 2025-07 | `Repair-R1` | Repair-R1：修复前先测试 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.22853) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTomsawyerhu\u002FAPR-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTomsawyerhu\u002FAPR-RL) |\n| 2025-06 | `CURE` | 通过强化学习协同进化LLM编码器和单元测试器 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.03136) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGen-Verse\u002FCURE?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FCURE) |\n| 2025-05 | `REAL` | 利用程序分析反馈训练语言模型生成高质量代码 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.22704) | - |\n| 2025-05 | `Afterburner` | Afterburner：强化学习助力自我改进的代码效率优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.23387) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FElfsong\u002FAfterburner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FElfsong\u002FAfterburner) |\n| 2024-09 | `RepoGenReflex` | RepoGenReflex：通过言语强化与检索增强生成提升仓库级代码补全能力 | [![论文](https:\u002F\u002Fimg.shields.io
\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.13122) | - |\n| 2024-07 | `RLCoder` | RLCoder：用于仓库级代码补全的强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.19487) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDeepSoftwareAnalytics\u002FRLCoder?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FDeepSoftwareAnalytics\u002FRLCoder) |\n\n#### 多模态理解\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `Vision-Zero` | Vision-Zero：通过策略性游戏化自对弈实现可扩展的多模态大模型自我改进 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25541) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwangqinsi1\u002FVision-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwangqinsi1\u002FVision-Zero) |\n| 2025-09 | `ReAd-R` | AdsQA：迈向广告视频理解 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.08621) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FAdsQA?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAdsQA) |\n| 2025-09 | `Keye` | Kwai Keye-VL 1.5 技术报告 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01563) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKwai-Keye\u002FKeye?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKwai-Keye\u002FKeye) |\n| 2025-08 | `SIFThinker` | SIFThinker：面向视觉推理的空间感知图像聚焦 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.06259) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhangquanchen\u002FSIFThinker?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzhangquanchen\u002FSIFThinker) |\n| 2025-07 | `Long-RL` | 将强化学习扩展到长视频 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.07966) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FLong-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FLong-RL) |\n| 2025-06 | `RefSpatial` | RoboRefer：面向机器人技术的视觉-语言模型中结合推理的空间指代 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04308) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZhoues\u002FRoboRefer?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FZhoues\u002FRoboRefer) |\n| 2025-06 | `Ego-R1` | Ego-R1：用于超长第一人称视频推理的工具链式思维 | [![论文](https:\u002F\u002Fimg.shields.io
\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13654) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fegolife-ai\u002FEgo-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fegolife-ai\u002FEgo-R1) |\n| 2025-05 | `VerIPO` | VerIPO：具有迭代策略优化的长推理视频-R1 模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19000) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHITsz-TMG\u002FVerIPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FHITsz-TMG\u002FVerIPO) |\n| 2025-05 | `OpenThinkIMG` | OpenThinkIMG：通过视觉工具强化学习学会用图像思考 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.08617) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhaochen0110\u002FOpenThinkIMG?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzhaochen0110\u002FOpenThinkIMG) |\n| 2025-05 | `Visual Planning` | Visual Planning：让我们只用图像思考 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11409) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyix8\u002FVisualPlanning?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyix8\u002FVisualPlanning) |\n| 2025-05 | `VideoRFT` | VideoRFT：通过强化微调激励多模态大模型的视频推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12434) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQiWang98\u002FVideoRFT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQiWang98\u002FVideoRFT) |\n| 2025-05 | `DeepEyes` | DeepEyes：通过强化学习激励“用图像思考” | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14362) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVisual-Agent\u002FDeepEyes?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FVisual-Agent\u002FDeepEyes) |\n| 2025-05 | `Visionary-R1` | Visionary-R1：利用强化学习缓解视觉推理中的捷径问题 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14677) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmaifoundations\u002FVisionary-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmaifoundations\u002FVisionary-R1) |\n| 2025-05 | `CoF` | 链式聚焦：基于强化学习的自适应多模态推理视觉搜索与缩放 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15436) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io
\u002Fgithub\u002Fstars\u002Fxtong-zhang\u002FChain-of-Focus?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fxtong-zhang\u002FChain-of-Focus) |\n| 2025-05 | `GRIT` | GRIT：教导多模态大模型用图像思考 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15879) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Feric-ai-lab\u002FGRIT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Feric-ai-lab\u002FGRIT) |\n| 2025-05 | `Pixel Reasoner` | 像素推理者：以好奇心驱动的强化学习激励像素空间推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15966) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTIGER-AI-Lab\u002FPixel-Reasoner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTIGER-AI-Lab\u002FPixel-Reasoner) |\n| 2025-05 | - | 不要只看一次：迈向带有选择性视觉重访的多模态交互式推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.18842) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjun297\u002Fv1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fjun297\u002Fv1) |\n| 2025-05 | `Ground-R1` | Ground-R1：通过强化学习激励有依据（grounded）的视觉推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20272) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzzzhhzzz\u002FGround-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzzzhhzzz\u002FGround-R1) |\n| 2025-05 | `TACO` | TACO：在多模态大模型中通过强化学习实现思考与回答的一致性，以优化长链条推理和高效数据学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20777) | - |\n| 2025-05 | `Qwen-LA` | Qwen 再次关注：引导视觉-语言推理模型重新关注视觉信息 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23558) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLiar406\u002FLook_Again?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLiar406\u002FLook_Again) |\n| 2025-05 | `TW-GRPO` | 通过专注式思考强化视频推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24718) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flongmalongma\u002FTW-GRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flongmalongma\u002FTW-GRPO) |\n| 2025-05 | `Spatial-MLLM` | Spatial-MLLM：提升多模态大模型在基于视觉的空间智能方面的能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.23747) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io
\u002Fgithub\u002Fstars\u002Fdiankun-wu\u002FSpatial-MLLM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdiankun-wu\u002FSpatial-MLLM) |\n| 2025-04 | `R1-Zero-VSI` | 通过类似 R1-Zero 的训练改进视觉-空间推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00883) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhijie-group\u002FR1-Zero-VSI?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzhijie-group\u002FR1-Zero-VSI) |\n| 2025-04 | `SpaceR` | SpaceR：强化多模态大模型的视频空间推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.01805) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOuyangKun10\u002FSpaceR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOuyangKun10\u002FSpaceR) |\n| 2025-04 | `VideoChat-R1` | VideoChat-R1：通过强化微调提升时空感知能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.06958) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FVideoChat-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FVideoChat-R1) |\n| 2025-04 | `VLM-R1` | VLM-R1：一种稳定且可泛化的 R1 式大型视觉-语言模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07615) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fom-ai-lab\u002FVLM-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fom-ai-lab\u002FVLM-R1) |\n| 2025-03 | `OpenVLThinker` | OpenVLThinker：通过迭代 SFT-RL 循环实现复杂的视觉-语言推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.17352) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyihedeng9\u002FOpenVLThinker?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyihedeng9\u002FOpenVLThinker) |\n| 2025-03 | `Visual-RFT` | Visual-RFT：视觉强化微调 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01785) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLiuziyu77\u002FVisual-RFT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT) |\n| 2025-03 | `Vision-R1` | Vision-R1：激励多模态大语言模型的推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.06749) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOsilly\u002FVision-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOsilly\u002FVision-R1) |\n| 2025-03 | `VisRL` | VisRL：基于强化推理的意图驱动视觉感知 | [![论文](https:\u002F\u002Fimg.shields.io
\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07523) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhangquanchen\u002FVisRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzhangquanchen\u002FVisRL) |\n| 2025-03 | `MetaSpatial` | MetaSpatial：为元宇宙中的视觉-语言模型强化 3D 空间推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18470) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPzySeere\u002FMetaSpatial?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPzySeere\u002FMetaSpatial) |\n| 2025-03 | `Video-R1` | Video-R1：强化多模态大模型的视频推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.21776) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftulerfeng\u002FVideo-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ftulerfeng\u002FVideo-R1) |\n\n#### 多模态生成\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `IGPO` | 基于修复引导的扩散大语言模型策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.10396) | - |\n| 2025-08 | `Qwen-Image` | Qwen-Image 技术报告 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02324) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen-Image?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-Image) |\n| 2025-08 | `TempFlow-GRPO` | TempFlow-GRPO：流模型中时间对 GRPO 的重要性 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.04324) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FShredded-Pork\u002FTempFlow-GRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FShredded-Pork\u002FTempFlow-GRPO) |\n| 2025-07 | `MixGRPO` | MixGRPO：通过混合 ODE-SDE 解锁基于流的 GRPO 效率 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21802) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencent-Hunyuan\u002FMixGRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTencent-Hunyuan\u002FMixGRPO) |\n| 2025-06 | `FocusDiff` | FocusDiff：通过强化学习推进自回归视觉生成中的细粒度文本-图像对齐 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05501) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwendell0218\u002FFocusDiff?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwendell0218\u002FFocusDiff) |\n| 2025-06 | `SUDER` | 通过双重自我奖励强化多模态理解和生成 | [![论文](https:\u002F\u002Fimg.shields.io
\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07963) | - |\n| 2025-05 | `T2I-R1` | T2I-R1：通过语义级和标记级协同思维链强化图像生成 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00703) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCaraJ7\u002FT2I-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FCaraJ7\u002FT2I-R1) |\n| 2025-05 | `Flow-GRPO` | Flow-GRPO：通过在线强化学习训练流匹配模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05470) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyifan123\u002Fflow_grpo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo) |\n| 2025-05 | `DanceGRPO` | DanceGRPO：释放 GRPO 在视觉生成中的潜力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07818) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FXueZeyue\u002FDanceGRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FXueZeyue\u002FDanceGRPO) |\n| 2025-05 | `GoT-R1` | GoT-R1：利用强化学习释放多模态大语言模型在视觉生成中的推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17022) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgogoduan\u002FGoT-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fgogoduan\u002FGoT-R1) |\n| 2025-05 | `ULM-R1` | 面向统一多模态理解和生成的协同强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17534) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmm-vl\u002FULM-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmm-vl\u002FULM-R1) |\n| 2025-05 | `RePrompt` | RePrompt：通过强化学习实现面向文本到图像生成的推理增强型重提示 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17540) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FDKI_LLM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDKI_LLM) |\n| 2025-05 | `InfLVG` | InfLVG：通过 GRPO 强化推理时一致性的长视频生成 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17574) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMAPLE-AIGC\u002FInfLVG?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMAPLE-AIGC\u002FInfLVG) |\n| 2025-05 | `ReasonGen-R1` | ReasonGen-R1：通过 SFT 和 RL 为自回归图像生成模型提供思维链 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24875) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io
\u002Fgithub\u002Fstars\u002FFranklin-Zhang0\u002FReasonGen-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FFranklin-Zhang0\u002FReasonGen-R1) |\n| 2025-04 | `PhysAR` | 利用扩散时间步 token，通过强化学习实现物理上合理的视频生成 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15932) | - |\n\n#### 机器人任务\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `SimpleVLA-RL` | SimpleVLA-RL：通过强化学习扩展VLA训练 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.09674) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FSimpleVLA-RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FSimpleVLA-RL) |\n| 2025-06 | `TGRPO` | TGRPO：基于轨迹级分组相对策略优化的视觉-语言-动作模型微调 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08440) | - |\n| 2025-05 | `ReinboT` | ReinboT：利用强化学习增强机器人视觉-语言操控能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07395) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCOST-97\u002FreinboT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FCOST-97\u002FreinboT) |\n| 2025-05 | `RIPT-VLA` | 视觉-语言-动作模型的交互式后训练 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.17016) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAriostgx\u002Fript-vla?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAriostgx\u002Fript-vla) |\n| 2025-05 | `VLA-RL` | VLA-RL：通过可扩展的强化学习实现精通且通用的机器人操控 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.18719) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGuanxingLu\u002Fvlarl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGuanxingLu\u002Fvlarl) |\n| 2025-05 | `RFTF` | RFTF：面向具身智能体的时序反馈强化微调 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19767) | - |\n| 2025-05 | `RL4VLA` | 强化学习能为 VLA 泛化带来什么？一项实证研究 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19789) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgen-robot\u002FRL4VLA?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fgen-robot\u002FRL4VLA) |\n| 2025-02 | `ConRFT` | ConRFT：基于一致性策略的VLA模型强化微调方法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05450) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io
\u002Fgithub\u002Fstars\u002Fcccedric\u002Fconrft?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcccedric\u002Fconrft) |\n| 2024-11 | `GRAPE` | GRAPE：通过偏好对齐泛化机器人策略 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.19309) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Faiming-lab\u002Fgrape?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Faiming-lab\u002Fgrape) |\n| - | `RLinf` | RLinf：面向智能体 AI 的强化学习基础设施 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Frlinf.readthedocs.io\u002Fen\u002Flatest\u002F) | - |\n| - | `EPO` | EPO：面向大模型智能体强化学习的熵正则化策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.22576) | [![GitHub Star数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWujiangXu\u002FEPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FWujiangXu\u002FEPO) |\n\n#### 多智能体系统\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-10 | `AgentFlow` | 流式智能体系统优化：用于高效规划与工具使用的框架 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2510.05592) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flupantech\u002FAgentFlow?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flupantech\u002FAgentFlow) |\n| 2025-09 | `SoftRankPO` | 学会深思熟虑：基于多智能体强化学习的元策略协作，赋能智能体LLM | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03817) | - |\n| 2025-09 | `BFS-Prover-V2` | 扩展多轮离线强化学习与多智能体树搜索技术，应用于LLM步骤证明器 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06493) | - |\n| 2025-08 | `MAGRPO` | 基于多智能体强化学习的 LLM 协作 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.04652) | - |\n| 2025-06 | `AlphaEvolve` | AlphaEvolve：面向科学与算法发现的编程智能体 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.13131) | - |\n| 2025-06 | `JoyAgents-R1` | JoyAgents-R1：结合强化学习的多功能多LLM智能体联合进化动力学 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.19846) | - |\n| 2025-03 | `ReMA` | ReMA：利用多智能体强化学习让LLM学会元思考 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.09501) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fziyuwan\u002FReMA-public?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fziyuwan\u002FReMA-public) |\n| 2025-02 | `CTRL` | 通过强化学习教导语言模型进行批判性评估 | [![论文](https:\u002F\u002Fimg.shields.io
\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.03492) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHKUNLP\u002Fcritic-rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FHKUNLP\u002Fcritic-rl) |\n| 2025-02 | `MAPoRL` | MAPoRL：基于强化学习的大型语言模型协同后训练 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.18439) | [![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fchanwoo-park-official\u002FMAPoRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fchanwoo-park-official\u002FMAPoRL) |\n| 2023-11 | `LLaMAC` | 基于大型语言模型的智能体在大规模决策中的控制：行动者-批评家方法 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.13884) | - |\n\n#### 科学任务\n\n| 日期 | 名称 | 标题 | 论文 | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `Baichuan-M2` | 百川-M2：基于大型验证器系统的医疗能力扩展 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02208) | - |\n| 2025-08 | `CX-Mind` | CX-Mind：通过课程引导的强化学习实现胸片交错推理的开创性多模态大语言模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.03733) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWenjieLisjtu\u002FCX-Mind?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FWenjieLisjtu\u002FCX-Mind) |\n| 2025-08 | `MORE-CLEAR` | MORE-CLEAR：利用增强状态表示的临床笔记多模态离线强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.07681) | - |\n| 2025-08 | `ARMed` | 突破奖励崩溃：具有增强语义区分能力的开放式医疗推理自适应强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12957) | - |\n| 2025-08 | `ProMed` | ProMed：基于夏普利信息增益指导的主动式医疗LLM强化学习 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.13514) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhxxding\u002FProMed?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhxxding\u002FProMed) |\n| 2025-08 | `OwkinZero` | OwkinZero：用AI加速生物发现 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16315) | - |\n| 2025-08 | `MolReasoner` | MolReasoner：迈向分子LLM的有效且可解释的推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.02066) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F545487677\u002FMolReasoner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002F545487677\u002FMolReasoner) |\n| 2025-08 | `MedGR$^2$` | MedGR$^2$：通过生成式奖励学习突破医疗推理的数据壁垒 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.20549) | - |\n| 2025-07 | `MedGround-R1` | MedGround-R1：通过空间-语义奖励的群体相对策略优化推进医学图像定位 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.02994) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbio-mlhui\u002FMedGround-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fbio-mlhui\u002FMedGround-R1) |\n| 2025-07 | `MedGemma` | MedGemma技术报告 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.05201) | - |\n| 2025-06 | `MMedAgent-RL` | MMedAgent-RL：优化多智能体协作以实现多模态医疗推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.00555v2) | - |\n| 2025-06 | `Cell-o1` | Cell-o1：通过强化学习训练LLM解决单细胞推理难题 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02911) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fncbi-nlp\u002Fcell-o1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fncbi-nlp\u002Fcell-o1) |\n| 2025-06 | `MedAgentGym` | MedAgentGym：大规模训练基于代码的LLM代理进行医疗推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04405v1) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwshi83\u002FMedAgentGym?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwshi83\u002FMedAgentGym) |\n| 2025-06 | `Med-U1` | Med-U1：通过大规模强化学习激励LLM中的统一医疗推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.12307) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMonncyann\u002FMed-U1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMonncyann\u002FMed-U1) |\n| 2025-06 | `MedVIE` | 通过强化学习实现高效的医疗视觉-推理框架 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13363) | - |\n| 2025-06 | `LA-CDM` | 基于强化学习的语言代理用于假设驱动的临床决策制定 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13474) | - |\n| 2025-06 | `ether0` | 训练用于化学科学推理的模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17238) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFuture-House\u002Fether0?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FFuture-House\u002Fether0) |\n| 2025-06 | `Gazal-R1` | Gazal-R1：通过参数高效的两阶段训练实现最先进的医疗推理 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.21594) | - |\n| 2025-05 | `DRG-Sapphire` | LLM中分布外推理的强化学习：诊断相关分组编码的实证研究 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21908) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhanyin88\u002FDRG-Sapphire?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fhanyin88\u002FDRG-Sapphire) |\n| 2025-05 | `BioReason` | BioReason：在DNA-LLM模型中激励多模态生物推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23579) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbowang-lab\u002FBioReason?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fbowang-lab\u002FBioReason) |\n| 2025-05 | `EHRMIND` | 通过强化学习训练LLM完成基于电子健康记录的推理任务 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24105) | - |\n| 2025-04 | `Open-Medical-R1` | Open-Medical-R1：如何在医学领域选择RLVR训练数据 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.13950) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQsingle\u002Fopen-medical-r1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQsingle\u002Fopen-medical-r1) |\n| 2025-04 | `ChestX-Reasoner` | ChestX-Reasoner：通过逐步验证推进放射学基础模型 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20930v1) | - |\n| 2025-04 | `BoxMed-RL` | 像放射科医生一样思考：链式思维与强化学习用于可验证报告生成 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.18453) | - |\n| 2025-03 | `PPME` | 通过临床经验学习提升大型语言模型代理的交互式诊断能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.16463) | - |\n| 2025-03 | `DOLA` | 使用DOLA进行自主放疗计划：一种保护隐私的基于LLM的优化代理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.17553) | - |\n| 2025-02 | `Baichuan-M1` | 百川-M1：推动大语言模型的医疗能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12671) | - |\n| 2025-02 | `MedVLM-R1` | MedVLM-R1：通过强化学习激励视觉-语言模型（VLMs）的医疗推理能力 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19634) | - |\n| 2025-02 | `Med-RLVR` | Med-RLVR：通过强化学习从3B基础模型中涌现医疗推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19655) | - |\n| 2025-01 | `MedXpertQA` | MedXpertQA：专家级医疗推理与理解的基准测试 | 
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18362) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FMedXpertQA?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FMedXpertQA) |\n| 2024-12 | `HuatuoGPT-o1` | 华佗GPT-o1：迈向基于LLM的复杂医疗推理 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18925) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFreedomIntelligence\u002FHuatuoGPT-o1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FHuatuoGPT-o1) |\n| - | `Pro-1` | Pro-1 | [![博客](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fmichaelhla.com\u002Fblog\u002Fpro1.html) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmichaelhla\u002Fpro-1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmichaelhla\u002Fpro-1) |\n| - | `rbio` | rbio1：以生物世界模型作为软验证器训练科学推理LLM | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2025.08.18.670981v3) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fczi-ai\u002Frbio?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fczi-ai\u002Frbio) |\n| - | `EPO` | EPO：面向大模型智能体强化学习的熵正则化策略优化 | [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.22576) | [![GitHub Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWujiangXu\u002FEPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FWujiangXu\u002FEPO) |\n\n## 🌟 致谢\n\n本调查是在原始的 **Awesome RL Reasoning Recipes** 仓库基础上扩展和完善的。我们衷心感谢所有贡献者的辛勤付出，以及大家对 **Awesome RL Reasoning Recipes** 的持续关注。此前仓库的内容可在[这里](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-for-LRMs\u002Ftree\u002FTripleR)查看。\n\n\n## ✨ 星标历史\n\n[![星标历史图表](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTsinghuaC3I_Awesome-RL-for-LRMs_readme_fbe8f31f32d2.png)](https:\u002F\u002Fwww.star-history.com\u002F#TsinghuaC3I\u002FAwesome-RL-for-LRMs&Date)","# Awesome-RL-for-LRMs 快速上手指南\n\n**Awesome-RL-for-LRMs** 并非一个可直接安装运行的软件库或框架，而是一个由清华大学 C3I 团队维护的**开源论文与项目综述列表**。它旨在系统性地梳理“大型推理模型（Large Reasoning Models, LRMs）中的强化学习（RL）”领域的最新进展、前沿模型、奖励设计、策略优化算法及应用场景。\n\n本指南将帮助开发者快速利用该资源追踪技术动态、查找相关代码实现及阅读核心论文。\n\n## 环境准备\n\n由于本项目本质为资料索引，无需特定的系统环境或复杂的依赖安装。您只需具备以下基础条件即可开始使用：\n\n*   **操作系统**：Windows \u002F macOS \u002F Linux 均可。\n*   **网络环境**：\n    *   能够访问 **GitHub** 以浏览项目列表和源代码。\n    *   能够访问 **arXiv** 或 **Hugging Face Papers** 以阅读论文。\n    *   *国内用户建议*：若访问 GitHub 或 arXiv 较慢，可配置本地 hosts 或使用学术加速镜像（如 arXiv 国内镜像站）。\n*   **工具要求**：现代浏览器（推荐 Chrome 或 Edge）。\n\n## 获取与浏览步骤\n\n您无需通过包管理器安装该项目，直接通过以下方式获取内容：\n\n### 1. 访问在线仓库（推荐）\n直接在浏览器中打开项目主页，查看分类整理的论文列表和最新动态：\n```bash\n# 在浏览器地址栏输入以下网址\nhttps:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-Reasoning-Recipes\n```\n*(注：原 README 中提供的链接指向 `TsinghuaC3I\u002FAwesome-RL-Reasoning-Recipes`，请以此为准)*
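\n\n如果您希望以编程方式获取并筛选列表内容（例如按关键词检索论文条目），也可以直接拉取仓库的原始 README 文本。下面给出一段极简的 Python 示意代码（仅演示思路：其中默认分支名 `main`、文件名 `README.md` 以及表格列顺序均为假设，请以仓库实际情况为准）：\n```python\n# 示意：抓取原始 README，并按关键词粗略筛选论文表格行\n# 注意：分支名 main 与文件名 README.md 为假设，请按实际情况调整\nimport urllib.request\n\nRAW_URL = ('https:\u002F\u002Fraw.githubusercontent.com\u002F'\n           'TsinghuaC3I\u002FAwesome-RL-Reasoning-Recipes\u002Fmain\u002FREADME.md')\n\nwith urllib.request.urlopen(RAW_URL) as resp:\n    readme = resp.read().decode('utf-8')\n\nkeyword = 'GRPO'  # 检索关键词，可自行替换\nfor line in readme.splitlines():\n    # 论文条目均为 Markdown 表格行，形如 | 日期 | `名称` | 标题 | ... |\n    if line.startswith('|') and keyword.lower() in line.lower():\n        cells = [c.strip() for c in line.strip('|').split('|')]\n        if len(cells) >= 3:\n            print(cells[0], cells[1], cells[2], sep='  ')\n```\n运行后即可在终端看到匹配行的日期、名称与标题，便于先定位条目、再点开相应的论文或仓库链接。\n\n### 2. 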
\n\n### 2. Clone for local browsing (optional)\nIf you want to read offline or contribute content, clone the repository locally:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-Reasoning-Recipes.git\ncd Awesome-RL-Reasoning-Recipes\n```\nAfter cloning, open the `README.md` file in a Markdown reader (e.g., VS Code or Typora) to browse it.\n\n### 3. Subscribe to updates\nFollow the project's **News** section or Watch the repository to receive categorized notices about newly released models (e.g., GLM-4.5, Qwen3, Kimi K2) and newly added papers.\n\n## Basic Usage\n\nThe core way to use this project is as a **map**: locate the paper, technical report, or open-source implementation that matches your need.\n\n### Scenario 1: Find frontier models\nTo learn about the latest embodied-intelligence or large reasoning models, consult the **Frontier Models** section.\n*   **How**: find the model name on the page (e.g., `Qwen3` or `GLM-4.5`).\n*   **Then**: click the `Paper` button in the table to read the technical report, or the `Github` button to jump to the official code repository for reproduction or fine-tuning.\n\n### Scenario 2: Study a specific RL algorithm implementation\nIf you are developing an RL-based reasoning model and need concrete reward designs or policy-optimization algorithms:\n*   **How**: use the table of contents to jump to the subcategories under **Paper List**, for example:\n    *   `Reward Design` -> `Dense Rewards`\n    *   `Policy Optimization` -> `Critic-Free Algorithms`\n    *   `Sampling Strategy`\n*   **Then**: each subcategory lists the relevant papers and, where available, the corresponding open-source project addresses, so you can study the implementation details directly (a toy sketch of the critic-free idea appears at the end of this guide).\n\n### Scenario 3: Find training resources and environments\nIf you need to build your own RL training environment or are looking for datasets:\n*   **How**: consult the **Training Resource** section.\n*   **What it covers**: links to static corpora (code, math, STEM, etc.) and dynamic environments (rule-based, code-based, game-based, etc.), plus recommended options for RL infrastructure.\n\n### Citing this project\nIf you use this survey list in your research or work, please add the following citation to your paper:\n```bibtex\n@article{zhang2025survey,\n  title={A survey of reinforcement learning for large reasoning models},\n  author={Zhang, Kaiyan and Zuo, Yuxin and He, Bingxiang and Sun, Youbang and Liu, Runze and Jiang, Che and Fan, Yuchen and Tian, Kai and Jia, Guoli and Li, Pengfei and others},\n  journal={arXiv preprint arXiv:2509.08827},\n  year={2025}\n}\n```
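\n\nAs a footnote to Scenario 2: the `Critic-Free Algorithms` category collects methods (GRPO is the best-known example) that replace a learned value critic with a baseline computed from a group of responses sampled for the same prompt. The toy sketch below only illustrates that group-relative advantage idea under simplified assumptions (one prompt, one scalar reward per response); it is not code from any catalogued project, and real implementations add ratio clipping, KL regularization, and token-level weighting.\n\n```python\n# Toy sketch of a critic-free, group-relative advantage (GRPO-style).\n# Assumption: G responses are sampled for one prompt and each receives a\n# scalar reward; the group statistics stand in for a learned value critic.\nfrom statistics import mean, stdev\n\ndef group_relative_advantages(rewards: list[float]) -> list[float]:\n    # Normalize each reward against its own group's mean and spread.\n    mu = mean(rewards)\n    sigma = stdev(rewards) if len(rewards) > 1 else 1.0\n    sigma = sigma or 1.0  # guard against a zero-variance group\n    return [(r - mu) \u002F sigma for r in rewards]\n\n# Example: 4 responses to one prompt, scored by a verifier (1 = correct).\nprint(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))\n```\nHere two correct answers out of four yield symmetric positive and negative advantages, which is exactly the baseline signal a learned critic would otherwise have to provide.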
","A research team at a top AI lab is developing a next-generation large model with deep logical-reasoning capabilities and urgently needs reinforcement learning (RL) to optimize the model's chain-of-thought paths.\n\n### Without Awesome-RL-for-LRMs\n- **Literature search is a needle-in-a-haystack exercise**: researchers must manually sift through masses of arXiv papers and struggle to tell RL research that specifically targets large reasoning models apart from generic LLM fine-tuning, which is highly inefficient.\n- **Technical routes are chosen blindly**: facing generative, dense, and unsupervised reward paradigms without a systematic comparison, the team can easily waste weeks of compute on a poorly chosen reward-function design.\n- **Reproducing algorithms is hard**: without vetted best-practice code for policy-optimization algorithms (such as critic-free methods), reproducing frontier papers from scratch frequently fails to converge.\n- **The team lags the field**: this subfield moves so fast that frameworks like SSRL and MARTI are easy to miss, leaving the team's starting point behind the community average.\n\n### With Awesome-RL-for-LRMs\n- **Precise, efficient knowledge acquisition**: using the structured paper list, the team filters by the 'Reward Design' and 'Policy Optimization' categories to pin down dozens of core papers, cutting survey time from two weeks to two days.\n- **Evidence-based technical decisions**: drawing on the survey's analysis of different reward mechanisms, the team quickly settles on a 'dense reward + policy gradient' combination suited to the task, avoiding pointless trial and error.\n- **Engineering with a blueprint**: through the linked open-source projects (e.g., TTRL, MARTI), researchers consult mature code implementations, sharply lowering the difficulty of reproduction and accelerating experimental iteration.\n- **Staying at the frontier**: relying on the continuously updated news feed, the team integrates the latest agent-memory research immediately, keeping its model architecture in the field's first tier.\n\nAwesome-RL-for-LRMs is more than a reading list: it is a navigator and an accelerator for teams exploring reinforcement learning for large reasoning models.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTsinghuaC3I_Awesome-RL-for-LRMs_e10d984f.png","TsinghuaC3I","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FTsinghuaC3I_5303c8c6.jpg","Center for Collaborative & Conversational Intelligence at Tsinghua University",null,"https:\u002F\u002Fc3i.ee.tsinghua.edu.cn\u002Fen\u002F","https:\u002F\u002Fgithub.com\u002FTsinghuaC3I",[79],{"name":80,"color":81,"percentage":82},"TeX","#3D6117",100,2424,130,"2026-04-13T02:50:13","MIT",1,"","Not specified",{"notes":91,"python":89,"dependencies":92},"This project is a survey list (an Awesome List) that collects papers, frontier models, and links to open-source projects on reinforcement learning for large reasoning models (RL for LRMs). It is not itself an executable software tool or framework, so it has no specific runtime environment, dependencies, or hardware requirements. To run a specific model mentioned in the list (e.g., Qwen3 or GLM-4.5) or a code base (e.g., SSRL or MARTI), consult that project's own README for the corresponding environment setup.",[],[14,35],[95,96,97,98,99,100,101],"awesome-list","llm","reasoning","rl","deepseek-r1","open-source","lrm","2026-03-27T02:49:30.150509","2026-04-13T22:47:21.275309",[105,110,115,120,125,130,135],{"id":106,"question_zh":107,"answer_zh":108,"source_url":109},31995,"How can I obtain the BibTeX references file for the literature cited in this survey?","The maintainers have uploaded a BibTeX-style references file containing all cited works. You can access it directly at: https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-for-LRMs\u002Fblob\u002Fmain\u002Freference.bib. Note that timely updates to this file are not guaranteed.","https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-for-LRMs\u002Fissues\u002F54",{"id":111,"question_zh":112,"answer_zh":113,"source_url":114},31996,"I would like to recommend a new paper on reinforcement learning and LLM reasoning. How do I submit it?","Create an Issue in the GitHub repository with the paper's title, a brief summary, its arXiv link, and the code repository link (if any). Once the maintainers confirm the paper fits the topic, they will add it to the paper list. Users have already contributed AgentFlow, KRPO, DisCO, AMPO, CE-GPPO, RewardAnything, KlearReasoner, and multimodal RL work this way.","https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-for-LRMs\u002Fissues\u002F53",{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},31997,"I noticed that the survey paper or README omits a citation to an important work. How do I report it?","Submit an Issue pointing out the missing citation, with details of the work (title, authors, arXiv link, etc.). The maintainers will consider adding it in a future update. For example, a user noted that DeepRetrieval was mentioned in the README but not cited in the paper body, and the maintainers replied that the citation would be added in a future revision.","https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-for-LRMs\u002Fissues\u002F45",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},31998,"Does the project include work on rubric-based generative reward models?","Yes. For example, 'RewardAnything', the first open-source rubric-based generative reward model, has been added to the collection. The model is designed to interpret and follow natural-language principles at inference time to produce rewards, specifically for RL training with diverse alignment objectives. It can be used directly via `pip install rewardanything`.","https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-for-LRMs\u002Fissues\u002F37",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},31999,"Is any reinforcement-learning research on language agents for social intelligence included?","Yes. The project includes work on Adaptive Social Reasoning of Language Agents. That research proposes the ASL framework and the AMPO algorithm, which integrate mode-level and sample-level advantage estimation to enable dynamic mode switching, strengthening language agents' adaptive reasoning in rich social contexts.","https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-for-LRMs\u002Fissues\u002F41",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},32000,"Does the project cover multi-turn, multi-modal reinforcement learning research?","Yes. The project includes research on Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents, which focuses on exploration incentives and credit assignment for multi-turn dialogue agents.","https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-for-LRMs\u002Fissues\u002F35",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},32001,"Besides the standard GRPO algorithm, does the project include its improved variants (e.g., GPPO, CE-GPPO)?","Yes, the project includes several improved GRPO variants. For example: 1. **GPPO** (Gradient-Preserving Policy Optimization), proposed in the KlearReasoner work, achieves better results than GRPO\u002FCISPO; 2. **CE-GPPO** (Controlling Entropy via Gradient-Preserving Policy Optimization) reintroduces the gradients of clipped tokens in a gentle, bounded manner to balance exploration and exploitation, effectively mitigating entropy instability.","https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-for-LRMs\u002Fissues\u002F36",[141],{"id":142,"version":143,"summary_zh":144,"released_at":145},239241,"TripleR","Contents of the previous repository, Awesome RL Reasoning Recipes (initial version).","2025-09-11T02:15:04"]