[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-huggingface--trl":3,"tool-huggingface--trl":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150037,2,"2026-04-10T23:33:47",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 
100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":72,"owner_website":77,"owner_url":78,"languages":79,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":10,"env_os":98,"env_gpu":99,"env_ram":98,"env_deps":100,"category_tags":108,"github_topics":76,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":109,"updated_at":110,"faqs":111,"releases":140},6487,"huggingface\u002Ftrl","trl","Train transformer language models with reinforcement learning.","trl 是一个专为大语言模型后期训练设计的开源库，旨在帮助开发者利用强化学习等技术优化模型表现。它解决了传统方法中模型难以对齐人类偏好、训练流程复杂以及硬件资源消耗过大的痛点。无论是希望提升模型对话质量的研究人员，还是需要高效微调方案的工程师，都能通过 trl 轻松上手。\n\ntrl 基于 🤗 Transformers 生态构建，提供了多种开箱即用的训练器，如监督微调（SFT）、直接偏好优化（DPO）和群组相对策略优化（GRPO），让用户无需编写大量底层代码即可实现先进的对齐算法。其独特亮点在于卓越的扩展性与效率：不仅支持从单张显卡到多节点集群的无缝扩容，还深度集成了 PEFT 库，通过量化和 LoRA 技术在有限显存下训练超大模型；同时兼容 Unsloth 加速内核，进一步缩短训练时间。此外，trl 还提供了命令行接口，让非编程背景的用户也能快速启动微调任务。如果你正在寻找一个灵活、高效且功能全面的工具来定制属于自己的基础模型，trl 将是理想的选择。","# TRL - Transformers Reinforcement Learning\n\n\u003Cdiv style=\"text-align: center\">\n    \u003Cpicture>\n        \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftrl-lib\u002Fdocumentation-images\u002Fresolve\u002Fmain\u002FTRL%20banner%20light.png\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_trl_readme_2156d4ce973d.png\" alt=\"TRL Banner\">\n    \u003C\u002Fpicture>\n\u003C\u002Fdiv>\n\n\u003Chr> \u003Cbr>\n\n\u003Ch3 align=\"center\">\n    \u003Cp>A comprehensive library to post-train foundation models\u003C\u002Fp>\n\u003C\u002Fh3>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fblob\u002Fmain\u002FLICENSE\">\u003Cimg alt=\"License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fhuggingface\u002Ftrl.svg?color=blue\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Findex\">\u003Cimg alt=\"Documentation\" src=\"https:\u002F\u002Fimg.shields.io\u002Fwebsite?label=documentation&url=https%3A%2F%2Fhuggingface.co%2Fdocs%2Ftrl%2Findex&down_color=red&down_message=offline&up_color=blue&up_message=online\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Freleases\">\u003Cimg alt=\"GitHub 
release\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Fhuggingface\u002Ftrl.svg\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Ftrl-lib\">\u003Cimg alt=\"Hugging Face Hub\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20Hub-trl--lib-yellow\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n## 🎉 What's New\n\n**TRL v1:** We released TRL v1 — a major milestone that marks a real shift in what TRL is. Read the [blog post](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Ftrl-v1) to learn more.\n\n## Overview\n\nTRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Built on top of the [🤗 Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled-up across various hardware setups.\n\n## Highlights\n\n- **Trainers**: Various fine-tuning methods are easily accessible via trainers like [`SFTTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fsft_trainer), [`GRPOTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fgrpo_trainer), [`DPOTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fdpo_trainer), [`RewardTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Freward_trainer) and more.\n\n- **Efficient and scalable**:\n  - Leverages [🤗 Accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) to scale from single GPU to multi-node clusters using methods like [DDP](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Fintermediate\u002Fddp_tutorial.html) and [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fdeepspeedai\u002FDeepSpeed).\n  - Full integration with [🤗 PEFT](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft) enables training on large models with modest hardware via quantization and LoRA\u002FQLoRA.\n  - Integrates [🦥 Unsloth](https:\u002F\u002Fgithub.com\u002Funslothai\u002Funsloth) for accelerating training using optimized kernels.\n\n- **Command Line Interface (CLI)**: A simple interface lets you fine-tune with models without needing to write code.\n\n## Installation\n\n### Python Package\n\nInstall the library using `pip`:\n\n```bash\npip install trl\n```\n\n### From source\n\nIf you want to use the latest features before an official release, you can install TRL from source:\n\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl.git\n```\n\n### Repository\n\nIf you want to use the examples you can clone the repository with the following command:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl.git\n```\n\n## Quick Start\n\nFor more flexibility and control over training, TRL provides dedicated trainer classes to post-train language models or PEFT adapters on a custom dataset. 
Each trainer in TRL is a light wrapper around the 🤗 Transformers trainer and natively supports distributed training methods like DDP, DeepSpeed ZeRO, and FSDP.\n\n### `SFTTrainer`\n\nHere is a basic example of how to use the [`SFTTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fsft_trainer):\n\n```python\nfrom trl import SFTTrainer\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"trl-lib\u002FCapybara\", split=\"train\")\n\ntrainer = SFTTrainer(\n    model=\"Qwen\u002FQwen2.5-0.5B\",\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n### `GRPOTrainer`\n\n[`GRPOTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fgrpo_trainer) implements the [Group Relative Policy Optimization (GRPO) algorithm](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2402.03300) that is more memory-efficient than PPO and was used to train [Deepseek AI's R1](https:\u002F\u002Fhuggingface.co\u002Fdeepseek-ai\u002FDeepSeek-R1).\n\n```python\nfrom datasets import load_dataset\nfrom trl import GRPOTrainer\nfrom trl.rewards import accuracy_reward\n\ndataset = load_dataset(\"trl-lib\u002FDeepMath-103K\", split=\"train\")\n\ntrainer = GRPOTrainer(\n    model=\"Qwen\u002FQwen2.5-0.5B-Instruct\",\n    reward_funcs=accuracy_reward,\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n> [!NOTE]\n> For reasoning models, use the `reasoning_accuracy_reward()` function for better results.\n\n### `DPOTrainer`\n\n[`DPOTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fdpo_trainer) implements the popular [Direct Preference Optimization (DPO) algorithm](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2305.18290) that was used to post-train [Llama 3](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2407.21783) and many other models. 
Here is a basic example of how to use the `DPOTrainer`:\n\n```python\nfrom datasets import load_dataset\nfrom trl import DPOTrainer\n\ndataset = load_dataset(\"trl-lib\u002Fultrafeedback_binarized\", split=\"train\")\n\ntrainer = DPOTrainer(\n    model=\"Qwen3\u002FQwen-0.6B\",\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n### `RewardTrainer`\n\nHere is a basic example of how to use the [`RewardTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Freward_trainer):\n\n```python\nfrom trl import RewardTrainer\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"trl-lib\u002Fultrafeedback_binarized\", split=\"train\")\n\ntrainer = RewardTrainer(\n    model=\"Qwen\u002FQwen2.5-0.5B-Instruct\",\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n## Command Line Interface (CLI)\n\nYou can use the TRL Command Line Interface (CLI) to quickly get started with post-training methods like Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO):\n\n**SFT:**\n\n```bash\ntrl sft --model_name_or_path Qwen\u002FQwen2.5-0.5B \\\n    --dataset_name trl-lib\u002FCapybara \\\n    --output_dir Qwen2.5-0.5B-SFT\n```\n\n**DPO:**\n\n```bash\ntrl dpo --model_name_or_path Qwen\u002FQwen2.5-0.5B-Instruct \\\n    --dataset_name argilla\u002FCapybara-Preferences \\\n    --output_dir Qwen2.5-0.5B-DPO \n```\n\nRead more about CLI in the [relevant documentation section](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fclis) or use `--help` for more details.\n\n## Development\n\nIf you want to contribute to `trl` or customize it to your needs make sure to read the [contribution guide](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fblob\u002Fmain\u002FCONTRIBUTING.md) and make sure you make a dev install:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl.git\ncd trl\u002F\npip install -e .[dev]\n```\n\n## Experimental\n\nA minimal incubation area is available under `trl.experimental` for unstable \u002F fast-evolving features. 
Anything there may change or be removed in any release without notice.\n\nExample:\n\n```python\nfrom trl.experimental.new_trainer import NewTrainer\n```\n\nRead more in the [Experimental docs](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fexperimental_overview).\n\n## Citation\n\n```bibtex\n@software{vonwerra2020trl,\n  title   = {{TRL: Transformers Reinforcement Learning}},\n  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},\n  license = {Apache-2.0},\n  url     = {https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl},\n  year    = {2020}\n}\n```\n\n## License\n\nThis repository's source code is available under the [Apache-2.0 License](LICENSE).\n","# TRL - 变压器强化学习\n\n\u003Cdiv style=\"text-align: center\">\n    \u003Cpicture>\n        \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftrl-lib\u002Fdocumentation-images\u002Fresolve\u002Fmain\u002FTRL%20banner%20light.png\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_trl_readme_2156d4ce973d.png\" alt=\"TRL Banner\">\n    \u003C\u002Fpicture>\n\u003C\u002Fdiv>\n\n\u003Chr> \u003Cbr>\n\n\u003Ch3 align=\"center\">\n    \u003Cp>用于基础模型后训练的全面库\u003C\u002Fp>\n\u003C\u002Fh3>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fblob\u002Fmain\u002FLICENSE\">\u003Cimg alt=\"License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fhuggingface\u002Ftrl.svg?color=blue\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Findex\">\u003Cimg alt=\"Documentation\" src=\"https:\u002F\u002Fimg.shields.io\u002Fwebsite?label=documentation&url=https%3A%2F%2Fhuggingface.co%2Fdocs%2Ftrl%2Findex&down_color=red&down_message=offline&up_color=blue&up_message=online\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Freleases\">\u003Cimg alt=\"GitHub release\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Fhuggingface\u002Ftrl.svg\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Ftrl-lib\">\u003Cimg alt=\"Hugging Face Hub\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20Hub-trl--lib-yellow\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n## 🎉 新功能\n\n**TRL v1:** 我们发布了 TRL v1 — 这是一个重要的里程碑，标志着 TRL 的真正转变。请阅读 [博客文章](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Ftrl-v1) 以了解更多信息。\n\n## 概述\n\nTRL 是一个尖端库，旨在使用监督微调 (SFT)、组相对策略优化 (GRPO) 和直接偏好优化 (DPO) 等先进技术对基础模型进行后训练。TRL 构建在 [🤗 Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) 生态系统之上，支持多种模型架构和模态，并且可以在各种硬件配置上进行扩展。\n\n## 亮点\n\n- **训练器**: 各种微调方法可通过诸如 [`SFTTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fsft_trainer)、[`GRPOTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fgrpo_trainer)、[`DPOTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fdpo_trainer)、[`RewardTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Freward_trainer) 等训练器轻松访问。\n\n- **高效且可扩展**:\n  - 利用 [🤗 Accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) 通过 [DDP](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Fintermediate\u002Fddp_tutorial.html) 和 [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fdeepspeedai\u002FDeepSpeed) 等方法从单 GPU 扩展到多节点集群。\n  - 与 [🤗 
PEFT](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft) 完全集成，可通过量化和 LoRA\u002FQLoRA 在中等硬件上训练大型模型。\n  - 集成 [🦥 Unsloth](https:\u002F\u002Fgithub.com\u002Funslothai\u002Funsloth)，利用优化内核加速训练。\n\n- **命令行界面 (CLI)**: 简单的界面使您无需编写代码即可对模型进行微调。\n\n## 安装\n\n### Python 包\n\n使用 `pip` 安装库：\n\n```bash\npip install trl\n```\n\n### 从源码安装\n\n如果您想在正式发布之前使用最新功能，可以从源码安装 TRL：\n\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl.git\n```\n\n### 仓库\n\n如果您想使用示例，可以使用以下命令克隆仓库：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl.git\n```\n\n## 快速入门\n\n为了在训练中获得更大的灵活性和控制力，TRL 提供了专门的训练器类，用于在自定义数据集上对语言模型或 PEFT 适配器进行后训练。TRL 中的每个训练器都是 🤗 Transformers 训练器的轻量级封装，并原生支持分布式训练方法，如 DDP、DeepSpeed ZeRO 和 FSDP。\n\n### `SFTTrainer`\n\n以下是使用 [`SFTTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fsft_trainer) 的基本示例：\n\n```python\nfrom trl import SFTTrainer\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"trl-lib\u002FCapybara\", split=\"train\")\n\ntrainer = SFTTrainer(\n    model=\"Qwen\u002FQwen2.5-0.5B\",\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n### `GRPOTrainer`\n\n[`GRPOTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fgrpo_trainer) 实现了 [组相对策略优化 (GRPO) 算法](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2402.03300)，该算法比 PPO 更节省内存，并被用于训练 [Deepseek AI 的 R1](https:\u002F\u002Fhuggingface.co\u002Fdeepseek-ai\u002FDeepSeek-R1)。\n\n```python\nfrom datasets import load_dataset\nfrom trl import GRPOTrainer\nfrom trl.rewards import accuracy_reward\n\ndataset = load_dataset(\"trl-lib\u002FDeepMath-103K\", split=\"train\")\n\ntrainer = GRPOTrainer(\n    model=\"Qwen\u002FQwen2.5-0.5B-Instruct\",\n    reward_funcs=accuracy_reward,\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n> [!NOTE]\n> 对于推理模型，请使用 `reasoning_accuracy_reward()` 函数以获得更好的效果。\n\n### `DPOTrainer`\n\n[`DPOTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fdpo_trainer) 实现了流行的 [直接偏好优化 (DPO) 算法](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2305.18290)，该算法曾用于对 [Llama 3](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2407.21783) 和许多其他模型进行后训练。以下是使用 `DPOTrainer` 的基本示例：\n\n```python\nfrom datasets import load_dataset\nfrom trl import DPOTrainer\n\ndataset = load_dataset(\"trl-lib\u002Fultrafeedback_binarized\", split=\"train\")\n\ntrainer = DPOTrainer(\n    model=\"Qwen3\u002FQwen-0.6B\",\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n### `RewardTrainer`\n\n以下是使用 [`RewardTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Freward_trainer) 的基本示例：\n\n```python\nfrom trl import RewardTrainer\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"trl-lib\u002Fultrafeedback_binarized\", split=\"train\")\n\ntrainer = RewardTrainer(\n    model=\"Qwen\u002FQwen2.5-0.5B-Instruct\",\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n## 命令行界面 (CLI)\n\n您可以使用 TRL 命令行界面 (CLI) 快速开始使用监督微调 (SFT) 或直接偏好优化 (DPO) 等后训练方法：\n\n**SFT:**\n\n```bash\ntrl sft --model_name_or_path Qwen\u002FQwen2.5-0.5B \\\n    --dataset_name trl-lib\u002FCapybara \\\n    --output_dir Qwen2.5-0.5B-SFT\n```\n\n**DPO:**\n\n```bash\ntrl dpo --model_name_or_path Qwen\u002FQwen2.5-0.5B-Instruct \\\n    --dataset_name argilla\u002FCapybara-Preferences \\\n    --output_dir Qwen2.5-0.5B-DPO \n```\n\n有关 CLI 的更多信息，请参阅 [相关文档部分](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fclis)，或使用 `--help` 获取更多详细信息。\n\n## 开发\n\n如果您想为 `trl` 做出贡献或根据您的需求对其进行定制，请务必阅读 
[贡献指南](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fblob\u002Fmain\u002FCONTRIBUTING.md)，并确保进行开发安装：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl.git\ncd trl\u002F\npip install -e .[dev]\n```\n\n## 实验性\n\n在 `trl.experimental` 下提供了一个最小的实验性区域，用于存放不稳定的或快速发展的功能。该区域中的任何内容都可能在任何版本中未经通知地发生变化或被移除。\n\n示例：\n\n```python\nfrom trl.experimental.new_trainer import NewTrainer\n```\n\n更多信息请参阅[实验性文档](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fexperimental_overview)。\n\n## 引用\n\n```bibtex\n@software{vonwerra2020trl,\n  title   = {{TRL: 变换器强化学习}},\n  author  = {冯·韦拉，莱昂德罗；贝尔卡达，尤内斯；坦斯托尔，刘易斯；比钦，爱德华；瑟什，特里斯坦；兰伯特，内森；黄圣毅；拉苏尔，卡希夫；加卢埃德克，昆汀},\n  license = {Apache-2.0},\n  url     = {https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl},\n  year    = {2020}\n}\n```\n\n## 许可证\n\n本仓库的源代码根据[Apache-2.0 许可证](LICENSE)开放。","# TRL 快速上手指南\n\nTRL (Transformers Reinforcement Learning) 是一个用于基础模型后训练的综合库，支持监督微调 (SFT)、组相对策略优化 (GRPO) 和直接偏好优化 (DPO) 等先进技术。它与 🤗 Transformers 生态深度集成，支持分布式训练和高效显存优化。\n\n## 环境准备\n\n*   **系统要求**：Linux 或 macOS 系统（Windows 需通过 WSL2 运行）。\n*   **Python 版本**：建议 Python 3.9 及以上。\n*   **前置依赖**：\n    *   PyTorch (建议 2.0+)\n    *   🤗 Transformers\n    *   🤗 Accelerate\n    *   🤗 Datasets\n*   **硬件建议**：推荐使用 NVIDIA GPU。若显存有限，可配合 🤗 PEFT (LoRA\u002FQLoRA) 或 Unsloth 使用。\n\n> **国内加速提示**：\n> 在中国大陆地区，建议配置 pip 国内镜像源以加快安装速度：\n> ```bash\n> export PIP_INDEX_URL=https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 安装步骤\n\n### 方式一：通过 PyPI 安装（推荐）\n安装稳定版本：\n```bash\npip install trl\n```\n\n### 方式二：从源码安装（获取最新特性）\n如需使用尚未发布的最新功能（如 TRL v1 新特性）：\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl.git\n```\n\n### 方式三：克隆仓库（用于运行示例）\n如果需要运行官方提供的示例脚本：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl.git\ncd trl\n```\n\n## 基本使用\n\nTRL 提供了多种 Trainer 类，只需几行代码即可启动训练。以下是最常用的三种场景示例。\n\n### 1. 监督微调 (SFT)\n使用 `SFTTrainer` 对模型进行指令微调。\n\n```python\nfrom trl import SFTTrainer\nfrom datasets import load_dataset\n\n# 加载数据集\ndataset = load_dataset(\"trl-lib\u002FCapybara\", split=\"train\")\n\n# 初始化训练器\ntrainer = SFTTrainer(\n    model=\"Qwen\u002FQwen2.5-0.5B\",\n    train_dataset=dataset,\n)\n\n# 开始训练\ntrainer.train()\n```\n\n### 2. 直接偏好优化 (DPO)\n使用 `DPOTrainer` 基于人类偏好数据对齐模型。\n\n```python\nfrom datasets import load_dataset\nfrom trl import DPOTrainer\n\n# 加载二值化偏好数据集\ndataset = load_dataset(\"trl-lib\u002Fultrafeedback_binarized\", split=\"train\")\n\n# 初始化训练器\ntrainer = DPOTrainer(\n    model=\"Qwen3\u002FQwen-0.6B\",\n    train_dataset=dataset,\n)\n\n# 开始训练\ntrainer.train()\n```\n\n### 3. 
组相对策略优化 (GRPO)\n使用 `GRPOTrainer` 进行更高效的强化学习训练（适用于推理模型）。\n\n```python\nfrom datasets import load_dataset\nfrom trl import GRPOTrainer\nfrom trl.rewards import accuracy_reward\n\n# 加载数据集\ndataset = load_dataset(\"trl-lib\u002FDeepMath-103K\", split=\"train\")\n\n# 初始化训练器\ntrainer = GRPOTrainer(\n    model=\"Qwen\u002FQwen2.5-0.5B-Instruct\",\n    reward_funcs=accuracy_reward,\n    train_dataset=dataset,\n)\n\n# 开始训练\ntrainer.train()\n```\n\n### 命令行快速启动 (CLI)\n如果不希望编写代码，可以直接使用 CLI 命令启动训练。\n\n**SFT 示例：**\n```bash\ntrl sft --model_name_or_path Qwen\u002FQwen2.5-0.5B \\\n    --dataset_name trl-lib\u002FCapybara \\\n    --output_dir Qwen2.5-0.5B-SFT\n```\n\n**DPO 示例：**\n```bash\ntrl dpo --model_name_or_path Qwen\u002FQwen2.5-0.5B-Instruct \\\n    --dataset_name argilla\u002FCapybara-Preferences \\\n    --output_dir Qwen2.5-0.5B-DPO \n```","某初创团队正在开发一款垂直领域的法律问答助手，需要将通用的开源大模型微调为符合专业法律逻辑且回答风格严谨的专用模型。\n\n### 没有 trl 时\n- **代码实现复杂**：工程师需从零手写强化学习（如 DPO 或 PPO）的训练循环、损失函数及采样逻辑，极易引入难以排查的 Bug。\n- **资源门槛极高**：缺乏对 DeepSpeed 和 FSDP 的原生支持，导致在单卡或消费级显卡上无法加载和训练参数量较大的法律模型。\n- **生态割裂严重**：需要手动拼接 Hugging Face Transformers、Datasets 和 PEFT 库，数据格式转换繁琐，实验迭代周期长达数周。\n- **算法验证困难**：尝试最新的偏好对齐算法（如 GRPO）时，因缺乏标准实现而不得不反复复现论文，研发风险不可控。\n\n### 使用 trl 后\n- **开箱即用**：直接调用 `DPOTrainer` 或 `GRPOTrainer` 等封装好的训练器，仅需几行代码即可启动复杂的偏好对齐训练流程。\n- **高效显存优化**：原生集成 PEFT、QLoRA 及 Unsloth 加速内核，使团队能在有限的算力预算下完成全参数或高效微调。\n- **生态无缝衔接**：完美兼容 Hugging Face 生态系统，可直接加载数据集和模型，将原本数周的开发工作缩短至几天。\n- **前沿算法落地**：内置最新的强化学习算法实现，团队可立即验证不同策略对法律回答准确性的提升效果，无需重复造轮子。\n\ntrl 通过标准化和简化强化学习微调流程，让中小团队也能以低成本快速构建出对齐人类价值观的高质量垂直领域大模型。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_trl_2156d4ce.png","huggingface","Hugging Face","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhuggingface_90da21a4.png","The AI community building the future.",null,"https:\u002F\u002Fhuggingface.co\u002F","https:\u002F\u002Fgithub.com\u002Fhuggingface",[80,84,87,91],{"name":81,"color":82,"percentage":83},"Python","#3572A5",98,{"name":85,"color":86,"percentage":32},"Jinja","#a52a22",{"name":88,"color":89,"percentage":90},"Makefile","#427819",0,{"name":92,"color":93,"percentage":90},"Dockerfile","#384d54",17996,2633,"2026-04-10T19:58:46","Apache-2.0","未说明","需要 GPU 以支持分布式训练（DDP, DeepSpeed, FSDP）；具体型号和显存取决于模型大小，配合 PEFT\u002FQLoRA 可在较小显存设备上运行大模型；集成 Unsloth 优化内核加速",{"notes":101,"python":98,"dependencies":102},"TRL 基于 Hugging Face 生态系统构建，支持从单 GPU 到多节点集群的扩展。通过集成 PEFT 和量化技术（如 LoRA\u002FQLoRA），可在硬件资源有限的情况下训练大型模型。支持命令行界面（CLI）无需编写代码即可进行微调。实验性功能位于 trl.experimental 模块中，可能随时变更。",[103,104,105,106,107],"transformers","accelerate","peft","datasets","unsloth (可选)",[35,14],"2026-03-27T02:49:30.150509","2026-04-11T10:01:30.934724",[112,117,122,127,132,136],{"id":113,"question_zh":114,"answer_zh":115,"source_url":116},29357,"如何在 SFTTrainer 中计算基于生成的评估指标（如 BLEU）？","在 `Seq2SeqTrainingArguments` 中设置 `predict_with_generate=True`，并确保在初始化 `SFTTrainer` 时传入自定义的 `compute_metrics` 函数。虽然早期版本存在 logits 而非生成文本的问题，但后续更新已修复此问题，现在 `eval_preds` 中应包含正确的生成结果用于指标计算。请确保使用最新版本的 trl 库。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fissues\u002F862",{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},29358,"在使用 PEFT 进行 PPO 训练后，如何合并适配器（Adapter）模型以保存完整的模型文件（包含 config.json）？","如果使用了 `--use_peft` 参数，保存的模型通常只包含适配器权重（adapter_model.bin）而缺少完整的 config.json。解决方法是在训练完成后调用 `merge_and_unload()` 方法，将适配器权重合并到基础模型中，然后再保存模型。这样可以生成包含完整配置和权重的模型文件，避免加载时需要额外指定 PEFT 配置。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fissues\u002F526",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},29359,"DPOTrainer 
在处理编码器 - 解码器模型（如 T5）时生成多个或损坏的响应怎么办？","该问题曾在旧版本中出现，特别是在使用 T5 等编码器 - 解码器架构时。维护者建议尝试在最新的代码分支（latest head）上运行，因为 DPO 模块正在进行重构（参考 PR #3906），许多针对编码器 - 解码器模型的兼容性问题已在更新中修复。如果问题依旧，请检查是否使用了正确的 Tokenizer 类（如 T5Tokenizer）并确认输入数据格式符合 DPO 要求。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fissues\u002F1025",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},29360,"为什么 reward_modeling.py 脚本中的训练损失和准确率显示异常？","这通常与数据处理或标签对齐有关。检查数据集是否正确格式化，确保 `chosen` 和 `rejected` 字段的文本长度未超过 `max_length` 限制。此外，确认 `RewardConfig` 中的 `remove_unused_columns` 设置为 `False` 以避免必要列被移除。如果使用了量化（如 4bit\u002F8bit 加载），请验证 `BitsAndBytesConfig` 配置是否正确，有时精度问题会导致梯度计算异常从而影响损失值。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fissues\u002F937",{"id":133,"question_zh":134,"answer_zh":135,"source_url":121},29361,"如何在消费级 GPU（如 24GB 显存）上微调大模型（如 20B 参数）进行 RLHF 训练？","必须结合 PEFT（参数高效微调）和量化技术。在命令行参数中添加 `--use_peft` 以启用 LoRA 等适配器方法，大幅减少显存占用。同时建议使用 `--load_in_4bit` 或 `--load_in_8bit` 配合 `BitsAndBytesConfig` 进行模型量化。注意训练完成后需使用 `merge_and_unload()` 合并模型权重才能导出完整模型文件。",{"id":137,"question_zh":138,"answer_zh":139,"source_url":126},29362,"DPO 训练时遇到编码器 - 解码器模型兼容性问题的最新状态是什么？","DPO 模块正在经历重大重构（见 PR #3906），旨在更好地支持包括 T5 在内的编码器 - 解码器模型。之前的版本在这些架构上可能会出现生成重复或损坏内容的问题。建议用户升级到主分支的最新版本进行测试，旧版 Issue 因重构计划已被关闭，新功能将解决大部分已知兼容性痛点。",[141,146,151,156,161,166,171,176,181,186,191,196,201,206,211,216,221,226,231,236],{"id":142,"version":143,"summary_zh":144,"released_at":145},198133,"v1.0.0","\u003Cimg width=\"1800\" height=\"1013\" alt=\"thumbnail-2\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F5c55b86a-0600-4f70-bf37-41ab240af851\" \u002F>\n\n请阅读我们的[博客文章](https:\u002F\u002Fhf.co\u002Fblog\u002Ftrl-v1)，了解 TRL v1 的概览。\n\n## 功能特性\n\n### 异步 GRPO\n\n异步 GRPO 通过将采样回放卸载到外部 vLLM 服务器，将生成过程与梯度更新循环解耦。生成过程与训练并行进行，从而消除了 GPU 的空闲时间，提高了硬件利用率。\n\n```python\nfrom trl.experimental.async_grpo import AsyncGRPOTrainer\nfrom trl.rewards import accuracy_reward\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"trl-lib\u002FDeepMath-103K\", split=\"train\")\n\ntrainer = AsyncGRPOTrainer(\n    model=\"Qwen\u002FQwen2.5-0.5B-Instruct\",\n    reward_funcs=accuracy_reward,\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5293 中实现。\n\n### 变分序列级软策略优化 (VESPO)\n\n\u003Cimg width=\"465\" height=\"279\" alt=\"Screenshot 2026-03-20 at 5 49 50 PM\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb60c9697-6eb7-498e-95b3-df78c367f5fa\" \u002F>\n\n[VESPO](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2602.10693) 解决了离策略强化学习中由于策略陈旧、异步更新以及训练与推理不匹配所导致的训练不稳定问题。VESPO 并不依赖于启发式的 token 级别裁剪（GRPO）或序列长度归一化（GSPO），而是从变分框架中推导出一个原则性的重塑核函数。在实践中，这产生了一种平滑且不对称的 Gamma 权重函数，能够在不引入长度偏差的情况下优雅地抑制极端的序列级重要性权重。可以通过 `GRPOConfig` 的 `loss_type` 参数来启用：\n\n```python\nfrom trl import GRPOConfig, GRPOTrainer\n\ntrainer = GRPOTrainer(\n    model=\"Qwen\u002FQwen3-0.6B\",\n    args=GRPOConfig(loss_type=\"vespo\"),\n    ...\n)\n```\n\n由 @casinca 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5199 中实现。\n\n### 散度邻近策略优化 (DPPO)\n\n\u003Cimg width=\"3180\" height=\"1187\" alt=\"z_TXYw37xZqsQ21YiDkYL\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F40f1d538-82b3-4097-91c6-119ea9f7797b\" \u002F>\n\u003Cimg width=\"1189\" height=\"490\" alt=\"SfgWotuuuRKPkg-0bxWv1\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F2b090df3-0bfb-42e4-9f94-15943736e689\" 
\u002F>\n\n[DPPO](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2602.04879) 是一种新的实验性训练器，它用散度约束替代了标准的 PPO 裁剪机制，从而提供更为原则性的信任区域更新。\n\n由 @LeonEricsson 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5117 中实现。\n\n### 自蒸馏策略优化 (SDPO)\n\n[SDPO](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2601.20802) 是一种新的实验性训练器，它通过模型自身高奖励轨迹的自蒸馏来增强在策略强化学习。与使用外部教师不同，SDPO 将当前模型在反馈条件下的输出视为自我教师，从中蒸馏出反馈信息。","2026-03-31T14:15:06",{"id":147,"version":148,"summary_zh":149,"released_at":150},198134,"v1.0.0rc1","## 功能特性\n\n### 变分序列级软策略优化 (VESPO)\n\n\u003Cimg width=\"465\" height=\"279\" alt=\"截图 2026-03-20 下午5:49:50\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb60c9697-6eb7-498e-95b3-df78c367f5fa\" \u002F>\n\n[VESPO](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2602.10693) 解决了离策略强化学习中由于策略陈旧、异步更新以及训练与推理不一致所导致的训练不稳定问题。VESPO 并不依赖启发式的 token 级别裁剪（GRPO）或序列长度归一化（GSPO），而是从变分框架中推导出一种基于原则的重塑核函数。在实践中，这产生了一种平滑且非对称的 Gamma 权重函数，能够在不引入长度偏差的情况下，优雅地抑制极端的序列级重要性权重。可以通过 `GRPOConfig` 的 `loss_type` 参数来启用：\n\n```python\nfrom trl import GRPOConfig, GRPOTrainer\n\ntrainer = GRPOTrainer(\n    model=\"Qwen\u002FQwen3-0.6B\",\n    args=GRPOConfig(loss_type=\"vespo\"),\n    ...\n)\n```\n\n由 @casinca 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5199 中实现。\n\n### 散度邻近策略优化 (DPPO)\n\n\u003Cimg width=\"3180\" height=\"1187\" alt=\"z_TXYw37xZqsQ21YiDkYL\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F40f1d538-82b3-4097-91c6-119ea9f7797b\" \u002F>\n\u003Cimg width=\"1189\" height=\"490\" alt=\"SfgWotuuuRKPkg-0bxWv1\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F2b090df3-0bfb-42e4-9f94-15943736e689\" \u002F>\n\n[DPPO](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2602.04879) 是一种新的实验性训练器，它用散度约束替代了标准的 PPO 裁剪机制，从而提供更为原则化的信任域更新。\n\n由 @LeonEricsson 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5117 中实现。\n\n### 奖励函数现在可以记录额外的列和标量指标\n\n奖励函数可以返回一个包含额外值（标量或每样本列）的字典，这些值将与奖励一起被记录下来。这样可以更方便地跟踪中间信号，而无需编写自定义回调函数。\n\n```python\ndef my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):\n    extracted = [extract_answer(c) for c in completions]\n    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]\n\n    if log_extra:\n        log_extra(\"golden_answer\", list(answer))\n        log_extra(\"extracted_answer\", extracted)\n\n    if log_metric:\n        log_metric(\"accuracy\", sum(rewards) \u002F len(rewards))\n\n    return rewards\n```\n\n\u003Cimg width=\"1400\" height=\"407\" alt=\"图片\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd345b0ac-0d3c-446f-9321-a26e73ee16b4\" \u002F>\n\u003Cimg width=\"1353\" height=\"673\" alt=\"图片\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb4c0302b-f69a-4715-9aad-278b4ad13299\" \u002F>\n\n由 @manueldeprada 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5233 中实现。\n\n### `VLLMClient.chat()` 中的工具调用支持\n\n`VLLMClient.chat()` 现在支持工具调用，从而可以直接通过 vLLM 客户端接口实现智能体工作流。\n\n由 @kansalaman 在 https:\u002F\u002Fgithub.com\u002F","2026-03-20T23:55:04",{"id":152,"version":153,"summary_zh":154,"released_at":155},198135,"v0.29.1","## 变更内容\n\n* 在 SFT\u002FGRPO\u002FRLOO 中处理 `mm_token_type_ids`，以修复 `IndexError`，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5178 中完成\n* 修复 `prepare_multimodal_messages`，使其支持 `tool_calls` 和 `tool` 角色，由 @alvarobartt 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5212 
中完成\n* 修复作为 CLI JSON 字符串传递时的 `model_init_kwargs` 类型问题，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5230 中完成\n* 在 GRPO 的 `_generate_single_turn` 中将回放调度与 vLLM 后端解耦，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5122 中完成\n* 简化跨 vLLM 版本的结构化输出逻辑，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5215 中完成\n* 在 vLLM 客户端和服务器中为 `prompts` 添加对原始 ID 的支持，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5225 中完成\n* 当向 vLLM 客户端传递原始标记 ID 时，添加对 VLM 的支持，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5227 中完成\n* 将 `rollout_func` 从 `_generate_single_turn` 移至 `_generate`，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5232 中完成\n* [GRPO\u002FRLOO] 在调用 vLLM 生成之前进行分词，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5238 中完成\n* 在 MiniLLMConfig 中支持对 `teacher_model_init_kwargs` 的 JSON 字符串解析，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5259 中完成\n* [GRPO\u002FRLOO] 在 `_generate_single_turn` 中统一所有生成后端的分词流程，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5239 中完成\n* [GRPO\u002FRLOO] 将分词提示从 `_generate_single_turn` 中提取出来，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5240 中完成\n* [CPO\u002FORPO] 修复对不同长度的 chosen\u002Frejected prompts 的处理问题，由 @davmels 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4639 中完成\n* 修复作为 CLI JSON 字符串传递时的 `teacher_model_init_kwargs` 类型问题，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5258 中完成\n* 修复在 GKD\u002FGOLD 中以 CLI JSON 字符串形式传递 `model_init_kwargs` 时的支持问题，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5266 中完成\n* 修复 DPO VLM 训练中 `mm_token_type_ids` 被静默丢弃的问题，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5279 中完成\n* 修复 MiniLLM 在以 CLI JSON 字符串形式传递 `model_init_kwargs` 时的支持问题，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5274 中完成\n* 修复 GRPOTrainer 对 vLLM 模型配置的属性访问问题，由 @falcondai 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5302 中完成\n* [GRPO] 通过拼接标记 ID 修复工具调用循环中的重新分词错误，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5242 中完成\n\n## 新贡献者\n\n* @davmels 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4639 中完成了他们的首次贡献\n* @falcondai 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5302 中完成了他们的首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fcompare\u002Fv0.29.0...v0.29.1","2026-03-20T03:57:13",{"id":157,"version":158,"summary_zh":159,"released_at":160},198136,"v0.29.0","## 功能特性\n\n### 向 `GRPOTrainer` 添加 `environment_factory`\n\n`GRPOTrainer` 现在接受一个 `environment_factory` 参数，允许用户指定用于训练的自定义环境类。通过让用户定义具有特定动态和奖励结构的环境，这一功能实现了更加灵活和多样化的训练场景。\n\n```python\nfrom datasets import Dataset\nfrom trl import GRPOConfig, GRPOTrainer\n\ndataset = Dataset.from_dict({\n    \"prompt\": [[{\"role\": \"user\", \"content\": f\"将计数器增加 {i}。\"}] for i in range(1, 7)]\n})\n\ndef reward_func(environments, **kwargs):\n    return [env.counter for env in environments]\n\nclass IncrementEnv:\n    def reset(self):\n        self.counter = 0\n\n    def increment(self, step: int) -> int:\n        
\"\"\"\n        增加内部计数器。\n\n        Args:\n            step: 要加到计数器上的值。\n\n        Returns:\n            更新后的计数器值。\n        \"\"\"\n        self.counter += step\n        return self.counter\n\ntrainer = GRPOTrainer(\n    model=\"Qwen\u002FQwen3-0.6B\",\n    args=GRPOConfig(chat_template_kwargs={\"enable_thinking\": False}),\n    train_dataset=dataset,\n    reward_funcs=reward_func,\n    environment_factory=IncrementEnv,\n)\ntrainer.train()\n```\n\n由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5093 中提出\n\n### 技能\n\nTRL 引入了原生的 CLI 集成：trl-training，这是一项一流的代理技能，以结构化且代理可读的格式公开 TRL 的训练工作流（SFT、DPO、GRPO 等）。该技能直接打包在 trl 库中，可以通过 CLI 安装：\n\n```bash\n# 按代理名称安装到项目的代理目录（默认作用域为项目）：claude、codex、opencode\ntrl skills install trl-training --target \u003Cagent>\n```\n\n这使得 AI 代理能够使用明确定义的接口，安全且可重复地执行 TRL 训练工作流。\n\n技能可以安装在项目或全局范围内，并支持明确的目标和覆盖控制。\n\n* 实现代理技能 [1\u002FN]：创建训练技能（MVP），由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5096 中完成\n* 实现代理技能 [2\u002FN]：创建技能模块，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5097 中完成\n* 实现代理技能 [3\u002FN]：创建技能安装程序，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5100 中完成\n* 实现代理技能 [4\u002FN]：创建技能 CLI，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5103 中完成\n\n### 其他\n* 将 vllm_is_ratio 传递给 LigerFusedLinearGRPOLoss，在 compute_liger_loss 中使用，由 @yukiu00 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5031 中提出\n* 功能：top_k selective_log_softmax，由 @LeonEricsson 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5104 中提出\n* 添加 Trackio 集成，用于模型卡片可视化，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F5101 中提出\n* 更新工具处理方式，以支持训练器中的 JSON 字符串模式，由 @qgallouedec 在 https:","2026-02-25T22:38:09",{"id":162,"version":163,"summary_zh":164,"released_at":165},198137,"v0.28.0","## 功能特性\n* [GRPOTrainer]: 支持异步工具调用的智能体训练，由 @pramodith 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4742 中实现\n* 为 vLLM 客户端添加重试策略以提高鲁棒性，由 @apalmas-saifh 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4845 中实现\n* 在在线 DPO 中启用 vLLM 的生成睡眠模式，由 @winglian 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4882 中实现\n* 在 `is_conversational` 中支持工具调用数据，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4923 中实现\n* [GRPO] 为带有个体奖励的完成结果添加 Parquet 日志记录，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4818 中实现\n* 更新 wordle.py 示例，加入对环境 token 的掩码处理，由 @sergiopaniego 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4895 中实现\n* NeMo-Gym 集成，由 @cmunley1 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4848 中实现\n\n## 实验性功能\n* 与 DPO 协同重构 KTO [c\u002FN]：移除 ref_model_init_kwargs，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4837 中实现\n* 与 DPO 协同重构 KTO [e\u002FN]：移除 label_pad_token_id，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4875 中实现\n* 与 DPO 协同重构 KTO [d\u002FN]：移除 base_model_attribute_name，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4862 中实现\n* 修复 `openenv\u002Futils.py` 中的类型提示：在未安装 vLLM 的情况下提供回退方案，由 @Datta0 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4868 中实现\n* 从实验性训练器中移除 label_pad_token_id，由 @albertvillanova 在 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4878 中实现\n* 提升 GOLD 训练速度，由 @141forever 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4888 中实现\n* 从实验性 BCO 中移除 ref_model_init_kwargs，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4946 中实现\n* 从实验性 PRM 中移除 max_prompt_length，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4963 中实现\n* 从实验性 BCO 中移除 max_prompt_length，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4964 中实现\n* 从实验性 CPO 中移除 max_prompt_length，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4965 中实现\n* 从实验性 ORPO 中移除 max_prompt_length，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4966 中实现\n* 从实验性 CPO 中移除 padding_value，改用 pad_token_id，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4962 中实现\n\n## 修复\n* 修复 peft 的 _patch_transformers_hybrid_cache，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4844 中实现\n* KTO 重构 [4\u002FN]：移除未使用的 padding_value，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4839 中实现\n* 修复：未定义的 `current_gradient_accumulation_steps`，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4852 中实现\n* 修复（DeepSeek OPSM）：传递正确的 (vLLM) 对数似然值，由 @casinca 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4857 中实现\n* 修复 prompt-completion 类型及 transformers v5 的 SFT 训练问题，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4880 中实现\n* Bugfix：vLLM 中的对数似然漂移问题","2026-02-10T13:28:21",{"id":167,"version":168,"summary_zh":169,"released_at":170},198138,"v0.27.2","## 变更内容\n\n* 由 @qgallouedec 在 #4960 中移除了对 `warnings_issued` 的访问权限\n* 修复 SFTTrainer 的初始化逻辑：仅在 transformers 版本低于 v5 时移除 `TrainingArguments.push_to_hub_token`，由 @albertvillanova 在 #4942 中完成\n* 修复 DPO 预处理中为对话数据额外添加的 EOS 标记问题，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4908 中完成\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fcompare\u002Fv0.27.1...v0.27.2","2026-02-03T18:10:01",{"id":172,"version":173,"summary_zh":174,"released_at":175},198139,"v0.27.1","## 变更内容\n\n* 修复：未定义的 `current_gradient_accumulation_steps`，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4852 中完成\n* 修复（DeepSeek OPSM）：传递正确的（vLLM）对数概率，由 @casinca 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4857 中完成\n* 修复 prompt-completion 类型的 SFT 训练以及 transformers v5 的兼容性，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4880 中完成\n* Bugfix：在 vLLM 服务模式下（相较于同机部署模式）出现的对数概率漂移问题，由 @kdubovikov 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4873 中修复\n* 修复 RewardTrainer 的结果不可复现问题，由 @liyc-ai 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4887 中完成\n\n## 新贡献者\n\n* @kdubovikov 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4873 中完成了首次贡献\n* @liyc-ai 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4887 中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fcompare\u002Fv0.27.0...v0.27.1","2026-01-24T03:42:17",{"id":177,"version":178,"summary_zh":179,"released_at":180},198140,"v0.27.0","## 功能特性\n\n* 在 GRPO、RLOO 和 OnlineDPO 配置中添加 
`vllm_group_port` 参数，由 @pointerhacker 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4545 中实现。\n* 在 BFD 打包中保留被截断的标记，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4632 中实现。\n* 支持异步奖励函数，并并行化对奖励函数的调用。由 @pramodith 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4567 中实现。\n* RLOO 支持异步奖励。由 @pramodith 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4718 中实现。\n* 支持 vLLM 0.12.0，由 @jiqing-feng 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4117 中实现。\n* 新特性：DeepSeek V3.2 的离策略序列掩码，由 @casinca 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4689 中实现。\n* 🎭 使用 `forward_masked_logits` 函数进行前向传播时，显存占用可减少高达 50%，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4729 中实现。\n* [GRPO] 添加一个配置项以限制工具调用的迭代次数，由 @pramodith 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4761 中实现。\n* 将梯度检查点的默认设置切换为 use_reentrant=False（PyTorch 推荐），由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4811 中实现。\n* 添加对 GDPO 的支持：用于多奖励强化学习优化的组奖励-解耦归一化策略优化，由 @nbasyl 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4785 中实现。\n\n## 实验性功能\n\n* 将 `AutoModelForCausalLMWithValueHead` 和 `AutoModelForSeq2SeqLMWithValueHead` 移至实验性模块，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4654 中实现。\n* 将 DPODataCollatorWithPadding 移至 `experimental.utils`，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4667 中实现。\n* 将 `DataCollatorForChatML` 移至 `experimental.utils`，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4668 中实现。\n* 将 `add_bos_token_if_needed` 和 `add_eos_token_if_needed` 移至 `experimental.utils`，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4674 中实现。\n* 将 `truncate_right` 和 `SIMPLE_CHAT_TEMPLATE` 移至 `experimental.utils`，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4677 中实现。\n* 将 `prepare_model_for_kbit_training`、`enable_gradient_checkpointing` 和 `prepare_peft_model` 移至 `experimental.utils`，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4704 中实现。\n* 将 `get_reward` 函数移至 `experimental.utils`，由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4683 中实现。\n* 从 testing_utils 中移除实验性导入，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4727 中实现。\n* ORPO：避免损失函数中的灾难性消去现象，由 @hartmans 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4763 中实现。\n* 重构 KTO [1\u002FN]：使模型初始化现代化，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4783 中实现。\n* [GOLD] 添加概率合并修复以实现链式法则，由 @kashif 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4765 中实现。\n* 与 DPO 协同重构 KTO [a\u002FN]：移除编码器-解码器支持，由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4792 中实现。\n* 与 DPO 协同重构 KTO [b\u002FN]：简化截断","2026-01-16T02:34:32",{"id":182,"version":183,"summary_zh":184,"released_at":185},198141,"v0.26.2","## 变更内容\n\n* 由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4647 中实现，覆盖了模型生成时使用的默认配置。\n\n**完整变更日志**: 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fcompare\u002Fv0.26.1...v0.26.2","2025-12-18T15:55:24",{"id":187,"version":188,"summary_zh":189,"released_at":190},198142,"v0.26.1","## 变更内容\n\n* 修复 vLLM 错误：在使用 GRPO 进行训练时，不支持工具的使用，由 @apalmas-saifh 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4663 中完成。\n* 修复 GRPO 配置验证问题：当 `num_generations_eval` 被指定且与 `num_generations` 不同时，确保配置正确，由 @apalmas-saifh 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4682 中完成。\n\n## 新贡献者\n\n* @apalmas-saifh 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4663 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fcompare\u002Fv0.26.0...v0.26.1","2025-12-12T17:50:48",{"id":192,"version":193,"summary_zh":194,"released_at":195},198143,"v0.26.0","## Features\r\n\r\n### 🕵️‍♂️ GRPO: Agent training\r\n\r\n`GRPOTrainer` now supports training agents using tools. This allows language models to interact with external functions or APIs during training.\r\n\r\n```python\r\nfrom datasets import Dataset\r\nfrom trl import GRPOTrainer\r\n\r\ndef multiply(a: int, b: int) -> int:\r\n    \"\"\"\r\n    Multiplies two integers.\r\n\r\n    Args:\r\n        a: The first integer.\r\n        b: The second integer.\r\n\r\n    Returns:\r\n        The product of the two integers.\r\n    \"\"\"\r\n    return a * b\r\n\r\n\r\ndataset = Dataset.from_list(\r\n    [\r\n        {\"prompt\": [{\"role\": \"user\", \"content\": \"What is 3 multiplied by 4?\"}], \"answer\": 12},\r\n        {\"prompt\": [{\"role\": \"user\", \"content\": \"Calculate 7 times 8.\"}], \"answer\": 56},\r\n        {\"prompt\": [{\"role\": \"user\", \"content\": \"Find the product of 5 and 6.\"}], \"answer\": 30},\r\n        {\"prompt\": [{\"role\": \"user\", \"content\": \"What do you get when you multiply 9 by 9?\"}], \"answer\": 81},\r\n        {\"prompt\": [{\"role\": \"user\", \"content\": \"Compute 12 multiplied by 11.\"}], \"answer\": 132},\r\n        {\"prompt\": [{\"role\": \"user\", \"content\": \"What is 15 times 14?\"}], \"answer\": 210},\r\n    ]\r\n)\r\n\r\ndef accuracy(completions, answer, **kwargs):\r\n    predictions = [completion[-1][\"content\"] for completion in completions]\r\n    rewards = [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]\r\n    return rewards\r\n\r\ntrainer = GRPOTrainer(\r\n    model=\"Qwen\u002FQwen3-0.6B\",\r\n    train_dataset=dataset,\r\n    tools=[multiply],\r\n    reward_funcs=accuracy,\r\n)\r\ntrainer.train()\r\n```\r\n\r\nby @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4300\r\n\r\n### ScaleRL: Add CISPO Loss\r\n\r\nCISPO Loss was first introduced in the [Minimax-M1 paper](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2506.13585), the [ScaleRL paper](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2510.13786) subsequently showed that CISPO loss scales the best in terms of performance and efficiency as models are trained for longer.\r\n\r\n`GRPOTrainer` now supports the CISPO loss using `loss_type=\"cispo\"` in the `GRPOConfig`.\r\n\r\nby @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4495\r\n\r\n### Add vLLM quantization option for colocate\r\n\r\nWhen the input model is quantized using bitsandbytes, vLLM will now also use quantization when in colocate mode.\r\n\r\nby @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4496\r\n\r\n### Reasoning reward\r\n\r\nTRL nows includes a 
reasoning reward function\r\n\r\n```python\r\nfrom trl.rewards import reasoning_accuracy_reward\r\n\r\nsolutions = [r\"\\frac{1}{3}\", r\"\\frac{1}{3}\", r\"\\frac{1}{3}\"]\r\ncompletions = [\r\n    [\r\n        {\r\n            \"role\": \"assistant\",\r\n            \"content\": r\"\u003Cthink> Reasoning content \u003C\u002Fthink> The final answer is \\boxed{\\frac{1}{3}}\",\r\n        }\r\n    ],\r\n    [\r\n        {\r\n            \"role\": \"assistant\",\r\n            \"content\": r\"\u003Cthink> Reasoning content \u003C\u002Fthink> The final answer is \\boxed{\\frac{1}{2}}\",\r\n        }\r\n    ],\r\n    [\r\n        {\r\n            \"role\": \"assistant\",\r\n            \"content\": r\"\u003Cthink> Reasoning content with partial answers \\boxed{\\frac{1}{3}} but no final answer\",\r\n        }\r\n    ],\r\n]\r\nreasoning_accuracy_reward(completions, solutions)  # [1.0, 0.0, 0.0] \r\n```\r\n\r\nAs any other reward function, it can be used in `GRPOTrainer` or `RLOOTrainer`.\r\n\r\n```python\r\nfrom trl import GRPOTrainer\r\nfrom trl.rewards import reasoning_accuracy_reward\r\n\r\ntrainer = GRPOTrainer(\r\n    ...,\r\n    reward_funcs=reasoning_accuracy_reward,\r\n)\r\n```\r\n\r\nby @lewtun in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4563\r\n\r\n### Add `shuffle_dataset` option to `SFTTrainer`\r\n\r\nYou can now shuffle the dataset in `SFTTrainer` by setting the `shuffle_dataset` argument to `True` in `SFTConfig`. This is useful when the dataset features high similarity between consecutive samples.\r\n\r\n```python\r\nfrom trl import SFTTrainer, SFTConfig\r\n\r\nSFTConfig(shuffle_dataset=True)\r\n```\r\n\r\nby @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4564\r\n\r\n### Add SAPO Loss in GRPO\r\n\r\nSoft Adaptive Policy Optimization (SAPO), replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. 
Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO.\r\n\r\nYou can now use SAPO loss in `GRPOTrainer` by setting `loss_type=\"sapo\"` in the `GRPOConfig`.\r\n\r\nby @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4600\r\n\r\n### Other Features\r\n\r\n* Support completion bootstrap for VLM in GRPO\u002FRLOO by @SolarWindRider in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4452\r\n* Add support for images inside tables with Trackio completions logging by @taha-yassine in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4505\r\n* Add step time metric to GRPO Trainer for performance tracking by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4516\r\n* Add target_parameters to LoraConfig by @jonnyli1125 in https:\u002F\u002Fgithub.c","2025-12-09T20:51:12",{"id":197,"version":198,"summary_zh":199,"released_at":200},198144,"v0.25.1","## What's Changed\r\n\r\n* Replace accelerate logging with stdlib in CLI by @lewtun in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4512\r\n* Add temporary workaround for `lr_scheduler_kwargs` dtype issue in Transformers 4.57.0 by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4513\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fcompare\u002Fv0.25.0...0.25.1","2025-11-12T16:51:21",{"id":202,"version":203,"summary_zh":204,"released_at":205},198145,"v0.25.0","## Features\r\n\r\n* 💤 Switch to sleep level=2 and split wake-ups in GRPO and RLOO trainers by @xxrjun in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4296\r\n* Added custom `prepare_model_for_kbit_training` to save VRAM by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4335\r\n* Add `add_generation_prompt` to processor_kwargs in GRPO and RLOO trainer by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4361\r\n* Add support for Trackio completions logging in GRPOTrainer by @taha-yassine in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4359\r\n* Support chat_template_kwargs by @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4350\r\n* GRPO: ScaleRL -> Support casting LM Head to FP32 by @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4303\r\n* Support casting to fp32 when word embeddings are tied to lm_head by @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4446\r\n* 💬 Add chat to vLLM client and server, update trainer calls by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4450\r\n\r\n## Experimental\r\n\r\n* 🚚 Move BCO to `trl.experimental` by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4312\r\n* 👑 [experimental] GOLD Trainer by @kashif in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4349\r\n* Add PAPOTrainer for preference-based optimization by @SolarWindRider in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4334\r\n* [GFPO] fix the GFPO loss calculation error caused by unmodified old_per_token_logps by @Peter-Chou in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4454\r\n* 🕹️ Add rollout function for 
OpenEnv integration by @lewtun in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4310\r\n\r\n## Fixes\r\n\r\n* [Activation-checkpointing] add tensor dedup and param offloading by @kashif in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4247\r\n* Fix attn_implementation name in OnlineDPO for transformers v5 by @albertvillanova in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4322\r\n* Hotfix: Fall back to config.text_config._name_or_path if missing config._name_or_path by @albertvillanova in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4324\r\n* Fix GRPO and RLOO trainers for continuous batching by @albertvillanova in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4348\r\n* Fix: `add_generation_prompt=True` for conversational only by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4362\r\n* Remove ignored max_length parameter from PRMTrainer data collator by @albertvillanova in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4355\r\n* Fix add_generation_prompt arg for paged transformers in GRPO and RLOO trainers by @albertvillanova in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4370\r\n* Fix GKD Liger memory spike by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4140\r\n* Fix GRPO with replay buffer by inserting images in the prompt by @albertvillanova in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4391\r\n* fix: Remove chat template setting from non-SFT trainer scripts by @behroozazarkhalili in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4437\r\n* 🖼️ Fix reporting images with vLLM by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4476\r\n\r\n## Documentation and Examples\r\n\r\n* Added SFT LoRA notebook by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4244\r\n* Update notebooks README with latest additions by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4316\r\n* Add notebooks to Examples docs and restructure by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4317\r\n* Highlight OpenEnv in landing docs by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4327\r\n* Update OpenEnv docs by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4328\r\n* Add OpenEnv blog to landing by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4333\r\n* 🗞️ Update \"What's New\" by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4338\r\n* Update Reducing Memory Consumption guide with more details by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4332\r\n* Fixed links inside Tips in docs by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4360\r\n* 🔥 docs: Add RapidFire AI integration guide by @kamran-rapidfireAI in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4340\r\n* Fix paper link for \"Towards Efficient and Exact Optimization of Language Model Alignment\" by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4409\r\n* Migrate experimental trl feature docs  by 
@ethanknights in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4411\r\n* Update SFT QLoRA notebook with **14B** model on free Colab by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4336\r\n* Create \"Talks\" subsection by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4414\r\n* Openenv wordle example by @burtenshaw in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4357\r\n* docs: Remove outdated conversational dataset conversion guidance by @behroozazarkhalili in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4422\r\n* docs: List all trainers that support Liger Kernel by @behroozazarkhalili in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4432\r\n* Add On-Policy Distillation from thinking lab","2025-11-06T00:18:30",{"id":207,"version":208,"summary_zh":209,"released_at":210},198146,"v0.24.0","## Features\r\n\r\n* Add accuracy reward by @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4270\r\n* Add support for `token_type_ids` in `DPOTrainer` by @aweers in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4285\r\n* 💰 `RichProgressCallback` enhancement by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4245\r\n* Include `chat_template_kwargs` in `apply_chat_template` by @cmpatino in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4233\r\n* 🏷️ Account for `token_type_ids` in `DataCollatorForVisionLanguageModeling` by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4190\r\n* 🎨 Support mixing image+text and text-only examples by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4203\r\n* 🎁 `RewardTrainer` refactor by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4093\r\n* 🎞️ Support sequence classification models in `clone_chat_template` by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4097\r\n* ✨ Add logging for training completion and model saving in training scripts by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4048\r\n* 🖨️ Print rich table for messages by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4160\r\n* 😴 Add `vllm_enable_sleep_mode` to RLOO Trainer by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4107\r\n* 📽 Multi image support for GRPO\u002FRLOO by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4113\r\n* 👁️ Add VLM support to RLOO trainer by @behroozazarkhalili in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4067\r\n* ℹ️ Enable XPU for vLLM client by @jiqing-feng in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4031\r\n* 🧶 feat: Add WeaveCallback for W&B Weave integration by @parambharat in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4089\r\n\r\n## Fixes\r\n\r\n* [Online-DPO] fix the completion_len == max_new_tokens crash by @kashif in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4193\r\n* Fix entropy and accuracy calculation for prompt_tuning techniques. 
by @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4196\r\n* Fix prompt-completion labeling with add_generation_prompt and warning by @behroozazarkhalili in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4201\r\n* 🌡️ Have vLLM return processed (temperature scaled) log probs by @YonatanGideoni in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4163\r\n* Fix handling of f_divergence_type in DPO by @albertvillanova in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4171\r\n* ⚡ Fix Flash Attention x Padding-Free loss by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4170\r\n* Pass required token_type_ids by @albertvillanova in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4148\r\n* 👩‍🦯 Fix usage of VLM using text only by @SamuelBarryCS in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4080\r\n* ⚓ [vllm] ensure MASTER_ADDR\u002FMASTER_PORT are set safely by @kashif in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4057\r\n* 📤 Fix a dataset loading bug in scripts by @singing-cat in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4124\r\n* 🐯 fix: use_liger_kernel with IterableDataset by @jue-jue-zi in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4087\r\n* [GKD] Fix `batchmean` reduce op in GKDTrainer's loss by @cmpatino in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4105\r\n* Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4081\r\n* Aux loss is already included in the loss returned by Transformers by @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4078\r\n* ♨️ [GRPO] Fix potential hang in `get_high_entropy_mask` by @akakakakakaa in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4041\r\n\r\n## Documentation\r\n\r\n* Remove logging.md: trainer-specific metrics documentation by @behroozazarkhalili in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4269\r\n* Remove using_llama_models.md: outdated Llama2-specific documentation by @behroozazarkhalili in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4268\r\n* Remove how_to_train.md: outdated training FAQ by @behroozazarkhalili in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4267\r\n* Add Qwen3-VL notebooks (SFT, GRPO) by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4275\r\n* Remove obsolete research_projects directory by @behroozazarkhalili in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4243\r\n* Add Efficient Online Training with GRPO and vLLM in TRL to community tutorials by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4219\r\n* Add trainers taxonomy to docs by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4195\r\n* Updated vLLM integration guide by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4162\r\n* [DOCS] Lora without regret by @burtenshaw in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4181\r\n* Add docstring for OnlineTrainerState by @albertvillanova 
in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4166\r\n* ⚖️ Align SFT and DPO for model creation and deprecate `DPOConfig.padding_value` in favour of `pad_token_id` by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4006\r\n* 🏞️ Context Parallelism benchmark guide by @sergiopanie","2025-10-16T00:29:40",{"id":212,"version":213,"summary_zh":214,"released_at":215},198147,"v0.23.1","## What's Changed\r\n\r\n* ♨️ [GRPO] Fix potential hang in `get_high_entropy_mask` by @akakakakakaa in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4041\r\n* Aux loss is already included in the loss returned by Transformers by @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4078\r\n* Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4081\r\n* 🐯 fix: use_liger_kernel with IterableDataset by @jue-jue-zi in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4087\r\n* [SFTrainer]: Fix DFT Loss by @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4112\r\n* ⚡ Fix Flash Attention x Padding-Free loss by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4170\r\n\r\n## New Contributors\r\n\r\n* @Hoesu made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4081\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fcompare\u002Fv0.23.0...v0.23.1","2025-10-02T05:20:49",{"id":217,"version":218,"summary_zh":219,"released_at":220},198148,"v0.23.0","## Major\r\n\r\n### 🥓 Context Parallelism\r\n\r\nSFT now supports Context Parallelism (CP) for training large language models on very large sequences. You can now train with an arbitrarily long sequence length.\r\n\r\n\u003Cimg width=\"844\" height=\"336\" alt=\"Screenshot 2025-09-09 at 10 39 30 PM\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ff1dfc349-440a-4e05-aac9-439a3c286f08\" \u002F>\r\n\r\nby @kashif in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3994\r\n\r\n### 🧨 Dynamic Fine-Tuning\r\n\r\nDynamic Fine-Tuning (DFT) is now supported in TRL.\r\n\r\n```python\r\nfrom trl import SFTConfig\r\n\r\ntraining_args = SFTConfig(\r\n    loss_type=\"dft\",\r\n    ...\r\n)\r\n```\r\n\r\n\u003Cimg width=\"692\" height=\"472\" alt=\"Screenshot 2025-09-09 at 10 37 36 PM\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F4ee2b4ab-7cc6-4578-bfac-c38124891510\" \u002F>\r\n\r\nby @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4042\r\n\r\n### 🪵 Truncated Importance Sampling (TIS) to address rollout-training mismatch\r\n\r\nDifferent implementations are used for rollout generation (vLLM) and model training. This implementation gap implicitly turns on-policy RL into off-policy RL. Truncated Importance Sampling (TIS) is a simple yet effective importance sampling technique for handling this discrepancy. 
This is now implemented in GRPO.\r\n\r\n```python\r\nfrom trl import GRPOConfig\r\n\r\ntraining_args = GRPOConfig(\r\n    ...\r\n    use_vllm=True,\r\n    vllm_importance_sampling_correction=True, # default True\r\n    vllm_importance_sampling_cap=2.0, # hyper-parameter C\r\n)\r\n```\r\n\r\nby @LeonEricsson in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3867\r\n\r\n### 🥣 [SFTTrainer]: Add Aux Loss for MoE models\r\n\r\nMixture of Experts (MoE) models require an auxiliary loss to ensure that the different experts are used evenly. This auxiliary loss is now supported in SFTTrainer.\r\n\r\n```python\r\ntraining_args = SFTConfig(\r\n    model_init_kwargs={\"output_router_logits\": True},\r\n    ...\r\n)\r\n```\r\n\r\nby @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4012\r\n\r\n### 💤 [GRPO\u002FRLOO] Adds an option to sleep vllm when running in colocated mode\r\n\r\nWhen running GRPO (or RLOO) with vLLM in colocated mode, the vLLM server consumes VRAM during optimization while not being used. We now have an option to put the vLLM server to sleep during optimization to free up VRAM.\r\n\r\n```python\r\nfrom trl import GRPOConfig\r\n\r\ntraining_args = GRPOConfig(..., vllm_sleep_enabled=True)\r\n```\r\n\r\nby @edbeeching in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3968\r\n\r\n### ⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer\r\n\r\nYou can now use vLLM server mode with OnlineDPOTrainer. Additionally, VLM models are now supported.\r\n\r\nby @vaelev in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3783\r\n\r\n### Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations\r\n\r\nThe paper index has been significantly enhanced with the addition of 9+ new algorithm implementations, providing a more comprehensive resource for users.\r\n\r\nby @behroozazarkhalili in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3990\r\n\r\n### Other Notable Changes\r\n\r\n* 👷 Added Kernels on the Hub x TRL guide by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3969\r\n* 🌵 Refactor entropy_from_logits for memory efficiency by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4013\r\n\r\n## What's Changed\r\n\r\n* ⬆️ Bump dev version by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3978\r\n* 👮 Fix GRPO CLI by setting parameters for `get_soft_overlong_punishment` by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3972\r\n* 🪃 `args.gradient_checkpointing = False` instead of `args = dataclasses.replace(args, gradient_checkpointing=False)` by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3981\r\n* [GRPO] Adds an option to sleep vllm when running in colocated mode by @edbeeching in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3968\r\n* 🎯 Add Trackio integration documentation and update TOC by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3971\r\n* ⚖️ Fix scale_rewards issue in GRPO by @Peter-Chou in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3992\r\n* ⏰ fix: add return to shift_tokens_right by @ginkyenglee in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3987\r\n* Add pre-commit and hf-doc-builder as dev dependencies by @albertvillanova in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3993\r\n* [GRPO] Truncated Importance Sampling to address rollout-training mismatch by @LeonEricsson in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3867\r\n* Fixed tags shown problem in memory usage docs by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3999\r\n* ✖️ Support pad-to-multiple-of and padding-free by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3996\r\n* 💾 [bugfix] fix PPO save_checkpoint by @hjh0119 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3998\r\n* [GRPO]: Fix Multi-GPU training for Entropy based masking of tokens. by @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3964\r\n* 📏 `torch_dype` to `dtype` everywhere by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F4000\r\n* Comprehensive Paper Index Enhancement","2025-09-10T04:39:53",{"id":222,"version":223,"summary_zh":224,"released_at":225},198149,"v0.22.2","## What's Changed\r\n\r\n* ⚖️ Fix scale_rewards issue in GRPO by @Peter-Chou in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3992\r\n* ⏰ fix: add return to shift_tokens_right by @ginkyenglee in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3987\r\n* ✖️ Support pad-to-multiple-of and padding-free by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3996\r\n\r\n## New Contributors\r\n* @Peter-Chou made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3992\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fcompare\u002Fv0.22.1...v0.22.2","2025-09-03T14:44:47",{"id":227,"version":228,"summary_zh":229,"released_at":230},198150,"v0.22.1","## What changed\r\n- Refactor version retrieval to use `importlib.metadata` by @qgallouedec\r\n- Release: 0.22.1 by @qgallouedec\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fcompare\u002Fv0.22.0...v0.22.1","2025-08-29T22:11:44",{"id":232,"version":233,"summary_zh":234,"released_at":235},198151,"v0.22.0","## Major\r\n\r\n### 🔮 Native VLM support for `SFTTrainer`\r\n\r\n`SFTTrainer` now natively supports Vision-Language Models (VLMs). This includes support for both language modeling and prompt-completion data. \r\nIt also supports completion-only training.\r\n\r\n\u003Cimg width=\"1136\" height=\"586\" alt=\"Group 291-6\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F2629b8e7-d853-4b7c-91d5-f4c128287e04\" \u002F>\r\n\r\n```python\r\nfrom trl import SFTConfig, SFTTrainer\r\nfrom datasets import load_dataset\r\n\r\ntrainer = SFTTrainer(\r\n    model=\"Qwen\u002FQwen2.5-VL-3B-Instruct\",\r\n    args=SFTConfig(max_length=None),\r\n    train_dataset=load_dataset(\"trl-lib\u002Fllava-instruct-mix\", split=\"train\"),\r\n)\r\ntrainer.train()\r\n```\r\n\r\nby @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3862, https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3907 and https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3908\r\n\r\n### 🔥 `RLOOTrainer` refactor\r\n\r\n`RLOOTrainer` has been refactored to align with the design principles of the other trainers in the library. 
You can now use this trainer exactly like GRPO.\r\n\r\n```python\r\nfrom datasets import load_dataset\r\nfrom trl import RLOOConfig, RLOOTrainer\r\n\r\ndataset = load_dataset(\"trl-lib\u002Fultrafeedback-prompt\", split=\"train\")\r\n\r\n# Dummy reward function for demonstration purposes\r\ndef reward_num_unique_letters(completions, **kwargs):\r\n    \"\"\"Reward function that rewards completions with more unique letters.\"\"\"\r\n    completion_contents = [completion[0][\"content\"] for completion in completions]\r\n    return [float(len(set(content))) for content in completion_contents]\r\n\r\ntrainer = RLOOTrainer(\r\n    model=\"Qwen\u002FQwen2-0.5B-Instruct\",\r\n    reward_funcs=reward_num_unique_letters,\r\n    train_dataset=dataset,\r\n)\r\ntrainer.train()\r\n```\r\n\r\nby @shirinyamani in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3801\r\n\r\n### 🧭 HF jobs x TRL guide\r\n\r\nYou can now leverage Hugging Face Jobs to easily train and deploy your models with TRL.\r\n\r\n```bash\r\nhf jobs uv run --flavor a100-large --secrets HF_TOKEN \"https:\u002F\u002Fraw.githubusercontent.com\u002Fhuggingface\u002Ftrl\u002Fmain\u002Ftrl\u002Fscripts\u002Fsft.py\" --model_name_or_path Qwen\u002FQwen2-0.5B --dataset_name trl-lib\u002FCapybara\r\n```\r\n\r\nA guide is available in the [docs](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fmain\u002Fen\u002Ftraining_jobs).\r\n\r\nby @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3890\r\n\r\n### 🏌️ DAPO loss type\r\n\r\n`GRPOTrainer` now supports the DAPO loss type, which aggregates token-level losses by normalizing with the number of active tokens in the global accumulated batch. This method was introduced to eliminate length bias. Simply use:\r\n\r\n```python\r\nfrom trl import GRPOConfig, GRPOTrainer\r\n\r\ntraining_args = GRPOConfig(\r\n    loss_type=\"dapo\",\r\n    ...\r\n)\r\n```\r\n\r\nby @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3938\r\n\r\n### 🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch\r\n\r\nThe authors of [Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2508.08221) find that the combination of:\r\n\r\n1. scaling rewards by the standard deviation computed over the entire batch and\r\n2. aggregating loss over the total number of tokens\r\n\r\ncan unlock the learning capability of critic-free policies using vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and [DAPO](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2503.14476).\r\n\r\nTRL supports applying these findings when training a GRPO model:\r\n\r\n```python\r\nfrom trl import GRPOConfig\r\n\r\ntraining_args = GRPOConfig(\r\n    scale_rewards=\"batch\",\r\n    loss_type=\"dapo\",\r\n    ...\r\n)\r\n```\r\n\r\nby @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3935\r\n\r\n### 🎢 [Callbacks] BEMA\r\n\r\nBias-Corrected Exponential Moving Average (BEMA) improves the stability and efficiency of language model fine-tuning by reducing stochasticity and eliminating bias. 
To use BEMA with SFT as described in the paper, you can now use the [`BEMACallback`]:\r\n\r\n```python\r\nfrom trl import BEMACallback, SFTTrainer\r\n\r\ntrainer = SFTTrainer(\r\n    ...\r\n    callbacks=[BEMACallback()],\r\n)\r\n```\r\n\r\nby @kashif in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3855\r\n\r\n## Minor\r\n\r\n* 🎀 New defaults: `gradient_checkpointing=True` by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3510\r\n* 🎚️ Add dataset mixer by @lewtun in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3791\r\n* 💇 Add soft overlong punishment reward function and update documentation by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3804\r\n* 🗿 [CPO] Add AlphaPO method via CPOTrainer by @kashif in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3824\r\n* 🗳️ Extend BCO Trainer dataset format support by @reihig-ut in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3134\r\n* 🐯 Support assistant-only training and Liger by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3914\r\n* 🎆 Add entropy logging in SFT by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3940\r\n* 📸 Return `position_ids` for `flash_attention_3` by @jue-jue-zi in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3942\r\n\r\n## Deprecations\r\n\r\n* 🗑️ Deprecate ","2025-08-29T22:07:33",{"id":237,"version":238,"summary_zh":239,"released_at":240},198152,"v0.21.0","## Major and breaking\r\n\r\n### 🌺 OpenAI GPT OSS & Harmony support\r\n\r\n\u003Cimg width=\"4544\" height=\"2344\" alt=\"Group 293-2\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F17241da2-1b1d-41bc-a5f8-0983ea46606f\" \u002F>\r\n\r\nOpenAI GPT OSS models are here! Check out the [OpenAI Cookbook](https:\u002F\u002Fcookbook.openai.com\u002Farticles\u002Fgpt-oss\u002Ffine-tune-transfomers) to see an example of how to SFT these models.\r\n\r\nby @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3848\r\n\r\n### Add vLLM transformers backend to online methods\r\n\r\nYou can now pass `vllm_model_impl` to the TRL vLLM server.\r\nFor example, for the `transformers` backend:\r\n\r\n```\r\ntrl vllm-serve ... 
--vllm_model_impl transformers\r\n```\r\n\r\nby @merveenoyan in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3773\r\n\r\n## What's Changed\r\n* ⬆️ Bump dev version by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3793\r\n* Fix broken PEFT+TRL docs link in `using_llama_models.md` by @bwook00 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3794\r\n* 🐙 Add MPO VLM example script by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3799\r\n* Examples list updated in docs by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3806\r\n* Add vLLM transformers backend to online methods by @merveenoyan in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3773\r\n* Correction parameter description by @1787648106 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3803\r\n* Add GSPO script examples (VLM\u002FLLM) by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3810\r\n* add xpu support for mergekit by @yao-matrix in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3800\r\n* GSPO parameters update from v2 by @BounharAbdelaziz in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3798\r\n* fix CI docs and grpo slow test by @kashif in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3814\r\n* Performance optimization: Replace list comprehensions with tensor operations in BCO and KTO trainers by @chi2liu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3813\r\n* Improve trainer doc by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3818\r\n* Add 'Post training a VLM for reasoning with GRPO using TRL' recipe to Community tutorials by @sergiopaniego in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3843\r\n* [GRPO]: Fix Entropy Mask Threshold Calculation when using Multi-GPU training by @pramodith in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3833\r\n* 🪦 Remove deprecated by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3817\r\n* 🌺 OpenAI GPT OSS & Harmony support by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3848\r\n* Release: v0.21 by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3849\r\n\r\n## New Contributors\r\n* @bwook00 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3794\r\n* @merveenoyan made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3773\r\n* @1787648106 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3803\r\n* @BounharAbdelaziz made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3798\r\n* @chi2liu made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fpull\u002F3813\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fcompare\u002Fv0.20.0...v0.21.0","2025-08-05T17:01:43"]