[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Liuziyu77--Visual-RFT":3,"tool-Liuziyu77--Visual-RFT":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":79,"owner_twitter":79,"owner_website":79,"owner_url":82,"languages":83,"stars":100,"forks":101,"last_commit_at":102,"license":103,"difficulty_score":10,"env_os":104,"env_gpu":105,"env_ram":106,"env_deps":107,"category_tags":117,"github_topics":79,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":118,"updated_at":119,"faqs":120,"releases":149},3177,"Liuziyu77\u002FVisual-RFT","Visual-RFT","Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' & 'Visual-ARFT: Visual Agentic Reinforcement Fine-Tuning'’","Visual-RFT 是一款专注于提升多模态大模型视觉感知能力的开源微调框架。它率先将 Deepseek-R1 的强化学习策略完整迁移至视觉领域，旨在解决现有大型视觉语言模型（LVLM）在复杂视觉任务中推理能力不足的问题。通过以 Qwen2-VL 为基座模型，Visual-RFT 设计了一套基于规则的“可验证奖励”机制，并结合 GRPO 强化学习算法进行高效微调，显著增强了模型在开放词汇检测、少样本检测、推理定位及细粒度图像分类等任务上的表现。\n\n此外，该项目还衍生出 Visual-ARFT 版本，进一步赋予模型智能体（Agent）能力，使其能够自主浏览网页获取实时信息，或编写代码对图像进行裁剪、旋转等专业处理。Visual-RFT 的核心亮点在于其创新的奖励设计与对 R1 推理范式的成功适配，为多模态研究提供了新的技术路径。\n\n这款工具非常适合人工智能研究人员、算法工程师以及希望深入探索多模态强化学习的开发者使用。无论是需要复现前沿论文成果，还是希望定制具备更强视觉推理与自主操作能力的专属模型，Visual-RFT 都提供了成熟的代码实现与数据集支持，助力用户轻松开启高阶模型优化之旅","Visual-RFT 是一款专注于提升多模态大模型视觉感知能力的开源微调框架。它率先将 Deepseek-R1 的强化学习策略完整迁移至视觉领域，旨在解决现有大型视觉语言模型（LVLM）在复杂视觉任务中推理能力不足的问题。通过以 
Qwen2-VL 为基座模型，Visual-RFT 设计了一套基于规则的“可验证奖励”机制，并结合 GRPO 强化学习算法进行高效微调，显著增强了模型在开放词汇检测、少样本检测、推理定位及细粒度图像分类等任务上的表现。\n\n此外，该项目还衍生出 Visual-ARFT 版本，进一步赋予模型智能体（Agent）能力，使其能够自主浏览网页获取实时信息，或编写代码对图像进行裁剪、旋转等专业处理。Visual-RFT 的核心亮点在于其创新的奖励设计与对 R1 推理范式的成功适配，为多模态研究提供了新的技术路径。\n\n这款工具非常适合人工智能研究人员、算法工程师以及希望深入探索多模态强化学习的开发者使用。无论是需要复现前沿论文成果，还是希望定制具备更强视觉推理与自主操作能力的专属模型，Visual-RFT 都提供了成熟的代码实现与数据集支持，助力用户轻松开启高阶模型优化之旅。","\u003Cp align=\"center\">\n\u003C!--   \u003Ch1 align=\"center\">\u003Cimg src=\"assets\u002Flogo.png\" width=\"256\">\u003C\u002Fh1> -->\n  \u003Ch1 align=\"center\">Visual-RFT: Visual Reinforcement Fine-Tuning\u003C\u002Fh1>\n    \u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FLiuziyu77\">\u003Cstrong>Ziyu Liu*\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FSunzeY\">\u003Cstrong>Zeyi Sun*\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Fyuhangzang.github.io\u002F\">\u003Cstrong>Yuhang Zang\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Flightdxy.github.io\u002F\">\u003Cstrong>Xiaoyi Dong\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=sJkqsqkAAAAJ\">\u003Cstrong>Yuhang Cao\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Fkennymckormick.github.io\u002F\">\u003Cstrong>Haodong Duan\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n     \u003Ca href=\"http:\u002F\u002Fdahua.site\u002F\">\u003Cstrong>Dahua Lin\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n     \u003Ca href=\"https:\u002F\u002Fmyownskyw7.github.io\u002F\">\u003Cstrong>Jiaqi Wang\u003C\u002Fstrong>\u003C\u002Fa>\n  \u003C\u002Fp>\n  \u003Ch2 align=\"center\">Accepted By ICCV 2025!\u003C\u002Fh2>\n\u003C!-- 🏠\u003Ca href=\"https:\u002F\u002Fliuziyu77.github.io\u002FMIA-DPO\u002F\">Homepage\u003C\u002Fa>\u003C\u002Fh3>| -->\n  📖\u003Ca 
href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01785\">Paper\u003C\u002Fa> |\n  🤗\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flaolao77\u002Fvirft-datasets-67bc271b6f2833eccc0651df\">Datasets\u003C\u002Fa> | 🤗\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2503.01785\">Daily Paper\u003C\u002Fa>\u003C\u002Fh3>\n\u003Cdiv align=\"center\">\u003C\u002Fdiv>\n\u003Cp align=\"center\">\n  \u003Cp>\n🌈We introduce \u003Cstrong>Visual Reinforcement Fine-tuning (Visual-RFT)\u003C\u002Fstrong>, the first comprehensive adaptation of \u003Cstrong>Deepseek-R1's RL strategy\u003C\u002Fstrong> to the \u003Cstrong>multimodal field\u003C\u002Fstrong>. We use the Qwen2-VL-2\u002F7B model as our base model and design a \u003Cstrong>rule-based verifiable reward\u003C\u002Fstrong>, which is integrated into a \u003Cstrong>GRPO-based reinforcement fine-tuning framework\u003C\u002Fstrong> to enhance the performance of LVLMs across various visual perception tasks. \u003Cstrong>ViRFT\u003C\u002Fstrong> extends R1's reasoning capabilities to multiple visual perception tasks, including various detection tasks like \u003Cstrong>Open Vocabulary Detection, Few-shot Detection, Reasoning Grounding, and Fine-grained Image Classification\u003C\u002Fstrong>.\n  \u003C\u002Fp>\n\u003C!--     \u003Ca href=\"\">\n      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_readme_b35344156ded.png\" alt=\"Logo\" width=\"100%\"> \n    \u003C\u002Fa> -->\n\u003Cbr>\n\n\u003Ca href=\"\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_readme_0c8401d15d58.png\" alt=\"Logo\" >\n\u003C\u002Fa>\n\n## 🔥🔥🔥 Visual-RFT: Visual Reinforcement Fine-Tuning\nWe introduce *Visual Reinforcement Fine-tuning (Visual-RFT)*, the first comprehensive adaptation of Deepseek-R1’s RL strategy to the multimodal field. 
We use the Qwen2-VL-2\u002F7B model as our base model and design a rule-based verifiable reward, which is integrated into a GRPO-based reinforcement fine-tuning framework to enhance the performance of LVLMs across various visual perception tasks.\n\n📖\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01785\">Paper\u003C\u002Fa> | 🤗\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flaolao77\u002Fvirft-datasets-67bc271b6f2833eccc0651df\">Datasets\u003C\u002Fa> | 🤗\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2503.01785\">Daily Paper\u003C\u002Fa>\n\n## 🔥🔥🔥 Visual-ARFT: Visual Agentic Reinforcement Fine-Tuning\nOur new work *Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT)* is designed to give Large Vision-Language Models (LVLMs) flexible and adaptive agentic abilities. With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs’ agentic search and coding abilities.\n\n📖\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14246\">Paper\u003C\u002Fa> | 🤗\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flaolao77\u002FMAT\">Datasets\u003C\u002Fa> | 🤗\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flaolao77\u002Fvisual-arft-682c601d0e35ac6470adfe9f\">Models\u003C\u002Fa>\n\n\n## 📢 News\n- 🚀 [06\u002F26\u002F2025] Our paper **Visual-RFT** has been accepted by ICCV 2025!\n- 🚀 [05\u002F21\u002F2025] We support both **HuggingFace Dataset** format and **JSON** file format as input datasets for training.\n- 🚀 [05\u002F21\u002F2025] We updated the trainer of **Visual-RFT** to support both Qwen2-VL and Qwen2.5-VL. 
We also support multi-image inputs with `grpo_trainer_mp.py`.\n- 🚀 [05\u002F20\u002F2025] We release the **Visual-ARFT** repository \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT\u002Ftree\u002Fmain\u002FVisual-ARFT\">Repo-URL\u003C\u002Fa>: an RFT framework dedicated to enhancing the **multimodal agentic capabilities of LVLMs**. (Supports Qwen2-VL and Qwen2.5-VL)\n- 🚀 [03\u002F12\u002F2025] We release the code of **Visual-RFT** to build the \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT\u002Ftree\u002Fmain\u002Fdataset\">dataset\u003C\u002Fa> on your own data.\n- 🚀 [03\u002F04\u002F2025] We release the **Visual-RFT** \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01785\">Paper\u003C\u002Fa>.\n- 🚀 [03\u002F04\u002F2025] We upload our training datasets of **Visual-RFT** to \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flaolao77\u002Fvirft-datasets-67bc271b6f2833eccc0651df\">Huggingface\u003C\u002Fa>.\n- 🚀 [03\u002F04\u002F2025] We release the **Visual-RFT** repository and our training code.\n\n## 💡 Highlights\n- 🔥 **Visual Reinforcement Fine-tuning (Visual-RFT)**: We introduce Visual Reinforcement Fine-tuning (**Visual-RFT**), which extends reinforcement learning with verified rewards to visual perception tasks and remains effective when only limited data is available for fine-tuning.\n- 🔥 **Verified Rewards**: We design different **verified rewards** for different visual tasks that enable efficient, high-quality reward computation at a negligible cost. 
This allows the seamless transfer of the DeepSeek-R1-style reinforcement learning strategy to the multi-modal domain.\n- 🔥 **Extensive Experiments**: We conduct **extensive experiments** on various visual perception tasks, including fine-grained image classification, open vocabulary object detection, few-shot object detection, and reasoning grounding.\n- 🔥 **Open Source**: We fully **open-source** the training code, training data, and evaluation scripts on GitHub to facilitate further research.\n\n\n\u003Ca href=\"\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_readme_b35344156ded.png\" alt=\"Logo\" >\n\u003C\u002Fa>\n\n\n## Framework\nThe **Visual-RFT** framework is shown below. The policy model generates a group of responses based on the input. Each response is passed through a verifiable reward function to compute the reward. Once rewards have been computed for every response in the group, the quality of each response is evaluated relative to the group and used to update the policy model. To ensure the stability of the policy model training, **Visual-RFT** uses KL divergence to limit the difference between the policy model and the reference model. For ***more implementation details***, including data generation, the design of the ***verifiable reward***, and other aspects, please refer to our paper.\n\n\u003Ca href=\"\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_readme_601f4f11264e.png\" alt=\"Logo\" >\n\u003C\u002Fa>\n\n## 🛠️ Setup\n```\ngit clone https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT.git\nconda create -n Visual-RFT python=3.10\nconda activate Visual-RFT\nbash setup.sh\n```\n\n## Inference\nWe have uploaded the model trained on 200+ samples from the LISA dataset (\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FZery\u002FQwen2-VL-7B_visual_rft_lisa_IoU_reward\">🤗Huggingface\u003C\u002Fa>). You can use it to evaluate the inference performance of **Reasoning Grounding**. 
For more details, refer to `demo`.\n\n## Training\n### Datasets\nTo train on our various visual perception tasks, first visit \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flaolao77\u002Fvirft-datasets-67bc271b6f2833eccc0651df\">Huggingface Datasets\u003C\u002Fa> to download the datasets. We have uploaded different datasets for different tasks.\n\n| Dataset | Task | Setting | Description |\n|------------------------------|------|----|-----------------------------------------------------------------------------|\n| laolao77\u002FViRFT_COCO | Detection | - | It includes all categories from COCO, with a total of 6k entries. |\n| laolao77\u002FViRFT_COCO_base65 | Detection | Open Vocabulary | It includes 65 basic categories from COCO, with a total of 6k entries. |\n| laolao77\u002FViRFT_COCO_8_cate_4_shot | Detection | Few-shot | It includes 8 selected categories from COCO. |\n| laolao77\u002FViRFT_LVIS_few_shot | Detection | Few-shot | It includes 6 selected categories from LVIS. |\n| laolao77\u002FViRFT_CLS_flower_4_shot | Classification | Few-shot | It includes the 102 categories from the Flower102 dataset, with 4 images per category. |\n| laolao77\u002FViRFT_CLS_fgvc_aircraft_4_shot | Classification | Few-shot | It includes the 100 categories from the FGVC-Aircraft dataset, with 4 images per category. |\n| laolao77\u002FViRFT_CLS_car196_4shot | Classification | Few-shot | It includes the 196 categories from the Stanford Cars dataset, with 4 images per category. |\n| laolao77\u002FViRFT_CLS_pets37_4shot | Classification | Few-shot | It includes the 37 categories from the Pets37 dataset, with 4 images per category. 
|\n| LISA dataset | Grounding | - | Reasoning Grounding |\n\n> 🔔 If you want to build a dataset on your own data, you can refer to `dataset\u002Fbuild_dataset.ipynb`. Just provide a `json` file with `image`, `problem`, and `solution`.\n\n**Dataset Formats**\n🔦 We support both **HuggingFace Dataset** format and **JSON** file format as input datasets for training.\n\nRefer to \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT\u002Fblob\u002Fmain\u002Fsrc\u002Fvirft\u002Fsrc\u002Fopen_r1\u002Fgrpo.py\">grpo.py\u003C\u002Fa> for an example of the **HuggingFace Dataset** format.\n\nRefer to \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT\u002Fblob\u002Fmain\u002FVisual-ARFT\u002Fsrc\u002Fvisual_arft\u002Fsrc\u002Fopen_r1\u002Fgrpo_agent_search.py\">grpo_agent_search.py\u003C\u002Fa> for an example of the **JSON** format.\n\n### GRPO\nAfter downloading the dataset, you can start training using the following example bash script. Our bash scripts are in ```\u002Fsrc\u002Fscripts```.\n> 🔔 There's no need for prolonged training. For a dataset with only a few hundred samples, 200 steps should be sufficient.\n```\n# There's no need for prolonged training. 
For a dataset with only a few hundred samples, 200 steps should be sufficient.\nexport DEBUG_MODE=\"true\"\nexport LOG_PATH=\".\u002Fdebug_log_2b_GRPO_coco_base65cate_6k.txt\"\n\nexport DATA_PATH=.\u002Fshare_data\u002FViRFT_COCO_base65   ### local path of the dataset downloaded from Hugging Face\nexport CKPT_PATH=.\u002Fshare_models\u002FQwen2-VL-2B-Instruct    ### Qwen2-VL-2B checkpoint path\nexport SAVE_PATH=.\u002Fshare_models\u002FQwen2-VL-2B-Instruct_GRPO_coco_base65cate_6k    ### save path\n\ntorchrun --nproc_per_node=\"8\" \\\n    --nnodes=\"1\" \\\n    --node_rank=\"0\" \\\n    --master_addr=\"127.0.0.1\" \\\n    --master_port=\"12345\" \\\n    src\u002Fopen_r1\u002Fgrpo.py \\\n    --output_dir ${SAVE_PATH}  \\\n    --model_name_or_path ${CKPT_PATH} \\\n    --dataset_name ${DATA_PATH} \\\n    --deepspeed local_scripts\u002Fzero3.json \\\n    --max_prompt_length 1024 \\\n    --per_device_train_batch_size 1 \\\n    --gradient_accumulation_steps 2 \\\n    --logging_steps 1 \\\n    --bf16 \\\n    --report_to wandb \\\n    --gradient_checkpointing false \\\n    --attn_implementation flash_attention_2 \\\n    --max_pixels 401408 \\\n    --num_train_epochs 1 \\\n    --run_name Qwen2-VL-2B_GRPO_coco_base65cate_6k \\\n    --save_steps 100 \\\n    --save_only_model true \\\n    --num_generations 8\n```\n\n### OOM Tips \n⏰ Running into OOM (Out-Of-Memory) issues during training is quite common, especially when using GPUs with limited memory. \n\n🔦 But no worries: here are some helpful **OOM tips** for you:\n\n1. **About distributed training:** You can alleviate memory pressure by specifying the `--deepspeed` argument, e.g. `--deepspeed \u002Fsrc\u002Fvisual_arft\u002Flocal_scripts\u002Fzero3.json`. If memory is still insufficient, you can further reduce the load by using: `--deepspeed \u002Fsrc\u002Fvisual_arft\u002Flocal_scripts\u002Fzero3_offload.json`.\n\n2. 
**About the number of generations per group in GRPO:** You can reduce GPU memory usage by lowering the `--num_generations` parameter. In the example script, the value is `--num_generations 8`, but you can try setting it to 4 to save memory. Keep in mind, though, that a smaller `--num_generations` may lead to worse performance.\n  \n3. **About gradient_checkpointing:** Moreover, setting `--gradient_checkpointing` to `true` can save memory, allowing for a higher `--num_generations` limit, which leads to better training performance. However, it will slow down the training process.\n\n4. **About image resolution:** If you're still encountering OOM issues, you can also reduce the resolution of the images in the training dataset (for example, by lowering `--max_pixels` in the example script).\n\n### SFT\nWe use \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory\">LLaMA-Factory\u003C\u002Fa> for supervised fine-tuning (SFT) of the model. You can convert the downloaded dataset into the corresponding Qwen SFT format for training.\n\n## Evaluation\nWe conducted extensive experiments on various visual perception tasks, including **fine-grained image classification**, **open vocabulary object detection**, **few-shot object detection**, and **reasoning grounding**. **ViRFT** achieves remarkable performance improvements across these tasks with minimal data and computational cost, significantly surpassing supervised fine-tuning baselines.\n\n> We provide a step-by-step tutorial for using the evaluation code. If you encounter any issues, feel free to open an issue.\n\n### COCO Evaluation\nYou can use the files in the ```coco_evaluation``` directory to run model inference and obtain evaluation results. 
Our code supports multi-GPU evaluation, and it requires at least two GPUs.\n\nFor ***inference***: \n```\ncd .\u002Fcoco_evaluation\npython Qwen2_VL_coco_infere.py\n```\nPlease note that some file paths and model paths in ```Qwen2_VL_coco_infere.py``` need to be modified.\n```\n### lines 167-168: change to your model path and model base.\nmodel_path = \".\u002Fshare_models\u002FQwen2-VL-2B-Instruct_RL\u002F\"  # RL model\nmodel_base = \".\u002Fshare_models\u002FQwen2-VL-2B-Instruct\u002F\"  # original Qwen2-VL model\n### line 182: change to your COCO val annotation path\nwith open('.\u002Fdata\u002Fcoco\u002Fannotations\u002Finstances_val2017.json', 'r') as json_file:\n### line 224: modify according to your own image path.\nimage_path = '.\u002Fdata\u002Fcoco\u002Fval2017\u002F'+image['file_name']    \n### lines 231-241: select the categories you want to evaluate\nselected_cate = ['bus', 'train', 'fire hydrant', 'stop sign', 'cat', 'dog', 'bed', 'toilet']\n### line 350: results save path\nwith open(f'prediction_results.json', 'w') as json_file:\n```\nThe inference results will be saved in `JSON` format and later used for evaluation.\n\nFor ***evaluation***, just run ```.\u002Fcoco_evaluation\u002Fevaluation.ipynb``` step by step.\n\n### LVIS Evaluation\nYou can use the files in the ```lvis_evaluation``` directory to run model inference and obtain evaluation results. 
Our code supports multi-GPU evaluation, and it requires at least two GPUs.\n\nFor ***inference***: \n```\ncd .\u002Flvis_evaluation\npython Qwen2_VL_lvis_infere.py\n```\nPlease note that some file paths and model paths in ```Qwen2_VL_lvis_infere.py``` need to be modified.\n```\n### lines 169-170: change to your model path and model base\nmodel_path = \".\u002Fshare_models\u002FQwen2-VL-2B-Instruct_RL\u002F\"  # RL model\nmodel_base = \".\u002Fshare_models\u002FQwen2-VL-2B-Instruct\u002F\"  # original Qwen2-VL model\n### line 184: change to your LVIS val annotation path\nwith open('.\u002Fdata\u002Flvis\u002Fannotations\u002Flvis_v1_val.json', 'r') as json_file:\n### line 228: modify according to your own image path.\nimage_path = '.\u002Fdata\u002Flvis\u002F' + \"\u002F\".join(parts[-2:])   \n### lines 234-242: select the categories you want to evaluate\nselected_cate = ['horse_buggy', 'die', 'kitchen_table', 'omelet', 'papaya', 'stepladder']\n### line 346: results save path\nwith open(f'prediction_results.json', 'w') as json_file:\n```\nThe inference results will be saved in `JSON` format and later used for evaluation.\n\nFor ***evaluation***, just run ```.\u002Flvis_evaluation\u002Flvis_evaluation.ipynb``` step by step.\n\n### Classification Evaluation\nYou can use the files in the ```classification``` directory to run model inference and obtain evaluation results. Our code supports multi-GPU evaluation, and it requires at least two GPUs.\n```\ncd .\u002Fclassification\npython Qwen2_VL_classification_infere.py\n```\nPlease note that the model paths in ```Qwen2_VL_classification_infere.py``` need to be modified.\n```\n### lines 61-63: change to your model path and model base\nmodel_path = \".\u002Fshare_models\u002FQwen2-VL-2B-Instruct_RL\u002F\"  # after RL\nmodel_base = \".\u002Fshare_models\u002FQwen2-VL-2B-Instruct\u002F\"  # original Qwen2-VL\n```\nInference and result computation are performed simultaneously. 
After the program finishes running, the number of correctly classified items will be displayed in the command line, and the accuracy is obtained by dividing it by the size of the validation set. (Flower102: 2463, Pets37: 3669, Stanford Cars: 8041, FGVC-Aircraft: 3333)\n\n> 🔔 Sometimes, due to environment issues, the model may produce incorrect inferences when `use_cache = None`. You might consider explicitly setting `use_cache = True`.\n> `generated_ids = model.generate(**inputs, max_new_tokens=1024, use_cache=True)`\n\n### Evaluation Results\n*We have conducted **extensive experiments**; please refer to our paper for further details*.\n\n\n### Case Study\nIn the following figure, we present some inference examples from **ViRFT**. We observe that the thinking process significantly enhances the reasoning and grounding ability with **ViRFT**. Through **ViRFT**, Qwen2-VL learns to think critically and carefully examine the image to produce accurate grounding results.\n\u003Ca href=\"\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_readme_b9461758a5ad.png\" alt=\"Logo\" >\n\u003C\u002Fa>\nWe also present some inference cases of the model when handling *fine-grained classification tasks*. 
These results demonstrate the strong generalization ability of **ViRFT** across various visual tasks.\n\u003Ca href=\"\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_readme_6559078afdeb.png\" alt=\"Logo\" >\n\u003C\u002Fa>\n\n\n\n## ✒️Citation\n```\n@article{liu2025visual,\n  title={Visual-RFT: Visual Reinforcement Fine-Tuning},\n  author={Liu, Ziyu and Sun, Zeyi and Zang, Yuhang and Dong, Xiaoyi and Cao, Yuhang and Duan, Haodong and Lin, Dahua and Wang, Jiaqi},\n  journal={arXiv preprint arXiv:2503.01785},\n  year={2025}\n}\n@misc{liu2025visualagenticreinforcementfinetuning,\n      title={Visual Agentic Reinforcement Fine-Tuning}, \n      author={Ziyu Liu and Yuhang Zang and Yushan Zou and Zijian Liang and Xiaoyi Dong and Yuhang Cao and Haodong Duan and Dahua Lin and Jiaqi Wang},\n      year={2025},\n      eprint={2505.14246},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14246}, \n}\n```\n\n## 📄 License\n![Code License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-Apache_2.0-green.svg) ![Data License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FData%20License-CC%20By%20NC%204.0-red.svg) **Usage and License Notices**: The data and code are intended and licensed for research use only.\nLicense: Attribution-NonCommercial 4.0 International. Use should also abide by the policy of OpenAI: https:\u002F\u002Fopenai.com\u002Fpolicies\u002Fterms-of-use\n\n## Acknowledgement\nWe sincerely thank projects \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FDeep-Agent\u002FR1-V\">R1-V\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fopen-r1\">Open-R1\u003C\u002Fa>, and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Fopen-r1-multimodal\">Open-r1-multimodal\u003C\u002Fa> for providing their open-source resources.\n\n\n\n\n\n\n\n\n","\u003Cp align=\"center\">\n\u003C!--   \u003Ch1 
align=\"center\">\u003Cimg src=\"assets\u002Flogo.png\" width=\"256\">\u003C\u002Fh1> -->\n  \u003Ch1 align=\"center\">Visual-RFT：视觉强化微调\u003C\u002Fh1>\n    \u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FLiuziyu77\">\u003Cstrong>刘子宇*\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FSunzeY\">\u003Cstrong>孙泽义*\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Fyuhangzang.github.io\u002F\">\u003Cstrong>臧宇航\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Flightdxy.github.io\u002F\">\u003Cstrong>董晓毅\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=sJkqsqkAAAAJ\">\u003Cstrong>曹宇航\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n    \u003Ca href=\"https:\u002F\u002Fkennymckormick.github.io\u002F\">\u003Cstrong>段浩东\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n     \u003Ca href=\"http:\u002F\u002Fdahua.site\u002F\">\u003Cstrong>林大华\u003C\u002Fstrong>\u003C\u002Fa>\n    ·\n     \u003Ca href=\"https:\u002F\u002Fmyownskyw7.github.io\u002F\">\u003Cstrong>王佳琪\u003C\u002Fstrong>\u003C\u002Fa>\n  \u003C\u002Fp>\n  \u003Ch2 align=\"center\">已被ICCV 2025接收！\u003C\u002Fh2>\n\u003C!-- 🏠\u003Ca href=\"https:\u002F\u002Fliuziyu77.github.io\u002FMIA-DPO\u002F\">主页\u003C\u002Fa>\u003C\u002Fh3>| -->\n  📖\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01785\">论文\u003C\u002Fa> |\n  🤗\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flaolao77\u002Fvirft-datasets-67bc271b6f2833eccc0651df\">数据集\u003C\u002Fa> | 🤗\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2503.01785\">每日论文\u003C\u002Fa>\u003C\u002Fh3>\n\u003Cdiv align=\"center\">\u003C\u002Fdiv>\n\u003Cp align=\"center\">\n  
\u003Cp>\n🌈我们提出了\u003Cstrong>视觉强化微调（Visual-RFT）\u003C\u002Fstrong>,这是首次将\u003Cstrong>Deepseek-R1的强化学习策略\u003C\u002Fstrong>全面应用于\u003Cstrong>多模态领域\u003C\u002Fstrong>。我们以Qwen2-VL-2\u002F7B模型为基础，设计了一种基于规则的可验证奖励，并将其整合到基于GRPO的强化微调框架中，从而提升LVLMs在各类视觉感知任务中的性能。\u003Cstrong>ViRFT\u003C\u002Fstrong>将R1的推理能力扩展到了多种视觉感知任务，包括各种检测任务，如\u003Cstrong>开放词汇检测、少样本检测、推理性定位以及细粒度图像分类\u003C\u002Fstrong>。\n  \u003C\u002Fp>\n\u003C!--     \u003Ca href=\"\">\n      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_readme_b35344156ded.png\" alt=\"Logo\" width=\"100%\"> \n    \u003C\u002Fa> -->\n\u003Cbr>\n\n\u003Ca href=\"\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_readme_0c8401d15d58.png\" alt=\"Logo\" >\n\u003C\u002Fa>\n\n## 🔥🔥🔥 Visual-RFT：视觉强化微调\n我们提出了*视觉强化微调（Visual-RFT）*，这是首次将Deepseek-R1的强化学习策略全面应用于多模态领域。我们以Qwen2-VL-2\u002F7B模型为基础，设计了一种基于规则的可验证奖励，并将其整合到基于GRPO的强化微调框架中，以提升LVLMs在各类视觉感知任务中的性能。\n\n📖\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01785\">论文\u003C\u002Fa> | 🤗\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flaolao77\u002Fvirft-datasets-67bc271b6f2833eccc0651df\">数据集\u003C\u002Fa> | 🤗\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2503.01785\">每日论文\u003C\u002Fa>\n\n## 🔥🔥🔥 Visual-ARFT：视觉代理式强化微调\n我们的新工作*视觉代理式强化微调（Visual-ARFT）*旨在为大型视觉-语言模型（LVLMs）赋予灵活且自适应的代理能力。借助Visual-ARFT，开源的LVLMs具备了浏览网页以获取实时信息更新的能力，并能够编写代码对输入图像进行裁剪、旋转等图像处理操作。我们还提出了一套多模态代理工具基准测试（MAT），包含两种设置（MAT-Search和MAT-Coding），用于评估LVLMs的代理搜索与编码能力。\n\n📖\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14246\">论文\u003C\u002Fa> | 🤗\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flaolao77\u002FMAT\">数据集\u003C\u002Fa> | 🤗\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flaolao77\u002Fvisual-arft-682c601d0e35ac6470adfe9f\">模型\u003C\u002Fa>\n\n\n## 📢 新闻\n- 🚀 [2025年6月26日] 我们的论文**Visual-RFT**已被ICCV 2025接收！\n- 🚀 [2025年5月21日] 我们支持**HuggingFace 
Dataset**格式和**JSON**文件格式作为训练数据集。\n- 🚀 [2025年5月21日] 我们更新了**Visual-RFT**的训练器，使其同时支持Qwen2-VL和Qwen2.5-VL。并且通过`grpo_trainer_mp.py`支持多图像输入。\n- 🚀 [2025年5月20日] 我们发布了**Visual-ARFT**仓库：\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT\u002Ftree\u002Fmain\u002FVisual-ARFT\">仓库链接\u003C\u002Fa>：一个专门用于增强**LVLMs多模态代理能力**的RFT框架。（支持Qwen2-VL和Qwen2.5-VL）\n- 🚀 [2025年3月12日] 我们发布了**Visual-RFT**的代码，以便您使用自己的数据构建\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT\u002Ftree\u002Fmain\u002Fdataset\">数据集\u003C\u002Fa>。\n- 🚀 [2025年3月4日] 我们发布了**Visual-RFT**的\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01785\">论文\u003C\u002Fa>。\n- 🚀 [2025年3月4日] 我们将**Visual-RFT**的训练数据集上传至\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flaolao77\u002Fvirft-datasets-67bc271b6f2833eccc0651df\">Huggingface\u003C\u002Fa>。\n- 🚀 [2025年3月4日] 我们发布了**Visual-RFT**仓库及其训练代码。\n\n## 💡 亮点\n- 🔥 **视觉强化微调（Visual-RFT）**：我们提出了视觉强化微调（**Visual-RFT**），它将强化学习与针对视觉感知任务的可验证奖励相结合，在少量数据下即可实现高效的微调。\n- 🔥 **可验证奖励**：我们为不同的视觉任务设计了多种**可验证奖励**，能够在几乎不增加成本的情况下高效、高质量地计算奖励。这使得DeepSeek R1风格的强化学习策略能够无缝迁移到多模态领域。\n- 🔥 **广泛的实验**：我们在多种视觉感知任务上进行了**广泛的实验**，包括细粒度图像分类、开放词汇目标检测、少样本目标检测以及推理性定位等。\n- 🔥 **开源**：我们将训练代码、训练数据和评估脚本全部**开源**至Github，以促进进一步的研究。\n\n\n\u003Ca href=\"\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_readme_b35344156ded.png\" alt=\"Logo\" >\n\u003C\u002Fa>\n\n\n## 框架\n以下是**Visual-RFT**框架的示意图。策略模型根据输入生成一组响应，每条响应都会经过可验证奖励函数来计算奖励。对所有输出的奖励进行批量计算后，会评估每条响应的质量，并据此更新策略模型。为了确保策略模型训练的稳定性，**Visual-RFT**使用KL散度来限制策略模型与参考模型之间的差异。有关***更多实施细节***，包括数据生成、***可验证奖励***的设计以及其他方面，请参阅我们的论文。\n\n\u003Ca href=\"\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_readme_601f4f11264e.png\" alt=\"Logo\" >\n\u003C\u002Fa>\n\n## 🛠️ 部署\n```\ngit clone https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT.git\nconda create -n Visual-RFT python=3.10\nconda activate Visual-RFT\nbash 
setup.sh\n```\n\n## 推理\n我们已上传在 LISA 数据集的 200 多个样本上训练的模型（\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FZery\u002FQwen2-VL-7B_visual_rft_lisa_IoU_reward\">🤗Huggingface\u003C\u002Fa>）。您可以使用它来评估 **推理性定位** 的推理性能。更多详情请参阅 `demo`。\n\n## 训练\n### 数据集\n要针对我们的各种视觉感知任务进行训练，首先请访问 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flaolao77\u002Fvirft-datasets-67bc271b6f2833eccc0651df\">Huggingface 数据集\u003C\u002Fa>下载所需数据集。我们已为不同任务上传了不同的数据集。\n| 数据集 | 任务 | 设置 | 描述 |\n|------------------------------|------|----|-----------------------------------------------------------------------------|\n| laolao77\u002FViRFT_COCO | 检测 | - | 包含 COCO 的所有类别，共计 6k 条目。 |\n| laolao77\u002FViRFT_COCO_base65 | 检测 | 开放词汇 | 包含 COCO 的 65 个基础类别，共计 6k 条目。 |\n| laolao77\u002FViRFT_COCO_8_cate_4_shot | 检测 | 少样本 | 包含 COCO 中选定的 8 个类别。 |\n| laolao77\u002FViRFT_LVIS_few_shot | 检测 | 少样本 | 包含 LVIS 中选定的 6 个类别。 |\n| laolao77\u002FViRFT_CLS_flower_4_shot | 分类 | 少样本 | 包含 Flower102 数据集中的 102 个类别，每类 4 张图像。 |\n| laolao77\u002FViRFT_CLS_fgvc_aircraft_4_shot | 分类 | 少样本 | 包含 FGVC-Aircraft 数据集中的 100 个类别，每类 4 张图像。 |\n| laolao77\u002FViRFT_CLS_car196_4shot | 分类 | 少样本 | 包含斯坦福汽车数据集中的 196 个类别，每类 4 张图像。 |\n| laolao77\u002FViRFT_CLS_pets37_4shot | 分类 | 少样本 | 包含 Pets37 数据集中的 37 个类别，每类 4 张图像。 |\n| LISA 数据集 | 定位 | - | 推理性定位 |\n> 🔔 如果您想基于自己的数据构建数据集，可以参考 `dataset\u002Fbuild_dataset.ipynb`。只需提供一个包含 `image`、`problem` 和 `solution` 的 `json` 文件即可。\n\n**数据集格式**\n🔦 我们支持 **HuggingFace Dataset** 格式和 **JSON** 文件格式作为训练输入数据集。\n\n有关 **HuggingFace Dataset** 格式的示例，请参阅 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT\u002Fblob\u002Fmain\u002Fsrc\u002Fvirft\u002Fsrc\u002Fopen_r1\u002Fgrpo.py\">grpo.py\u003C\u002Fa>。\n\n有关 **JSON** 格式的示例，请参阅 \u003Ca 
href=\"https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT\u002Fblob\u002Fmain\u002FVisual-ARFT\u002Fsrc\u002Fvisual_arft\u002Fsrc\u002Fopen_r1\u002Fgrpo_agent_search.py\">grpo_agent_search.py\u003C\u002Fa>。\n\n### GRPO\n下载数据集后，您可以使用以下示例 Bash 脚本来开始训练。我们的 Bash 脚本位于 ```\u002Fsrc\u002Fscripts``` 目录下。\n> 🔔 不需要长时间训练。对于只有几百个样本的数据集，200 步就足够了。\n```\n# 不需要长时间训练。对于只有几百个样本的数据集，200 步就足够了。\nexport DEBUG_MODE=\"true\"\nexport LOG_PATH=\".\u002Fdebug_log_2b_GRPO_coco_base65cate_6k.txt\"\n\nexport DATA_PATH=.\u002Fshare_data\u002FViRFT_COCO_base65   ### 您从 Huggingface 下载的本地数据集\nexport CKPT_PATH=.\u002Fshare_models\u002FQwen2-VL-2B-Instruct    ### Qwen2-VL-2B 检查点路径\nexport SAVE_PATH=.\u002Fshare_models\u002FQwen2-VL-2B-Instruct_GRPO_coco_base65cate_6k    ### 保存路径\n\ntorchrun --nproc_per_node=\"8\" \\\n    --nnodes=\"1\" \\\n    --node_rank=\"0\" \\\n    --master_addr=\"127.0.0.1\" \\\n    --master_port=\"12345\" \\\n    src\u002Fopen_r1\u002Fgrpo.py \\\n    --output_dir ${SAVE_PATH}  \\\n    --model_name_or_path ${CKPT_PATH} \\\n    --dataset_name ${DATA_PATH} \\\n    --deepspeed local_scripts\u002Fzero3.json \\\n    --max_prompt_length 1024 \\\n    --per_device_train_batch_size 1 \\\n    --gradient_accumulation_steps 2 \\\n    --logging_steps 1 \\\n    --bf16 \\\n    --report_to wandb \\\n    --gradient_checkpointing false \\\n    --attn_implementation flash_attention_2 \\\n    --max_pixels 401408 \\\n    --num_train_epochs 1 \\\n    --run_name Qwen2-VL-2B_GRPO_coco_base65cate_6k \\\n    --save_steps 100 \\\n    --save_only_model true \\\n    --num_generations 8\n```\n\n### OOM 技巧 \n⏰ 在训练过程中遇到 OOM（内存不足）问题非常常见，尤其是在使用显存有限的 GPU 时。\n\n🔦 不过不用担心——这里有一些有用的 **OOM 技巧**：\n\n1. **关于分布式训练：** 您可以通过指定 `--deepspeed` 参数来缓解内存压力，例如 `--deepspeed \u002Fsrc\u002Fvisual_arft\u002Flocal_scripts\u002Fzero3.json`。如果内存仍然不足，还可以进一步减轻负载：`--deepspeed \u002Fsrc\u002Fvisual_arft\u002Flocal_scripts\u002Fzero3_offload.json`。\n\n2. 
**关于 GRPO 中每组的生成数量：** 您可以通过降低 `--num_generations` 参数来减少 GPU 内存占用。在示例脚本中，默认值是 `--num_generations 8`，但您可以尝试将其设置为 4 以节省内存。不过请注意，较小的 `--num_generations` 可能会导致性能下降。\n\n3. **关于梯度检查点：** 此外，将 `--gradient_checkpointing` 设置为 `true` 可以节省内存，从而允许更高的 `--num_generations` 上限，进而提升训练效果。然而，这会减慢训练速度。\n\n4. **关于图像分辨率：** 如果您仍然遇到 OOM 问题，也可以降低训练数据集中图像的分辨率！\n\n### SFT\n我们使用 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory\">LLaMa-Factory\u003C\u002Fa> 对模型进行监督微调（SFT）。您可以将下载的数据集转换为相应的 Qwen SFT 格式来进行训练。\n\n## 评估\n我们在多种视觉感知任务上进行了大量实验，包括 **细粒度图像分类**、**开放词汇目标检测**、**少样本目标检测** 和 **推理性定位**。**ViRFT** 在这些任务中仅用少量数据和计算资源就实现了显著的性能提升，远超监督微调基线。\n\n> 我们提供了使用评估代码的分步教程。如果您遇到任何问题，请随时提交 issue。\n\n### COCO 评估\n您可以使用 ```coco_evaluation``` 目录中的文件进行模型推理并获得评估结果。我们的代码支持多 GPU 评估，至少需要两块 GPU。\n\n对于 ***推理***：\n```\ncd .\u002Fcoco_evaluation\npython Qwen2_VL_coco_infere.py\n```\n请注意，```Qwen2_VL_coco_infere.py``` 中的一些文件路径和模型路径需要修改。\n```\n### 第 167–168 行，需根据您的模型路径和模型基础进行更改。\nmodel_path = \".\u002Fshare_models\u002FQwen2-VL-2B-Instruct_RL\u002F\"  # RL 模型\nmodel_base = \".\u002Fshare_models\u002FQwen2-VL-2B-Instruct\u002F\"  # 原始 Qwen2-VL 模型\n### 第 182 行，需根据您的 COCO 验证集标注路径进行更改。\nwith open('.\u002Fdata\u002Fcoco\u002Fannotations\u002Finstances_val2017.json', 'r') as json_file:\n\n### 第224行，根据你自己的图像路径进行修改。\nimage_path = '.\u002Fdata\u002Fcoco\u002Fval2017\u002F'+image['file_name']    \n### 第231-241行，选择你想要评估的类别\nselected_cate = ['bus', 'train', 'fire hydrant', 'stop sign', 'cat', 'dog', 'bed', 'toilet']\n### 第350行，结果保存路径\nwith open(f'prediction_results.json', 'w') as json_file:\n```\n推理结果将以`JSON`格式保存，随后用于评估。\n\n对于***评估***，只需逐步运行```.\u002Fcoco_evaluation\u002Fevaluation.ipynb```即可。\n\n### LVIS评估\n你可以使用```lvis_evaluation```目录中的文件进行模型推理，并获得评估结果。我们的代码支持多GPU评估，至少需要两块GPU。\n\n对于***推理***：\n```\ncd .\u002Flvis_evaluation\npython Qwen2_VL_lvis_infere.py\n```\n请注意，```Qwen2_VL_lvis_infere.py```中的一些文件路径和模型路径需要修改。\n```\n### 第169-170行，修改为你的模型路径和模型基础\nmodel_path = \".\u002Fshare_models\u002FQwen2-VL-2B-Instruct_RL\u002F\"  # RL模型\nmodel_base = 
\".\u002Fshare_models\u002FQwen2-VL-2B-Instruct\u002F\"  # 原始Qwen2-VL模型\n### 第184行，修改为你自己的LVIS验证标注路径\nwith open('.\u002Fdata\u002Flvis\u002Fannotations\u002Flvis_v1_val.json', 'r') as json_file:\n### 第228行，根据你自己的图像路径进行修改。\nimage_path = '.\u002Fdata\u002Flvis\u002F' + \"\u002F\".join(parts[-2:])   \n### 第234-242行，选择你想要评估的类别\nselected_cate = ['horse_buggy', 'die', 'kitchen_table', 'omelet', 'papaya', 'stepladder']\n### 第346行，结果保存路径\nwith open(f'prediction_results.json', 'w') as json_file:\n```\n推理结果将以`JSON`格式保存，随后用于评估。\n\n对于***评估***，只需逐步运行```.\u002Flvis_evaluation\u002Flvis_evaluation.ipynb```即可。\n\n### 分类评估\n你可以使用```classification```目录中的文件进行模型推理，并获得评估结果。我们的代码支持多GPU评估，至少需要两块GPU。\n```\ncd .\u002Fclassification\npython Qwen2_VL_classification_infere.py\n```\n请注意，```Qwen2_VL_classification_infere.py```中的模型路径需要修改。\n```\n### 第61-63行，修改为你的模型路径和模型基础\nmodel_path = \".\u002Fshare_models\u002FQwen2-VL-2B-Instruct_RL\u002F\"  # 经过RL训练后的模型\nmodel_base = \".\u002Fshare_models\u002FQwen2-VL-2B-Instruct\u002F\"  # 原始Qwen2-VL模型\n```\n推理和结果计算是同时进行的。程序运行结束后，命令行会显示分类正确的数量，准确率则通过将其除以验证集长度得到。（Flower102：2463，Pets37：3669，stanford cars：8041，fgvc-aircraft：3333）\n\n> 🔔 有时，由于环境问题，当`use_cache = None`时，模型可能会产生错误的推理结果。你可以考虑显式地设置`use_cache = True`。\n> `generated_ids = model.generate(**inputs, max_new_tokens=1024, use_cache=True)`\n\n### 评估结果\n*我们进行了**大量实验**；更多细节请参阅我们的论文*。\n\n\n### 案例研究\n在下图中，我们展示了来自**ViRFT**的一些推理示例。我们可以看到，通过**ViRFT**，思维过程显著提升了推理和定位能力。借助**ViRFT**，Qwen2-VL学会了批判性思考，并仔细检查图像以生成准确的定位结果。\n\u003Ca href=\"\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_readme_b9461758a5ad.png\" alt=\"Logo\" >\n\u003C\u002Fa>\n我们还展示了一些模型在处理*细粒度分类任务*时的推理案例。这些结果表明了**ViRFT**在各种视觉任务中强大的泛化能力。\n\u003Ca href=\"\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_readme_6559078afdeb.png\" alt=\"Logo\" >\n\u003C\u002Fa>\n\n\n\n## ✒️引用\n```\n@article{liu2025visual,\n  title={Visual-RFT: Visual Reinforcement Fine-Tuning},\n  author={Liu, Ziyu and Sun, Zeyi and Zang, Yuhang and Dong, Xiaoyi and Cao, Yuhang and Duan, Haodong and Lin, Dahua and Wang, Jiaqi},\n  
journal={arXiv preprint arXiv:2503.01785},\n  year={2025}\n}\n@misc{liu2025visualagenticreinforcementfinetuning,\n      title={Visual Agentic Reinforcement Fine-Tuning}, \n      author={Ziyu Liu and Yuhang Zang and Yushan Zou and Zijian Liang and Xiaoyi Dong and Yuhang Cao and Haodong Duan and Dahua Lin and Jiaqi Wang},\n      year={2025},\n      eprint={2505.14246},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14246}, \n}\n```\n\n## 📄许可\n![代码许可](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-Apache_2.0-green.svg) ![数据许可](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FData%20License-CC%20By%20NC%204.0-red.svg) **使用与许可声明**：数据和代码仅用于研究目的，并受相应许可约束。\n许可：署名-非商业性使用4.0国际版。应遵守OpenAI的政策：https:\u002F\u002Fopenai.com\u002Fpolicies\u002Fterms-of-use\n\n## 致谢\n我们衷心感谢\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FDeep-Agent\u002FR1-V\">R1-V\u003C\u002Fa>、\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fopen-r1\">Open-R1\u003C\u002Fa>以及\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Fopen-r1-multimodal\">Open-r1-multimodal\u003C\u002Fa>项目提供的开源资源。","# Visual-RFT 快速上手指南\n\nVisual-RFT (Visual Reinforcement Fine-Tuning) 是首个将 DeepSeek-R1 的强化学习策略全面适配到多模态领域的开源项目。它基于 Qwen2-VL 模型，通过设计基于规则的**可验证奖励（Verifiable Reward）**并结合 **GRPO** 算法，显著提升了大型视觉语言模型（LVLM）在开放词汇检测、少样本检测、推理定位及细粒度图像分类等任务上的表现。\n\n## 1. 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+)\n*   **Python 版本**: 3.10\n*   **GPU**: 支持 CUDA 的 NVIDIA 显卡（建议使用显存 24GB+ 的显卡进行训练，如 A10\u002FA100\u002FRTX 3090\u002F4090）\n*   **依赖框架**: PyTorch, Transformers, DeepSpeed, FlashAttention-2\n\n> **💡 国内加速建议**\n> 鉴于网络环境，建议在安装依赖和下载模型时使用国内镜像源：\n> *   **PyPI 镜像**: `pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple \u003Cpackage>`\n> *   **HuggingFace 镜像**: 设置环境变量 `export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com`\n\n## 2. 安装步骤\n\n克隆仓库并配置 Conda 环境：\n\n```bash\n# 1. 克隆项目代码\ngit clone https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT.git\ncd Visual-RFT\n\n# 2. 
创建并激活 Python 3.10 虚拟环境\nconda create -n Visual-RFT python=3.10\nconda activate Visual-RFT\n\n# 3. 运行安装脚本（自动安装所需依赖）\nbash setup.sh\n```\n\n## 3. 基本使用\n\n### 3.1 数据准备\n\n您可以直接使用官方提供的 HuggingFace 数据集，或使用自定义的 JSON 格式数据。\n\n*   **官方数据集**: 访问 [HuggingFace Collections](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flaolao77\u002Fvirft-datasets-67bc271b6f2833eccc0651df) 下载对应任务的数据集（如 `laolao77\u002FViRFT_COCO_base65`）。\n*   **自定义数据**: 准备一个包含 `image`, `problem`, `solution` 字段的 JSON 文件。参考 `dataset\u002Fbuild_dataset.ipynb` 进行构建。\n\n### 3.2 开始训练 (GRPO)\n\nVisual-RFT 强调高效微调，对于几百条样本的数据集，通常只需训练 200 步即可见效。\n\n以下是一个基于 COCO 基础类别的检测任务训练示例。请根据您的实际路径修改 `DATA_PATH` (数据集路径) 和 `CKPT_PATH` (基座模型路径)。\n\n```bash\n# 设置环境变量\nexport DEBUG_MODE=\"true\"\nexport LOG_PATH=\".\u002Fdebug_log_2b_GRPO_coco_base65cate_6k.txt\"\n\n# ⚠️ 请替换为您的本地数据集路径 (从 HuggingFace 下载后)\nexport DATA_PATH=.\u002Fshare_data\u002FViRFT_COCO_base65   \n\n# ⚠️ 请替换为您的 Qwen2-VL-2B-Instruct 模型检查点路径\nexport CKPT_PATH=.\u002Fshare_models\u002FQwen2-VL-2B-Instruct    \n\n# 设置模型保存路径\nexport SAVE_PATH=.\u002Fshare_models\u002FQwen2-VL-2B-Instruct_GRPO_coco_base65cate_6k\n\n# 启动训练 (示例使用 8 张卡)\ntorchrun --nproc_per_node=\"8\" \\\n    --nnodes=\"1\" \\\n    --node_rank=\"0\" \\\n    --master_addr=\"127.0.0.1\" \\\n    --master_port=\"12345\" \\\n    src\u002Fopen_r1\u002Fgrpo.py \\\n    --output_dir ${SAVE_PATH}  \\\n    --model_name_or_path ${CKPT_PATH} \\\n    --dataset_name ${DATA_PATH} \\\n    --deepspeed local_scripts\u002Fzero3.json \\\n    --max_prompt_length 1024 \\\n    --per_device_train_batch_size 1 \\\n    --gradient_accumulation_steps 2 \\\n    --logging_steps 1 \\\n    --bf16 \\\n    --report_to wandb \\\n    --gradient_checkpointing false \\\n    --attn_implementation flash_attention_2 \\\n    --max_pixels 401408 \\\n    --num_train_epochs 1 \\\n    --run_name Qwen2-VL-2B_GRPO_coco_base65cate_6k \\\n    --save_steps 100 \\\n    --save_only_model true \\\n    --num_generations 8\n```\n\n### 3.3 显存优化 (OOM 
解决方案)\n\n如果在训练过程中遇到显存不足 (OOM) 问题，可尝试以下调整：\n\n1.  **启用 DeepSpeed Offload**: 将 `--deepspeed` 参数改为 `local_scripts\u002Fzero3_offload.json`，将部分状态卸载到 CPU。\n2.  **减少生成数量**: 降低 `--num_generations` 参数（例如从 `8` 改为 `4`），这会减少每组采样的数量从而节省显存，但可能轻微影响效果。\n3.  **开启梯度检查点**: 添加或设置 `--gradient_checkpointing true`。\n\n### 3.4 推理测试\n\n项目已提供在 LISA 数据集上训练的模型，可用于评估**推理定位 (Reasoning Grounding)** 能力。\n\n*   **模型地址**: [Zery\u002FQwen2-VL-7B_visual_rft_lisa_IoU_reward](https:\u002F\u002Fhuggingface.co\u002FZery\u002FQwen2-VL-7B_visual_rft_lisa_IoU_reward)\n*   **使用方法**: 请参考项目根目录下的 `demo` 文件夹获取具体的推理代码示例。","某电商平台的自动化运营团队正致力于构建一个智能系统，用于从海量商品图中自动识别违规细节（如商标侵权、违禁品）并生成合规报告。\n\n### 没有 Visual-RFT 时\n- **推理逻辑薄弱**：基础多模态模型面对复杂场景（如遮挡或模糊的违规标志）时，往往直接猜测答案，缺乏逐步推导过程，导致误判率高。\n- **细粒度感知不足**：在处理“开放词汇检测”或“少样本检测”任务时，模型难以精准定位未见过的新类型违规物体，容易漏检。\n- **规则遵循性差**：模型输出的格式经常不统一，难以通过自动化脚本直接验证结果，需要大量人工二次校对。\n- **泛化能力受限**：一旦商品背景或拍摄角度发生微小变化，模型的性能便大幅下滑，无法适应多样化的实拍图。\n\n### 使用 Visual-RFT 后\n- **强化推理链条**：Visual-RFT 引入了类似 Deepseek-R1 的强化学习策略，迫使模型在输出结论前进行显式的逻辑推演，显著提升了复杂场景下的判断准确率。\n- **精准视觉定位**：基于规则的可验证奖励机制，让模型在细粒度图像分类和推理定位任务上表现卓越，能敏锐捕捉细微的违规特征。\n- **输出严格合规**：模型学会了严格遵守预设的规则格式输出，使得检测结果可直接被下游系统解析，实现了全流程自动化。\n- **鲁棒性大幅增强**：经过 GRPO 框架的微调，模型在面对不同光照、角度及未知类别的商品图时，依然保持稳定的高水准识别能力。\n\nVisual-RFT 通过将先进的强化学习策略引入多模态领域，成功将通用视觉语言模型转化为具备深度推理能力和高精度感知水平的专业行业助手。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FLiuziyu77_Visual-RFT_b1e75079.png","Liuziyu77","Ziyu Liu","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FLiuziyu77_3cb6db39.jpg",null,"Shanghai AI Lab","Shanghai","https:\u002F\u002Fgithub.com\u002FLiuziyu77",[84,88,92,96],{"name":85,"color":86,"percentage":87},"Jupyter Notebook","#DA5B0B",80.6,{"name":89,"color":90,"percentage":91},"Python","#3572A5",18.4,{"name":93,"color":94,"percentage":95},"Shell","#89e051",0.9,{"name":97,"color":98,"percentage":99},"Makefile","#427819",0,2305,106,"2026-04-02T00:43:52","Apache-2.0","Linux","必需 NVIDIA GPU，支持 Flash Attention 2；显存需求取决于模型大小（2B\u002F7B）及 --num_generations 参数，建议使用多卡分布式训练（示例为 8 
卡），显存不足时需开启 DeepSpeed ZeRO-3 Offload","未说明",{"notes":108,"python":109,"dependencies":110},"1. 推荐使用 conda 创建名为 'Visual-RFT' 的虚拟环境。2. 训练基于 Qwen2-VL 或 Qwen2.5-VL 模型。3. 必须安装 flash_attention_2 以启用注意力优化。4. 遇到显存溢出 (OOM) 时，可通过降低 --num_generations 参数、开启 --gradient_checkpointing 或使用 deepspeed zero3_offload 策略缓解。5. 对于几百条样本的数据集，通常仅需训练 200 步。","3.10",[111,112,113,114,115,116],"torch","transformers","deepspeed","flash-attn","accelerate","wandb",[26,14,13,15,54],"2026-03-27T02:49:30.150509","2026-04-06T08:35:16.153031",[121,126,131,136,141,145],{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},14630,"使用 LoRA 微调时遇到 'None of the inputs have requires_grad=True' 和 'element 0 of tensors does not require grad' 错误，如何解决？","该错误通常由显存不足或配置冲突引起。解决方案包括：\n1. 调整图像尺寸以降低显存需求（例如将图像 resize 到 448*448）：\n```python\nresized_image = image.resize((448, 448), resample=Image.LANCZOS)\n```\n2. 仅对语言模块进行 LoRA 微调。\n3. 在训练参数中设置 `use_cache=False`（因为 `use_cache=True` 与 gradient checkpointing 不兼容）。\n4. 检查是否因参数更新过少导致 Loss 为 0，若出现此情况可能需要改用全量微调或调整学习率。","https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT\u002Fissues\u002F147",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},14631,"如何转换从 Hugging Face 下载的 Parquet 格式训练数据？显存不足时如何优化训练配置？","虽然具体转换脚本未在评论中直接给出，但针对显存不足的问题，可以通过 DeepSpeed ZeRO-3 优化并将优化器和参数卸载到 CPU 来解决。配置如下：\n```json\n\"zero_optimization\": {\n    \"stage\": 3, \n    \"offload_optimizer\": {\n        \"device\": \"cpu\", \n        \"pin_memory\": true\n    }, \n    \"offload_param\": {\n        \"device\": \"cpu\", \n        \"pin_memory\": true\n    }, \n    \"overlap_comm\": true, \n    \"contiguous_gradients\": true\n}\n```\n此外，如果 `num_generations` 设置过小导致 GRPO 效果差，可以尝试使用 8Bit 优化器来节省显存。","https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT\u002Fissues\u002F7",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},14632,"无法复现论文中的 Table 3 结果，模型输出为空或评估指标极低，原因是什么？","无法复现结果的主要原因可能包括：\n1. 推理过程中选择的类别（selected categories）配置错误，需检查评估脚本中的类别列表是否与论文一致。\n2. 
强化学习（RL）和 Few-shot 设置对随机种子非常敏感。建议记录并提供实验的随机种子，并报告结果的均值和标准差以确保可复现性。\n3. 确认使用的基线模型版本是否正确（如 Qwen2-VL-2B-Instruct）。","https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT\u002Fissues\u002F117",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},14633,"运行 grpo_classification 微调时遇到 'RuntimeError: split_with_sizes expects split_sizes to sum exactly to 1' 错误怎么办？","该错误通常由像素值张量维度处理不当引起。解决方法是修改代码中关于 `pixel_values` 的处理逻辑：\n1. 注释掉强制增加维度的代码：\n```python\n# prompt_inputs['pixel_values'] = prompt_inputs['pixel_values'][None]\n```\n2. 恢复或确保 `pixel_values` 根据 `num_generations` 正确重复：\n```python\npixel_values = prompt_inputs[\"pixel_values\"].repeat(self.num_generations, 1)\n```\n此修改已在 Aircraft 数据集的 4-shot 细粒度分类任务中验证有效。","https:\u002F\u002Fgithub.com\u002FLiuziyu77\u002FVisual-RFT\u002Fissues\u002F17",{"id":142,"question_zh":143,"answer_zh":144,"source_url":125},14634,"LoRA 微调后 Loss 一直为 0 且奖励没有更新，模型是否学到了东西？","这种情况很可能是因为 LoRA 微调更新的参数量太少，导致模型未能有效学习。社区反馈表明，如果在 LoRA 模式下出现 Loss 恒为 0 且规则奖励极低且不更新的现象，建议尝试切换为全量微调（Full Fine-tuning），或者检查学习率和训练步数设置是否合理。",{"id":146,"question_zh":147,"answer_zh":148,"source_url":130},14635,"在显存有限的设备上训练 Visual-RFT 有哪些具体的资源优化策略？","针对显存受限的设备，可以采取以下综合策略：\n1. **图像预处理**：将输入图像 Resize 到较小尺寸（如 448x448）。\n2. **微调方式**：仅对语言模块部分使用 LoRA，冻结视觉编码器。\n3. **DeepSpeed 配置**：启用 ZeRO Stage 3 并将优化器和参数卸载（Offload）到 CPU。\n4. **批次设置**：减小 `per_device_train_batch_size` 并增加 `gradient_accumulation_steps`。\n5. **精度设置**：使用 `fp16` 或 `bf16` 混合精度训练。\n6. **生成数量**：适当调整 `num_generations`，若显存极度紧张可配合 8Bit 优化器使用。",[]]