[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-yifan123--flow_grpo":3,"tool-yifan123--flow_grpo":61},[4,18,26,36,44,52],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",141543,2,"2026-04-06T11:32:54",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 
架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":53,"name":54,"github_repo":55,"description_zh":56,"stars":57,"difficulty_score":10,"last_commit_at":58,"category_tags":59,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,60],"视频",{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":77,"owner_website":79,"owner_url":80,"languages":81,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":94,"env_os":95,"env_gpu":96,"env_ram":97,"env_deps":98,"category_tags":111,"github_topics":77,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":112,"updated_at":113,"faqs":114,"releases":143},4556,"yifan123\u002Fflow_grpo","flow_grpo","[NeurIPS 2025] An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL","flow_grpo 是一个专为生成式 AI 设计的开源训练框架，旨在通过在线强化学习（Online RL）优化流匹配（Flow Matching）模型。它主要解决了传统扩散或流模型在生成图像时难以精准控制细节（如复杂物体计数、文字渲染准确性）以及难以对齐人类审美偏好的难题。\n\n该工具特别适合 AI 研究人员和开发者使用，尤其是那些希望微调 SD3.5、FLUX.1、Qwen-Image 或 Wan2.1 等主流模型，以提升特定任务表现的技术团队。普通设计师虽不直接参与训练，但可借助其产出的高质量模型获得更可控的生成效果。\n\nflow_grpo 的核心亮点在于高效的训练策略：它支持“无分类器引导（No-CFG）”训练，利用强化学习过程自然实现蒸馏效果；引入了\"Flow-GRPO-Fast\"加速机制，仅需部分步骤即可完成训练；并采用了“系数保持采样（CPS）”技术，显著提升了生成样本的质量与评估得分。此外，项目还集成了 GRPO-Guard 安全机制及多种奖励模型（如 GenEval、PickScore），为社区提供了从训练到部署的完整解决方案。","\u003Ch1 align=\"center\"> Flow-GRPO:\u003Cbr>Training Flow Matching Models via Online RL \u003C\u002Fh1>\n\u003Cdiv align=\"center\">\n  \u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05470'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArXiv-red?logo=arxiv'>\u003C\u002Fa>  &nbsp;\n  \u003Ca href='https:\u002F\u002Fgongyeliu.github.io\u002FFlow-GRPO\u002F'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisualization-green?logo=github'>\u003C\u002Fa> &nbsp;\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode-9E95B7?logo=github\">\u003C\u002Fa> &nbsp; \n  \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fjieliu\u002Fsd35m-flowgrpo-68298ec27a27af64b0654120'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel-blue?logo=huggingface'>\u003C\u002Fa> &nbsp; \n  \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fjieliu\u002FSD3.5-M-Flow-GRPO'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-blue?logo=huggingface'>\u003C\u002Fa> &nbsp;\n\u003C\u002Fdiv>\n\n## Changelog\n\u003Cdetails 
open>\n\n\u003Csummary>\u003Cstrong>2025-11-04\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n* Adding **GRPO-Guard** 🔥🔥.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Update History\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n**2025-11-04**\n* Adding support for [Bagel-7B](https:\u002F\u002Fhuggingface.co\u002FByteDance-Seed\u002FBAGEL-7B-MoT).\n\n**2025-10-14**\n\n* Refactor FlowGRPO-Fast for compatibility with FlowGRPO, add CPS sampling and No-CFG training on SD3.\n\n**2025-08-15**\n\n* Adding support for **Qwen-Image** and **Qwen-Image-Edit**.\n\n**2025-08-15**\n\n* Thanks [Jing Wang](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=Q9Np_KQAAAAJ&hl=zh-CN) for adding **Wan2.1**. Training command\n```bash\naccelerate launch --config_file scripts\u002Faccelerate_configs\u002Fmulti_gpu.yaml --num_processes=1 --main_process_port 29503 scripts\u002Ftrain_wan2_1.py --config config\u002Fgrpo.py:general_ocr_wan2_1\n```\n\n**2025-08-14**\n\n* Adding reward curve of Flow-GRPO-Fast vs. Flow-GRPO. In Pickscore reward, Flow-GRPO-Fast is comparable to Flow-GRPO with only 2 steps training.\n\n\n**2025-08-04**\n\n* Adding support for **FLUX.1-Kontext-dev**. For the counting task, we use Geneval reward to detect object counts and CLIP feature similarity to ensure consistency between the original and edited images. This implementation offers a runnable pipeline, but the training set contains only 800 samples. Making Flow-GRPO truly effective for editing tasks still requires further exploration by the community.\n\n\n**2025-07-31**\n\n- Adding Flow-GRPO-Fast.\n\n**2025-07-28**\n\n- Adding support for **FLUX.1-dev**.\n- Adding support for CLIPScore as reward model.\n- Introducing `config.sample.same_latent` to control whether the same noise is reused for identical prompts, addressing [Issue #7](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fissues\u002F7).\n\n**2025-05-15** \n\n- 🔥We showcase image examples from three tasks and their training evolution at https:\u002F\u002Fgongyeliu.github.io\u002FFlow-GRPO. Check them out!\n- 🔥We now provide an online demo for all three tasks at https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fjieliu\u002FSD3.5-M-Flow-GRPO. You're welcome to try it out!\n\u003C\u002Fdetails>\n\n## 🤗 Model\n| Task    | Model |\n| -------- | -------- |\n| GenEval     | [🤗GenEval](https:\u002F\u002Fhuggingface.co\u002Fjieliu\u002FSD3.5M-FlowGRPO-GenEval) |\n| Text Rendering     | [🤗Text](https:\u002F\u002Fhuggingface.co\u002Fjieliu\u002FSD3.5M-FlowGRPO-Text) |\n| Human Preference Alignment     | [🤗PickScore](https:\u002F\u002Fhuggingface.co\u002Fjieliu\u002FSD3.5M-FlowGRPO-PickScore) |\n\n## Training Speed\n\nTo improve training efficiency, we provide a better set of parameters for Flow-GRPO.\nWe found the following adjustments significantly accelerate training:\n\n* No CFG during training or testing — the RL process effectively performs **CFG distillation**.\n* Use the window mechanism from **Flow-GRPO-Fast** or **[MixGRPO](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2507.21802)** — only train on partial steps.\n* Adopt **[Coefficients-Preserving Sampling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.05952) (CPS)** — CPS provides a notable improvement on GenEval, and produces higher-quality samples. 
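A typical setting is `noise_level = 0.8`, which works well without tuning for different models or step counts.\n\nSketched as config switches, these three speed-ups might look as follows (a loose illustration: `sde_window_size`\u002F`sde_window_range` are the knobs documented for the Qwen-Image implementation, while the exact flag names for disabling CFG and for the CPS noise level may differ in `config\u002Fgrpo.py`):\n\n```python\n# Hypothetical excerpt of a speed-oriented sample config (names illustrative)\nconfig.sample.guidance_scale = 1.0       # no CFG during training or testing\nconfig.sample.sde_window_size = 2        # Flow-GRPO-Fast: SDE on only 2 steps\nconfig.sample.sde_window_range = (0, 8)  # where the SDE window may be placed\nconfig.sample.noise_level = 0.8          # CPS noise level\n```\n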
\nThe figure below shows the test-set performance curves using GenEval and PickScore as rewards, where both training and evaluation are performed **without CFG**. The experiments are configured with [**geneval_sd3_fast_nocfg**](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fconfig\u002Fgrpo.py#L163) and [**pickscore_sd3_fast_nocfg**](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fconfig\u002Fgrpo.py#L323), using scripts from `scripts\u002Fmulti_node\u002Fsd3_fast`.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fflow_grpo_fast_nocfg_geneval.svg\" alt=\"Flow-GRPO-Fast Illustration\" width=\"350\"\u002F>\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fflow_grpo_fast_nocfg_pickscore.svg\" alt=\"Flow-GRPO-Fast Illustration\" width=\"350\"\u002F> \n\u003C\u002Fp>\n\n## 🛡️ Over-optimization (GRPO-Guard) 🔥🔥\n\nTo mitigate implicit over-optimization in flow matching, our team proposes [GRPO-Guard](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.22319) ([🔥Project Page](https:\u002F\u002Fjingw193.github.io\u002FGRPO-Guard\u002F)).\n\nWe first observe that the importance ratio exhibits an inherent bias:\n\n1. Its mean is consistently **below 1**, and the gap is especially pronounced at low-noise steps (e.g., step 8 in SD3.5-M).\n\n2. The variance varies notably across different steps.\n\nIdeally, the importance ratio distribution should have a mean of 1 and stable variance. The clipping operation truncates overly confident positive or negative samples outside the region [1−ϵ,1+ϵ], ensuring stable gradient updates. However, the observed bias in the importance ratio disrupts this mechanism: gradients of positive samples are no longer properly constrained, **leading the policy model into over-optimization**. As a result, the proxy score continues to rise while the gold score declines, causing a severe degradation in image quality.\n\nThe biased ratio distributions are summarized in the table below.\n\n| FlowGRPO | GRPO-Guard |\n| - | - |\n| ![flow_grpo ratio](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_8bec6b13298e.gif) | ![grpo_guard ratio](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_3bb17b38f4b7.gif) |\n| The clipping mechanism is imbalanced, failing to constrain overconfident positive samples. | The ratio distribution is re-centered, so clipping properly constrains overconfident positive samples. |\n\nTo address this issue, [GRPO-Guard](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.22319) introduces two mechanisms that effectively alleviate over-optimization:\n\n- **RatioNorm**: Corrects the distributional bias of importance ratios and unifies their statistics across denoising steps.\n\n- **Gradient Reweight**: Further reweights the gradients of different denoising steps based on RatioNorm, balancing their contributions and preventing excessive optimization under specific noise levels.\n
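\nA minimal sketch of the RatioNorm idea (an illustration of per-step ratio standardization, not the paper's exact estimator; see the GRPO-Guard paper for the precise formulation):\n\n```python\nimport torch\n\ndef rationorm(log_ratio, step_ids, eps=1e-6):\n    # log_ratio: (N,) log importance ratios collected across denoising steps\n    # step_ids:  (N,) denoising-step index of each sample\n    # Standardize log-ratios within each step, so that after exponentiation\n    # the ratio distribution is re-centered near 1 with comparable spread.\n    normed = torch.empty_like(log_ratio)\n    for s in step_ids.unique():\n        m = step_ids == s\n        mu = log_ratio[m].mean().detach()        # per-step statistics are\n        std = log_ratio[m].std().detach() + eps  # constants w.r.t. the policy\n        normed[m] = (log_ratio[m] - mu) \u002F std\n    return normed.exp()\n```\n\nThe following figure compares over-optimization between GRPO-Guard and FlowGRPO on text rendering tasks. 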
GRPO-Guard maintains the same rising trend in proxy scores as FlowGRPO while preventing rapid declines in gold scores, thus preserving high image quality and diversity.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_aada1412bc2c.png\" alt=\"GRPO-Guard Illustration\" width=\"900\"\u002F>\n\u003C\u002Fp>\n\n**Start Training**\n\nAfter downloading the base model and setting up the reward model, run the following script to start GRPO-Guard training for the SD3.5-M text rendering task.\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fsd3_grpo_guard.sh 0\n# Other nodes\nbash scripts\u002Fmulti_node\u002Fsd3_grpo_guard.sh 1\n```\n\n## Flow-GRPO-Fast\nWe propose Flow-GRPO-Fast, an accelerated variant of Flow-GRPO that requires training on **only one or two denoising steps** per trajectory. For each prompt, we first generate a deterministic trajectory using ODE sampling. At a randomly chosen intermediate step, we inject noise and switch to SDE sampling to generate a group. The rest of the process continues with ODE sampling. This confines stochasticity to one or two steps, allowing training to focus solely on those steps. This few-step training idea was primarily proposed by [Ziyang Yuan](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=fWxWEzsAAAAJ&hl=en) during our discussions in early June.\n\nFlow-GRPO-Fast achieves significant efficiency gains:\n\n- Each trajectory is trained only once or twice, significantly reducing the training cost.\n\n- Sampling before branching requires only a single prompt without group expansion, further speeding up data collection.\n\nExperiments on PickScore show that Flow-GRPO-Fast matches the reward performance of Flow-GRPO while offering faster training speed. The x-axis in the figure represents training epochs. Flow-GRPO-Fast with 2 training steps per iteration performs better than Flow-GRPO, while Flow-GRPO-Fast with only 1 training step per iteration performs slightly worse. In both cases, compared to Flow-GRPO’s 10 training steps per iteration, the training process is significantly faster.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_45c0f4c88563.png\" alt=\"Flow-GRPO-Fast Illustration\" width=\"450\"\u002F>\n\u003C\u002Fp>\n\nPlease use the scripts in `scripts\u002Fmulti_node\u002Fsd3_fast` to run these experiments.\n\n## 🚀 Quick Start\n### 1. Environment Setup\nClone this repository and install the package.\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo.git\ncd flow_grpo\nconda create -n flow_grpo python=3.10.16\nconda activate flow_grpo\npip install -e .\n```\n\n### 2. Model Download\nTo avoid redundant downloads and wasted storage during multi-GPU training, please pre-download the required models.\n\n**Models**\n* **SD3.5**: `stabilityai\u002Fstable-diffusion-3.5-medium`\n* **Flux**: `black-forest-labs\u002FFLUX.1-dev`\n\n**Reward Models**\n* **PickScore**:\n  * `laion\u002FCLIP-ViT-H-14-laion2B-s32B-b79K`\n  * `yuvalkirstain\u002FPickScore_v1`\n* **CLIPScore**: `openai\u002Fclip-vit-large-patch14`\n* **Aesthetic Score**: `openai\u002Fclip-vit-large-patch14`\n\n### 3. Reward Preparation\nThe steps above only install the current repository. Since each reward model may rely on different dependency versions, combining them in one Conda environment can cause version conflicts. To avoid this, we adopt a remote server setup inspired by ddpo-pytorch.
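\n\nThe pattern is simple: the trainer serializes decoded images, sends them to a reward process running in its own environment, and receives scalar rewards back. A rough sketch (the endpoint path and payload format below are illustrative, not the actual reward-server API):\n\n```python\nimport base64\nimport io\n\nimport requests\n\ndef remote_reward(pil_images, prompts, url='http:\u002F\u002F127.0.0.1:18085\u002Fscore'):\n    # Serialize PIL images to PNG and query a (hypothetical) scoring endpoint,\n    # so the reward model's dependencies never enter the training environment.\n    payload = {'prompts': list(prompts), 'images': []}\n    for img in pil_images:\n        buf = io.BytesIO()\n        img.save(buf, format='PNG')\n        payload['images'].append(base64.b64encode(buf.getvalue()).decode())\n    resp = requests.post(url, json=payload, timeout=120)\n    resp.raise_for_status()\n    return resp.json()['rewards']  # one float per image\n```\n\n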
You only need to install the specific reward model you plan to use.\n\n#### GenEval\nPlease create a new Conda virtual environment and install the corresponding dependencies according to the instructions in [reward-server](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Freward-server).\n\n#### OCR\nPlease install paddle-ocr:\n```bash\npip install paddlepaddle-gpu==2.6.2\npip install paddleocr==2.9.1\npip install python-Levenshtein\n```\nThen, pre-download the model using the Python command line:\n```python\nfrom paddleocr import PaddleOCR\nocr = PaddleOCR(use_angle_cls=False, lang=\"en\", use_gpu=False, show_log=False)\n```\n\n#### Pickscore\nPickScore requires no additional installation. Note that the original [pickscore](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyuvalkirstain\u002Fpickapic_v1) dataset corresponds to `dataset\u002Fpickscore` in this repository, containing some NSFW prompts. We strongly recommend using [pickapic\\_v1\\_no\\_images\\_training\\_sfw](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FCarperAI\u002Fpickapic_v1_no_images_training_sfw), the SFW version of the Pick-a-Pic dataset, which corresponds to `dataset\u002Fpickscore_sfw` in this repository.\n\n#### DeQA\nPlease create a new Conda virtual environment and install the corresponding dependencies according to the instructions in [reward-server](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Freward-server).\n\n#### UnifiedReward\nSince `sglang` may conflict with other environments, we recommend creating a new conda environment.\n```bash\nconda create -n sglang python=3.10.16\nconda activate sglang\npip install \"sglang[all]\"\n```\nWe use sglang to deploy the reward service. After installing sglang, please run the following command to launch UnifiedReward:\n```bash\npython -m sglang.launch_server --model-path CodeGoat24\u002FUnifiedReward-7b-v1.5 --api-key flowgrpo --port 17140 --chat-template chatml-llava --enable-p2p-check --mem-fraction-static 0.85\n```\n#### ImageReward\nPlease install imagereward:\n```bash\npip install image-reward\npip install git+https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP.git\n```\n\n### 4. Start Training\n\n#### GRPO\n\n\n**Single-node training**\n\n```bash\n# sd3\nbash scripts\u002Fsingle_node\u002Fgrpo.sh\n# flux\nbash scripts\u002Fsingle_node\u002Fgrpo_flux.sh\n```\n\n---\n\n\u003Cdetails> \u003Csummary>Multi-node training for SD3:\u003C\u002Fsummary>\n\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fsd3.sh 0\n# Other nodes\nbash scripts\u002Fmulti_node\u002Fsd3.sh 1\nbash scripts\u002Fmulti_node\u002Fsd3.sh 2\nbash scripts\u002Fmulti_node\u002Fsd3.sh 3\n```\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Multi-node training for FLUX.1-dev\u003C\u002Fsummary>\n\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fflux.sh 0\n# Other node\nbash scripts\u002Fmulti_node\u002Fflux.sh 1\nbash scripts\u002Fmulti_node\u002Fflux.sh 2\nbash scripts\u002Fmulti_node\u002Fflux.sh 3\n```\nFor Flow-GRPO-Fast, please use `scripts\u002Fmulti_node\u002Fflux_fast.sh`. 
See the W&B logs for [Geneval](https:\u002F\u002Fapi.wandb.ai\u002Flinks\u002Fljie\u002Fqz47q208) (with `geneval_flux_fast` in the config) and [PickScore](https:\u002F\u002Fapi.wandb.ai\u002Flinks\u002Fljie\u002Fncdwa0wo) (with `pickscore_flux_fast` in the config).\n\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Multi-node training for FLUX.1-Kontext-dev\u003C\u002Fsummary>\n\nPlease first download [generated\\_images.zip](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjieliu\u002Fcounting_edit\u002Fblob\u002Fmain\u002Fgenerated_images.zip) and extract it into the `counting_edit` directory. You can also use the scripts in the `counting_edit` directory to generate the data yourself.\n\nPlease install `diffusers` from the main branch to support `FLUX.1-Kontext-dev`:\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers.git\n```\nAfter upgrading Diffusers, some packages such as PEFT may also need to be upgraded. If you encounter any errors, please upgrade them according to the error messages.\nThen, run the scripts:\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fflux_kontext.sh 0\n# Other nodes\nbash scripts\u002Fmulti_node\u002Fflux_kontext.sh 1\nbash scripts\u002Fmulti_node\u002Fflux_kontext.sh 2\nbash scripts\u002Fmulti_node\u002Fflux_kontext.sh 3\n```\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Multi-node training for Qwen-Image:\u003C\u002Fsummary>\n\nIn the implementation of Qwen-Image, we have unified Flow-GRPO and Flow-GRPO-Fast. You can control the size of the SDE window with `config.sample.sde_window_size`, and adjust the position of the window with `config.sample.sde_window_range`.\n\nPlease install `diffusers` from the main branch to support `Qwen-Image`:\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers.git\n```\nThen run the scripts:\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fqwenimage.sh 0\n# Other nodes\nbash scripts\u002Fmulti_node\u002Fqwenimage.sh 1\nbash scripts\u002Fmulti_node\u002Fqwenimage.sh 2\nbash scripts\u002Fmulti_node\u002Fqwenimage.sh 3\n```\nUsing the provided configuration, the resulting reward curve of Qwen-Image on the test set is shown below.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_64ec91c4c45c.png\" alt=\"Qwen-Image reward curve\" width=\"350\"\u002F>\n\u003C\u002Fp>\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Multi-node training for Qwen-Image-Edit:\u003C\u002Fsummary>\n\nAs with Flux Kontext, please first download [generated\\_images.zip](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjieliu\u002Fcounting_edit\u002Fblob\u002Fmain\u002Fgenerated_images.zip) and extract it into the `counting_edit` directory. 
You can also use the scripts in the `counting_edit` directory to generate the data yourself.\n\nPlease install `diffusers` from the main branch to support `Qwen-Image-Edit`:\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers.git\n```\nThen run the scripts:\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fqwenimage_edit.sh 0\n# Other nodes\nbash scripts\u002Fmulti_node\u002Fqwenimage_edit.sh 1\nbash scripts\u002Fmulti_node\u002Fqwenimage_edit.sh 2\nbash scripts\u002Fmulti_node\u002Fqwenimage_edit.sh 3\n```\n\nUsing the provided configuration, the resulting reward curve of Qwen-Image-Edit on the test set is shown below.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_fa2798b3d684.png\" alt=\"Qwen-Image-Edit reward curve\" width=\"350\"\u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_5a729ce143ea.png\" alt=\"Qwen-Image-Edit reward curve\" width=\"350\"\u002F> \n\u003C\u002Fp>\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Multi-node training for Bagel:\u003C\u002Fsummary>\n\nPlease first upgrade `transformers` to **version >= 4.44.0** and install `flash-attn`:\n```bash\npip install transformers==4.44.0\npip install flash-attn==2.7.4.post1 --no-build-isolation\n```\n\nThen run the scripts:\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fbagel\u002Fmain.sh 0\n# Other nodes\nbash scripts\u002Fmulti_node\u002Fbagel\u002Fmain.sh 1\nbash scripts\u002Fmulti_node\u002Fbagel\u002Fmain.sh 2\nbash scripts\u002Fmulti_node\u002Fbagel\u002Fmain.sh 3\n```\n\nUsing the provided configuration, the resulting reward (PickScore) curve of Bagel on the test set is shown below (with 32 GPUs).\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fbagel_pickscore.svg\" alt=\"Bagel PickScore reward curve\" width=\"350\"\u002F>\n\u003C\u002Fp>\n\n**[Note]: About resource requirements & OOM**\n\nThe default training script adopts full-parameter mode, which requires at least **8 × 80GB GPUs**. If you encounter OOM issues, you can switch to LoRA training with the config provided in `config\u002Fgrpo.py:pickscore_bagel_lora`.\n\n---\n\u003C\u002Fdetails>\n\n\n#### DPO \u002F OnlineDPO \u002F SFT \u002F OnlineSFT\nSingle-node training:\n```bash\nbash scripts\u002Fsingle_node\u002Fdpo.sh\nbash scripts\u002Fsingle_node\u002Fsft.sh\n```\nMulti-node training:\n\nPlease update the entry Python script and config file names in the `scripts\u002Fmulti_node` bash file.\n\n\n## FAQ\n\n* Please use **fp16** for training whenever possible, as it provides higher precision than bf16, resulting in smaller log-probability errors between data collection and training. For Flux and Wan, because fp16 inference cannot produce valid images or videos, you will have to use **bf16** for training. Note that log-probability errors tend to be smaller at high-noise steps and larger at low-noise steps. Training only on high-noise steps yields better results in this case. Thanks to [Jing Wang](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=Q9Np_KQAAAAJ&hl=zh-CN) for these observations.\n\n* When using **Flow-GRPO-Fast**, set a relatively small `clip_range`, otherwise training may crash.\n\n* When implementing a new model, please check whether using different batch sizes leads to slight differences in the output. 
SD3 has this issue, which is why I ensure that the batch size for training is the same as that used for data collection.\n\n\n## How to Support Other Models\n\nTo integrate a new model into this framework, please follow the steps below:\n\n**1. Add the following files adapted for your model:**\n\n* `flow_grpo\u002Fdiffusers_patch\u002Fsd3_pipeline_with_logprob.py`:\n  This file is adapted from [pipeline\\_stable\\_diffusion\\_3.py](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\u002Fblob\u002Fmain\u002Fsrc\u002Fdiffusers\u002Fpipelines\u002Fstable_diffusion_3\u002Fpipeline_stable_diffusion_3.py). You can refer to diffusers for your model.\n\n* `scripts\u002Ftrain_sd3.py`:\n  This script is based on [train\\_dreambooth\\_lora\\_sd3.py](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\u002Fblob\u002Fmain\u002Fexamples\u002Fdreambooth\u002Ftrain_dreambooth_lora_sd3.py) from the DreamBooth examples.\n\n* `flow_grpo\u002Fdiffusers_patch\u002Fsd3_sde_with_logprob.py`:\n  This file handles SDE sampling. In most cases, you don't need to modify it. However, if your definitions of `dt` or `velocity` differ in sign or convention, please adjust accordingly.\n\n**2. Verify SDE sampling:**\nSet `noise_level = 0` in [sde\\_demo.py](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Ftree\u002Fmain\u002Fscripts\u002Fdemo\u002Fsd3_sde_demo.py) to check whether the generated images look normal. This helps verify that your SDE implementation is correct.\n\n**3. Ensure on-policy consistency:**\nSet [`config.sample.num_batches_per_epoch = 1`](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fconfig\u002Fgrpo.py#L120) and [`config.train.gradient_accumulation_steps = 1`](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fconfig\u002Fgrpo.py#L125C5-L125C47) to enforce a purely on-policy setup, where the model collecting samples is identical to the one being trained.\nUnder this setting, the [ratio](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fscripts\u002Ftrain_sd3.py#L886) should remain exactly 1. If it's not, please check whether the sampling and training code paths differ—for example, through use of `torch.compile` or other model wrappers—and make sure both share the same logic.\n\n**4. Tune reward behavior:**\nStart with `config.train.beta = 0` to observe if the reward increases during training. You may also need to adjust the noise level [here](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fflow_grpo\u002Fdiffusers_patch\u002Fsd3_sde_with_logprob.py#L47) based on your model. 
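\n\nFor the on-policy check in step 3, a minimal sketch of what to assert (variable names are illustrative; in this repo the ratio is computed in `scripts\u002Ftrain_sd3.py`):\n\n```python\nimport torch\n\ndef check_on_policy(new_logprobs, old_logprobs, atol=1e-5):\n    # In a purely on-policy setup the sampling model and the training model\n    # are identical, so the importance ratio exp(new - old) must be 1.\n    ratio = torch.exp(new_logprobs - old_logprobs)\n    if not torch.allclose(ratio, torch.ones_like(ratio), atol=atol):\n        raise RuntimeError('ratio deviates from 1: sampling and training paths differ')\n    return ratio\n```\n\n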
Other hyperparameters are generally model-agnostic and can be kept as default.\n\n## 🏁 Multi Reward Training\nFor multi-reward settings, you can pass in a dictionary where each key is a reward name and the corresponding value is its weight.\nFor example:\n\n```python\n{\n    \"pickscore\": 0.5,\n    \"ocr\": 0.2,\n    \"aesthetic\": 0.3\n}\n```\n\nThis means the final reward is a weighted sum of the individual rewards.\n\nThe following reward models are currently supported:\n* **Geneval** evaluates T2I models on complex compositional prompts.\n* **OCR** provides an OCR-based reward.\n* **PickScore** is a general-purpose T2I reward model trained on human preferences.\n* **[DeQA](https:\u002F\u002Fgithub.com\u002Fzhiyuanyou\u002FDeQA-Score)** is a multimodal LLM-based image quality assessment model that measures the impact of distortions and texture damage on perceived quality.\n* **ImageReward** is a general-purpose T2I reward model capturing text-image alignment, visual fidelity, and safety.\n* **QwenVL** is an experimental reward model using prompt engineering.\n* **Aesthetic** is a CLIP-based linear regressor predicting image aesthetic scores.\n* **JPEG\\_Compressibility** measures image size as a proxy for quality.\n* **UnifiedReward** is a state-of-the-art reward model for multimodal understanding and generation, topping the human preference leaderboard.\n\n## ✨ Important Hyperparameters\nYou can adjust the parameters in `config\u002Fgrpo.py` to tune different hyperparameters. An empirical finding is that `config.sample.train_batch_size * num_gpu \u002F config.sample.num_image_per_prompt * config.sample.num_batches_per_epoch = 48` works well, i.e., `group_number=48` with `group_size=24` (where `group_size` is `config.sample.num_image_per_prompt`).\nAdditionally, set `config.train.gradient_accumulation_steps = config.sample.num_batches_per_epoch \u002F\u002F 2`.\n
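\nA worked instance of this rule, using the 4-GPU configuration recommended in the project's issue discussions (Issue #22):\n\n```python\n# Empirical sizing rule on a 4-GPU node (values from the recommended config)\ntrain_batch_size = 6        # config.sample.train_batch_size (per GPU)\nnum_gpu = 4\nnum_image_per_prompt = 24   # config.sample.num_image_per_prompt (= group_size)\nnum_batches_per_epoch = 48  # config.sample.num_batches_per_epoch\n\ngroup_number = train_batch_size * num_gpu \u002F num_image_per_prompt * num_batches_per_epoch\nassert group_number == 48   # one epoch collects 48 groups of 24 images each\n\ngradient_accumulation_steps = num_batches_per_epoch \u002F\u002F 2  # = 24\n```\n\n## 🤗 Acknowledgement\nThis repo is based on [ddpo-pytorch](https:\u002F\u002Fgithub.com\u002Fkvablack\u002Fddpo-pytorch) and [diffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers). We thank the authors for their valuable contributions to the AIGC community. 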
Special thanks to Kevin Black for the excellent *ddpo-pytorch* repo.\n\n## ⭐Citation\nIf you find Flow-GRPO useful for your research or projects, we would greatly appreciate it if you could cite the following paper:\n```\n@article{liu2025flow,\n  title={Flow-grpo: Training flow matching models via online rl},\n  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Li, Yangguang and Liu, Jiaheng and Wang, Xintao and Wan, Pengfei and Zhang, Di and Ouyang, Wanli},\n  journal={arXiv preprint arXiv:2505.05470},\n  year={2025}\n}\n```\nIf you find GRPO-Guard useful for your research or projects, we would greatly appreciate it if you could cite the following paper:\n```\n@misc{wang2025grpoguardmitigatingimplicitoveroptimization,\n    title={GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping}, \n    author={Jing Wang and Jiajun Liang and Jie Liu and Henglin Liu and Gongye Liu and Jun Zheng and Wanyuan Pang and Ao Ma and Zhenyu Xie and Xintao Wang and Meng Wang and Pengfei Wan and Xiaodan Liang},\n    year={2025},\n    eprint={2510.22319},\n    archivePrefix={arXiv},\n    primaryClass={cs.CV},\n    url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.22319}, \n}\n```\nIf you find Flow-DPO useful for your research or projects, we would greatly appreciate it if you could cite the following paper:\n```\n@article{liu2025improving,\n  title={Improving video generation with human feedback},\n  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Yuan, Ziyang and Liu, Xiaokun and Zheng, Mingwu and Wu, Xiele and Wang, Qiulin and Qin, Wenyu and Xia, Menghan and others},\n  journal={arXiv preprint arXiv:2501.13918},\n  year={2025}\n}\n```\n","\u003Ch1 align=\"center\"> Flow-GRPO：\u003Cbr>通过在线强化学习训练流匹配模型 \u003C\u002Fh1>\n\u003Cdiv align=\"center\">\n  \u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05470'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArXiv-red?logo=arxiv'>\u003C\u002Fa>  &nbsp;\n  \u003Ca href='https:\u002F\u002Fgongyeliu.github.io\u002FFlow-GRPO\u002F'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisualization-green?logo=github'>\u003C\u002Fa> &nbsp;\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode-9E95B7?logo=github\">\u003C\u002Fa> &nbsp; \n  \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fjieliu\u002Fsd35m-flowgrpo-68298ec27a27af64b0654120'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel-blue?logo=huggingface'>\u003C\u002Fa> &nbsp; \n  \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fjieliu\u002FSD3.5-M-Flow-GRPO'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-blue?logo=huggingface'>\u003C\u002Fa> &nbsp;\n\u003C\u002Fdiv>\n\n## 更改记录\n\u003Cdetails open>\n\n\u003Csummary>\u003Cstrong>2025-11-04\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n* 增加了 **GRPO-Guard** 🔥🔥。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>更新历史\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n**2025-11-04**\n* 增加了对 [Bagel-7B](https:\u002F\u002Fhuggingface.co\u002FByteDance-Seed\u002FBAGEL-7B-MoT) 的支持。\n\n**2025-10-14**\n\n* 重构了 FlowGRPO-Fast，使其与 FlowGRPO 兼容，并在 SD3 上添加了 CPS 采样和无 CFG 训练。\n\n**2025-08-15**\n\n* 增加了对 **Qwen-Image** 和 **Qwen-Image-Edit** 的支持。\n\n**2025-08-15**\n\n* 感谢 [Jing Wang](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=Q9Np_KQAAAAJ&hl=zh-CN) 添加了 **Wan2.1**。训练命令如下：\n```bash\naccelerate launch --config_file 
scripts\u002Faccelerate_configs\u002Fmulti_gpu.yaml --num_processes=1 --main_process_port 29503 scripts\u002Ftrain_wan2_1.py --config config\u002Fgrpo.py:general_ocr_wan2_1\n```\n\n**2025-08-14**\n\n* 增加了 Flow-GRPO-Fast 与 Flow-GRPO 的奖励曲线对比。在 Pickscore 奖励下，仅需两步训练，Flow-GRPO-Fast 的表现即可与 Flow-GRPO 相媲美。\n\n\n**2025-08-04**\n\n* 增加了对 **FLUX.1-Kontext-dev** 的支持。针对计数任务，我们使用 Geneval 奖励来检测物体数量，并利用 CLIP 特征相似性确保原始图像与编辑后图像的一致性。这一实现提供了一个可运行的流水线，但训练集仅有 800 个样本。要使 Flow-GRPO 真正有效地应用于编辑任务，仍需社区进一步探索。\n\n\n**2025-07-31**\n\n* 增加了 Flow-GRPO-Fast。\n\n**2025-07-28**\n\n* 增加了对 **FLUX.1-dev** 的支持。\n* 增加了对 CLIPScore 作为奖励模型的支持。\n* 引入了 `config.sample.same_latent` 参数，用于控制是否对相同提示重复使用同一噪声，从而解决 [Issue #7](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fissues\u002F7) 问题。\n\n**2025-05-15** \n\n* 🔥我们在 https:\u002F\u002Fgongyeliu.github.io\u002FFlow-GRPO 展示了三个任务的图像示例及其训练演变过程，请大家查看！\n* 🔥我们现在也在 https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fjieliu\u002FSD3.5-M-Flow-GRPO 提供了这三个任务的在线演示，欢迎大家试用！\n\u003C\u002Fdetails>\n\n## 🤗 模型\n| 任务    | 模型 |\n| -------- | -------- |\n| GenEval     | [🤗GenEval](https:\u002F\u002Fhuggingface.co\u002Fjieliu\u002FSD3.5M-FlowGRPO-GenEval) |\n| 文本渲染     | [🤗Text](https:\u002F\u002Fhuggingface.co\u002Fjieliu\u002FSD3.5M-FlowGRPO-Text) |\n| 人类偏好对齐     | [🤗PickScore](https:\u002F\u002Fhuggingface.co\u002Fjieliu\u002FSD3.5M-FlowGRPO-PickScore) |\n\n## 训练速度\n\n为提升训练效率，我们为 Flow-GRPO 提供了一组更优的参数设置。以下调整显著加快了训练速度：\n\n* 在训练或测试过程中不使用 CFG — RL 过程实际上起到了 **CFG 蒸馏** 的作用。\n* 使用来自 **Flow-GRPO-Fast** 或 **[MixGRPO](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2507.21802)** 的窗口机制 — 只在部分步骤上进行训练。\n* 采用 **[系数保持采样](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.05952) (CPS)** — CPS 在 GenEval 任务上带来了显著提升，并生成了更高品质的样本。典型的设置是 `noise_level = 0.8`，无需针对不同模型或步数进行调整即可取得良好效果。\n\n下图展示了分别以 GenEval 和 Pickscore 为奖励时的测试集性能曲线，其中训练和评估均未使用 CFG。实验配置分别为 [**geneval_sd3_fast_nocfg**](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fconfig\u002Fgrpo.py#L163) 和 [**pickscore_sd3_fast_nocfg**](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fconfig\u002Fgrpo.py#L323)，使用的脚本来自 `scripts\u002Fmulti_node\u002Fsd3_fast`。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fflow_grpo_fast_nocfg_geneval.svg\" alt=\"Flow-GRPO-Fast 示意图\" width=\"350\"\u002F>\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fflow_grpo_fast_nocfg_pickscore.svg\" alt=\"Flow-GRPO-Fast 示意图\" width=\"350\"\u002F> \n\u003C\u002Fp>\n\n## 🛡️ 过度优化（GRPO-Guard） 🔥🔥\n\n为缓解流匹配中的隐性过度优化问题，我们的团队提出了 [GRPO-Guard](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.22319)（[🔥项目页面](https:\u002F\u002Fjingw193.github.io\u002FGRPO-Guard\u002F)）。\n\n我们首先观察到重要性比率存在固有偏差：\n\n1. 其均值始终 **低于 1**，且在低噪声步骤时尤为显著（例如 SD3.5-M 中的第 8 步）。\n\n2. 
方差在不同步骤间变化明显。\n\n理想情况下，重要性比率的分布应具有均值为 1 且方差稳定的特性。裁剪操作会将过于自信的正样本或负样本截断至区间 [1−ϵ,1+ϵ] 之外，从而确保梯度更新的稳定性。然而，重要性比率的偏差破坏了这一机制——正样本的梯度不再受到适当约束，**导致策略模型陷入过度优化**。结果是代理分数持续上升，而黄金分数却不断下降，最终造成图像质量严重退化。\n\n下表总结了这些有偏的比率分布。\n\n| FlowGRPO | GRPO-Guard |\n| - | - |\n| ![flow_grpo ratio](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_8bec6b13298e.gif) | ![grpo_guard ratio](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_3bb17b38f4b7.gif) |\n| 裁剪机制失衡，无法约束过于自信的正样本。 | 比率分布被重新校正，裁剪机制得以正确约束过于自信的正样本。|\n\n为解决这一问题，[GRPO-Guard](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.22319) 引入了两种有效缓解过度优化的机制：\n\n- **RatioNorm**：修正重要性比率的分布偏差，并统一各去噪步骤的统计特性。\n- **梯度重加权**：基于 RatioNorm 进一步对不同去噪步骤的梯度进行重加权，以平衡它们的贡献，防止在特定噪声水平下出现过度优化。\n\n下图比较了 GRPO-Guard 和 FlowGRPO 在文本渲染任务中的过度优化情况。GRPO-Guard 保持了与 FlowGRPO 相同的代理分数上升趋势，同时避免了黄金分数的快速下降，从而维持了较高的图像质量和多样性。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_aada1412bc2c.png\" alt=\"GRPO-Guard 示意图\" width=\"900\"\u002F>\n\u003C\u002Fp>\n\n**开始训练**\n\n下载基础模型并设置奖励模型后，运行以下脚本即可开始针对 SD3.5-M 文本渲染任务的 GRPO-Guard 训练。\n```bash\n# 主节点\nbash scripts\u002Fmulti_node\u002Fsd3_grpo_guard.sh 0\n# 其他节点\nbash scripts\u002Fmulti_node\u002Fsd3_grpo_guard.sh 1\n```\n\n## Flow-GRPO-Fast\n我们提出了 Flow-GRPO-Fast，它是 Flow-GRPO 的加速版本，每条轨迹只需在 **一到两个去噪步骤** 上进行训练。对于每个提示，我们首先使用 ODE 采样生成一条确定性轨迹。在随机选择的一个中间步骤，我们会注入噪声并切换到 SDE 采样以生成一组样本。随后的流程将继续使用 ODE 采样。这样，随机性就被限制在一到两个步骤内，从而使训练能够集中在这几个步骤上。这一少步训练的想法主要由 [Ziyang Yuan](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=fWxWEzsAAAAJ&hl=en) 在我们六月初的讨论中提出。\n\nFlow-GRPO-Fast 带来了显著的效率提升：\n\n- 每条轨迹仅需训练一到两次，大大降低了训练成本。\n- 分支前的采样只需单个提示，无需扩展分组，进一步加快了数据收集速度。\n\n在 PickScore 上的实验表明，Flow-GRPO-Fast 的奖励性能与 Flow-GRPO 相当，但训练速度更快。图中横轴表示训练轮次。每次迭代训练 2 步的 Flow-GRPO-Fast 表现优于 Flow-GRPO，而每次迭代仅训练 1 步的 Flow-GRPO-Fast 则略逊于 Flow-GRPO。无论哪种情况，与 Flow-GRPO 每次迭代训练 10 步相比，整个训练过程都显著加快。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_45c0f4c88563.png\" alt=\"Flow-GRPO-Fast 示意图\" width=\"450\"\u002F>\n\u003C\u002Fp>\n\n请使用 `scripts\u002Fmulti_node\u002Fsd3_fast` 中的脚本运行这些实验。\n\n## 🚀 快速入门\n### 1. 环境搭建\n克隆本仓库并安装依赖包。\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo.git\ncd flow_grpo\nconda create -n flow_grpo python=3.10.16\nconda activate flow_grpo\npip install -e .\n```\n\n### 2. 模型下载\n为避免多 GPU 训练时重复下载和潜在的存储浪费，请提前下载所需模型。\n\n**模型**\n* **SD3.5**：`stabilityai\u002Fstable-diffusion-3.5-medium`\n* **Flux**：`black-forest-labs\u002FFLUX.1-dev`\n\n**奖励模型**\n* **PickScore**：\n  * `laion\u002FCLIP-ViT-H-14-laion2B-s32B-b79K`\n  * `yuvalkirstain\u002FPickScore_v1`\n* **CLIPScore**：`openai\u002Fclip-vit-large-patch14`\n* **美学评分**：`openai\u002Fclip-vit-large-patch14`\n\n
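以下给出一个预下载脚本的简单示意（假设通过 `huggingface_hub` 下载；该脚本并非仓库自带）：\n\n```python\nfrom huggingface_hub import snapshot_download\n\n# 示意：提前将基础模型与奖励模型下载到本地缓存，\n# 避免多 GPU 训练时各进程重复下载。\nfor repo in [\n    'stabilityai\u002Fstable-diffusion-3.5-medium',\n    'black-forest-labs\u002FFLUX.1-dev',\n    'laion\u002FCLIP-ViT-H-14-laion2B-s32B-b79K',\n    'yuvalkirstain\u002FPickScore_v1',\n    'openai\u002Fclip-vit-large-patch14',\n]:\n    snapshot_download(repo_id=repo)\n```\n\n### 3. 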
奖励模型准备\n上述步骤仅安装了当前仓库中的内容。由于每个奖励模型可能依赖于不同的版本，将它们合并到同一个 Conda 环境中可能会导致版本冲突。为避免这种情况，我们采用了受 ddpo-pytorch 启发的远程服务器设置。你只需安装计划使用的特定奖励模型即可。\n\n#### GenEval\n请创建一个新的 Conda 虚拟环境，并按照 [reward-server](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Freward-server) 中的说明安装相应的依赖项。\n\n#### OCR\n请安装 paddle-ocr：\n```bash\npip install paddlepaddle-gpu==2.6.2\npip install paddleocr==2.9.1\npip install python-Levenshtein\n```\n然后，使用 Python 命令行预下载模型：\n```python\nfrom paddleocr import PaddleOCR\nocr = PaddleOCR(use_angle_cls=False, lang=\"en\", use_gpu=False, show_log=False)\n```\n\n#### Pickscore\nPickScore 无需额外安装。请注意，原始的 [pickscore](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyuvalkirstain\u002Fpickapic_v1) 数据集对应于本仓库中的 `dataset\u002Fpickscore`，其中包含一些不适宜的内容。我们强烈建议使用 [pickapic\\_v1\\_no\\_images\\_training\\_sfw](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FCarperAI\u002Fpickapic_v1_no_images_training_sfw)，即 Pick-a-Pic 数据集的 SFW 版本，它对应于本仓库中的 `dataset\u002Fpickscore_sfw`。\n\n#### DeQA\n请创建一个新的 Conda 虚拟环境，并按照 [reward-server](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Freward-server) 中的说明安装相应的依赖项。\n\n#### UnifiedReward\n由于 `sglang` 可能与其他环境发生冲突，我们建议创建一个新的 conda 环境。\n```bash\nconda create -n sglang python=3.10.16\nconda activate sglang\npip install \"sglang[all]\"\n```\n我们使用 sglang 来部署奖励服务。安装 sglang 后，请运行以下命令启动 UnifiedReward：\n```bash\npython -m sglang.launch_server --model-path CodeGoat24\u002FUnifiedReward-7b-v1.5 --api-key flowgrpo --port 17140 --chat-template chatml-llava --enable-p2p-check --mem-fraction-static 0.85\n```\n\n#### ImageReward\n请安装 imagereward：\n```bash\npip install image-reward\npip install git+https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP.git\n```\n\n### 4. 开始训练\n\n#### GRPO\n\n\n**单节点训练**\n\n```bash\n# sd3\nbash scripts\u002Fsingle_node\u002Fgrpo.sh\n# flux\nbash scripts\u002Fsingle_node\u002Fgrpo_flux.sh\n```\n\n---\n\n\u003Cdetails> \u003Csummary>SD3 的多节点训练：\u003C\u002Fsummary>\n\n```bash\n# 主节点\nbash scripts\u002Fmulti_node\u002Fsd3.sh 0\n# 其他节点\nbash scripts\u002Fmulti_node\u002Fsd3.sh 1\nbash scripts\u002Fmulti_node\u002Fsd3.sh 2\nbash scripts\u002Fmulti_node\u002Fsd3.sh 3\n```\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>FLUX.1-dev 的多节点训练：\u003C\u002Fsummary>\n\n```bash\n# 主节点\nbash scripts\u002Fmulti_node\u002Fflux.sh 0\n# 其他节点\nbash scripts\u002Fmulti_node\u002Fflux.sh 1\nbash scripts\u002Fmulti_node\u002Fflux.sh 2\nbash scripts\u002Fmulti_node\u002Fflux.sh 3\n```\n对于 Flow-GRPO-Fast，请使用 `scripts\u002Fmulti_node\u002Fflux_fast.sh`。有关 [Geneval](https:\u002F\u002Fapi.wandb.ai\u002Flinks\u002Fljie\u002Fqz47q208)（配置中使用 `geneval_flux_fast`）和 [PickScore](https:\u002F\u002Fapi.wandb.ai\u002Flinks\u002Fljie\u002Fncdwa0wo)（配置中使用 `pickscore_flux_fast`）的 W&B 日志，请参阅相关链接。\n\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>FLUX.1-Kontext-dev 的多节点训练：\u003C\u002Fsummary>\n\n请先下载 [generated\\_images.zip](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjieliu\u002Fcounting_edit\u002Fblob\u002Fmain\u002Fgenerated_images.zip) 并将其解压到 `counting_edit` 目录下。你也可以使用 `counting_edit` 目录中的脚本自行生成数据。\n\n请从主分支安装 `diffusers` 以支持 `FLUX.1-Kontext-dev`：\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers.git\n```\n升级 Diffusers 后，某些包如 PEFT 也可能需要升级。如果遇到任何错误，请根据错误信息进行相应升级。\n然后，运行以下脚本：\n```bash\n# 主节点\nbash scripts\u002Fmulti_node\u002Fflux_kontext.sh 0\n# 其他节点\nbash scripts\u002Fmulti_node\u002Fflux_kontext.sh 1\nbash scripts\u002Fmulti_node\u002Fflux_kontext.sh 2\nbash scripts\u002Fmulti_node\u002Fflux_kontext.sh 
3\n```\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Qwen-Image 的多节点训练：\u003C\u002Fsummary>\n\n在 Qwen-Image 的实现中，我们统一了 Flow-GRPO 和 Flow-GRPO-Fast。你可以通过 `config.sample.sde_window_size` 控制 SDE 窗口的大小，并用 `config.sample.sde_window_range` 调整窗口的位置。\n\n请从主分支安装 `diffusers` 以支持 `Qwen-Image`：\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers.git\n```\n然后运行以下脚本：\n```bash\n# 主节点\nbash scripts\u002Fmulti_node\u002Fqwenimage.sh 0\n# 其他节点\nbash scripts\u002Fmulti_node\u002Fqwenimage.sh 1\nbash scripts\u002Fmulti_node\u002Fqwenimage.sh 2\nbash scripts\u002Fmulti_node\u002Fqwenimage.sh 3\n```\n使用提供的配置，Qwen-Image 在测试集上的奖励曲线如下所示。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_64ec91c4c45c.png\" alt=\"Qwen-Image 奖励曲线\" width=\"350\"\u002F>\n\u003C\u002Fp>\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Qwen-Image-Edit 的多节点训练：\u003C\u002Fsummary>\n\n与 Flux Kontext 类似，首先请下载 [generated\\_images.zip](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjieliu\u002Fcounting_edit\u002Fblob\u002Fmain\u002Fgenerated_images.zip) 并将其解压到 `counting_edit` 目录下。你也可以使用 `counting_edit` 目录中的脚本自行生成数据。\n\n请从主分支安装 `diffusers` 以支持 `Qwen-Image-Edit`：\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers.git\n```\n然后运行以下脚本：\n```bash\n# 主节点\nbash scripts\u002Fmulti_node\u002Fqwenimage_edit.sh 0\n# 其他节点\nbash scripts\u002Fmulti_node\u002Fqwenimage_edit.sh 1\nbash scripts\u002Fmulti_node\u002Fqwenimage_edit.sh 2\nbash scripts\u002Fmulti_node\u002Fqwenimage_edit.sh 3\n```\n\n使用提供的配置，Qwen-Image-Edit 在测试集上的奖励曲线如下所示。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_fa2798b3d684.png\" alt=\"Qwen-Image-Edit 奖励曲线\" width=\"350\"\u002F> \n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_readme_5a729ce143ea.png\" alt=\"Qwen-Image-Edit 奖励曲线\" width=\"350\"\u002F> \n\u003C\u002Fp>\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Bagel 的多节点训练：\u003C\u002Fsummary>\n\n请先将 `transformers` 升级到 **版本 >= 4.44.0**，并安装 `flash-attn`：\n```bash\npip install transformers==4.44.0\npip install flash-attn==2.7.4.post1 --no-build-isolation\n```\n\n然后运行以下脚本：\n```bash\n# 主节点\nbash scripts\u002Fmulti_node\u002Fbagel\u002Fmain.sh 0\n# 其他节点\nbash scripts\u002Fmulti_node\u002Fbagel\u002Fmain.sh 1\nbash scripts\u002Fmulti_node\u002Fbagel\u002Fmain.sh 2\nbash scripts\u002Fmulti_node\u002Fbagel\u002Fmain.sh 3\n```\n\n根据提供的配置，Bagel 在测试集上的奖励（PickScore）曲线如下所示（使用 32 张 GPU）。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fbagel_pickscore.svg\" alt=\"Bagel PickScore 奖励曲线\" width=\"350\"\u002F>\n\u003C\u002Fp>\n\n**【注】：关于资源需求与 OOM**\n\n默认的训练脚本采用全参数模式，这至少需要 **8 张 80GB 显存的 GPU**。如果遇到 OOM 问题，可以切换到 LoRA 训练，配置文件位于 `config\u002Fgrpo.py:pickscore_bagel_lora`。\n\n---\n\u003C\u002Fdetails>\n\n\n#### DPO \u002F OnlineDPO \u002F SFT \u002F OnlineSFT\n单节点训练：\n```bash\nbash scripts\u002Fsingle_node\u002Fdpo.sh\nbash scripts\u002Fsingle_node\u002Fsft.sh\n```\n多节点训练：\n\n请在 `scripts\u002Fmulti_node` 的 bash 文件中更新入口 Python 脚本和配置文件名称。\n\n\n## 常见问题解答\n\n* 尽可能使用 **fp16** 进行训练，因为它比 bf16 具有更高的精度，从而减少数据收集与训练之间的对数概率误差。对于 Flux 和 Wan 模型，由于 fp16 推理无法生成有效的图像或视频，因此必须使用 **bf16** 进行训练。需要注意的是，对数概率误差在高噪声步骤时较小，而在低噪声步骤时较大。在这种情况下，仅在高噪声步骤上进行训练会取得更好的效果。感谢 [Jing Wang](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=Q9Np_KQAAAAJ&hl=zh-CN) 提出的这些观察结果。\n\n* 使用 **Flow-GRPO-Fast** 
时，请设置相对较小的 `clip_range`，否则训练可能会崩溃。\n\n* 在实现新模型时，请检查使用不同批次大小是否会导致输出略有差异。SD3 存在这一问题，因此我确保训练时的批次大小与数据收集时的批次大小一致。\n\n\n## 如何支持其他模型\n\n要将新模型集成到该框架中，请按照以下步骤操作：\n\n**1. 添加适用于您模型的以下文件：**\n\n* `flow_grpo\u002Fdiffusers_patch\u002Fsd3_pipeline_with_logprob.py`：\n  此文件改编自 [pipeline_stable_diffusion_3.py](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\u002Fblob\u002Fmain\u002Fsrc\u002Fdiffusers\u002Fpipelines\u002Fstable_diffusion_3\u002Fpipeline_stable_diffusion_3.py)。您可以参考 diffusers 中针对您模型的相关代码。\n\n* `scripts\u002Ftrain_sd3.py`：\n  该脚本基于 DreamBooth 示例中的 [train_dreambooth_lora_sd3.py](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\u002Fblob\u002Fmain\u002Fexamples\u002Fdreambooth\u002Ftrain_dreambooth_lora_sd3.py)。\n\n* `flow_grpo\u002Fdiffusers_patch\u002Fsd3_sde_with_logprob.py`：\n  该文件负责处理 SDE 采样。大多数情况下无需修改此文件。但是，如果您的 `dt` 或 `velocity` 定义在符号或约定上有所不同，请相应调整。\n\n**2. 验证 SDE 采样：**\n在 [sde_demo.py](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Ftree\u002Fmain\u002Fscripts\u002Fdemo\u002Fsd3_sde_demo.py) 中将 `noise_level = 0`，以检查生成的图像是否正常。这有助于验证您的 SDE 实现是否正确。\n\n**3. 确保策略一致性：**\n将 [`config.sample.num_batches_per_epoch = 1`](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fconfig\u002Fgrpo.py#L120) 和 [`config.train.gradient_accumulation_steps = 1`](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fconfig\u002Fgrpo.py#L125C5-L125C47) 设置为 1，以强制执行纯在线策略设置，即采集样本的模型与正在训练的模型完全相同。\n在此设置下，[ratio](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fscripts\u002Ftrain_sd3.py#L886) 应保持精确为 1。如果不是，请检查采样和训练代码路径是否存在差异——例如通过使用 `torch.compile` 或其他模型包装器——并确保两者共享相同的逻辑。\n\n**4. 调整奖励行为：**\n首先将 `config.train.beta = 0`，以观察训练过程中奖励是否增加。您可能还需要根据您的模型调整此处的噪声级别 [here](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fflow_grpo\u002Fdiffusers_patch\u002Fsd3_sde_with_logprob.py#L47)。其他超参数通常与模型无关，可保持默认值。\n\n\n## 🏁 多奖励训练\n对于多奖励设置，您可以传入一个字典，其中每个键是奖励名称，对应的值是其权重。\n例如：\n\n```python\n{\n    \"pickscore\": 0.5,\n    \"ocr\": 0.2,\n    \"aesthetic\": 0.3\n}\n```\n\n这意味着最终奖励是各个奖励的加权总和。\n\n目前支持以下奖励模型：\n* **Geneval** 根据复杂的组合式提示评估 T2I 模型。\n* **OCR** 提供基于 OCR 的奖励。\n* **PickScore** 是一种基于人类偏好的通用 T2I 奖励模型。\n* **[DeQA](https:\u002F\u002Fgithub.com\u002Fzhiyuanyou\u002FDeQA-Score)** 是一种基于多模态 LLM 的图像质量评估模型，用于衡量失真和纹理损伤对感知质量的影响。\n* **ImageReward** 是一种通用的 T2I 奖励模型，能够捕捉文本与图像的对齐程度、视觉保真度以及安全性。\n* **QwenVL** 是一种实验性的奖励模型，采用提示工程方法。\n* **Aesthetic** 是一种基于 CLIP 的线性回归模型，用于预测图像的美学评分。\n* **JPEG_Compressibility** 以图像大小作为质量的代理指标。\n* **UnifiedReward** 是用于多模态理解和生成的最先进奖励模型，位居人类偏好排行榜榜首。\n\n        \n## ✨ 重要超参数\n您可以通过调整 `config\u002Fgrpo.py` 中的参数来优化不同的超参数。经验表明，`config.sample.train_batch_size * num_gpu \u002F config.sample.num_image_per_prompt * config.sample.num_batches_per_epoch = 48`，即 `group_number=48`，`group_size=24`。\n此外，建议将 `config.train.gradient_accumulation_steps = config.sample.num_batches_per_epoch \u002F\u002F 2`。\n\n\n## 🤗 致谢\n本仓库基于 [ddpo-pytorch](https:\u002F\u002Fgithub.com\u002Fkvablack\u002Fddpo-pytorch) 和 [diffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers)。我们感谢作者们为 AIGC 社区所做的宝贵贡献。特别感谢 Kevin Black 提供的优秀 *ddpo-pytorch* 仓库。\n\n## ⭐引用\n如果您在研究或项目中使用了 Flow-GRPO，我们将不胜感激您能引用以下论文：\n```\n@article{liu2025flow,\n  title={Flow-grpo: Training flow matching models via online rl},\n  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Li, Yangguang and Liu, Jiaheng and Wang, Xintao and Wan, Pengfei and Zhang, Di and Ouyang, Wanli},\n  journal={arXiv preprint arXiv:2505.05470},\n  
year={2025}\n}\n```\n如果您在研究或项目中使用了 GRPO-Guard，我们将不胜感激您能引用以下论文：\n```\n@misc{wang2025grpoguardmitigatingimplicitoveroptimization,\n    title={GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping}, \n    author={Jing Wang and Jiajun Liang and Jie Liu and Henglin Liu and Gongye Liu and Jun Zheng and Wanyuan Pang and Ao Ma and Zhenyu Xie and Xintao Wang and Meng Wang and Pengfei Wan and Xiaodan Liang},\n    year={2025},\n    eprint={2510.22319},\n    archivePrefix={arXiv},\n    primaryClass={cs.CV},\n    url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.22319}, \n}\n```\n如果您在研究或项目中使用了 Flow-DPO，我们将不胜感激您能引用以下论文：\n```\n@article{liu2025improving,\n  title={Improving video generation with human feedback},\n  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Yuan, Ziyang and Liu, Xiaokun and Zheng, Mingwu and Wu, Xiele and Wang, Qiulin and Qin, Wenyu and Xia, Menghan and others},\n  journal={arXiv preprint arXiv:2501.13918},\n  year={2025}\n}\n```","# Flow-GRPO 快速上手指南\n\nFlow-GRPO 是一个通过在线强化学习（Online RL）训练流匹配（Flow Matching）模型的开源工具，支持 SD3.5、FLUX.1、Qwen-Image 等主流模型，并提供 GRPO-Guard 防止过优化。\n\n## 1. 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐 Ubuntu 20.04+)\n- **Python**: 3.10.16\n- **GPU**: 支持 CUDA 的 NVIDIA 显卡（多卡训练需配置 NCCL）\n- **存储**: 预留足够空间存放基础模型和奖励模型（建议 50GB+）\n\n### 前置依赖\n确保已安装 `conda` 和 `git`。若在国内网络环境下，建议配置 pip 国内镜像源以加速下载：\n```bash\npip config set global.index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 2. 安装步骤\n\n### 2.1 克隆代码与创建环境\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo.git\ncd flow_grpo\nconda create -n flow_grpo python=3.10.16\nconda activate flow_grpo\npip install -e .\n```\n\n### 2.2 预下载基础模型\n为避免多卡训练时重复下载，请提前手动下载以下模型：\n\n**生成模型**:\n- SD3.5: `stabilityai\u002Fstable-diffusion-3.5-medium`\n- FLUX: `black-forest-labs\u002FFLUX.1-dev`\n\n**奖励模型**:\n- PickScore: `laion\u002FCLIP-ViT-H-14-laion2B-s32B-b79K`, `yuvalkirstain\u002FPickScore_v1`\n- CLIPScore\u002FAesthetic: `openai\u002Fclip-vit-large-patch14`\n\n> **提示**: 可使用 Hugging Face 镜像站或国内加速工具下载模型权重。\n\n### 2.3 配置奖励模型环境\n由于不同奖励模型依赖冲突，建议按需单独配置环境：\n\n**OCR 任务 (PaddleOCR)**:\n```bash\npip install paddlepaddle-gpu==2.6.2\npip install paddleocr==2.9.1\npip install python-Levenshtein\n# 初始化下载 OCR 模型\npython -c \"from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=False, lang='en', use_gpu=False, show_log=False)\"\n```\n\n**UnifiedReward 任务 (SGLang)**:\n```bash\nconda create -n sglang python=3.10.16\nconda activate sglang\npip install \"sglang[all]\"\n# 启动奖励服务\npython -m sglang.launch_server --model-path CodeGoat24\u002FUnifiedReward-7b-v1.5 --api-key flowgrpo --port 17140 --chat-template chatml-llava --enable-p2p-check --mem-fraction-static 0.85\n```\n\n**其他奖励模型 (GenEval\u002FDeQA)**:\n请参考 [reward-server](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Freward-server) 仓库指示单独创建虚拟环境安装。\n\n**ImageReward**:\n```bash\npip install image-reward\npip install git+https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP.git\n```\n\n## 3. 
基本使用\n\n### 3.1 开始训练 (单节点)\n进入项目根目录，根据目标模型运行对应的脚本：\n\n**训练 SD3.5**:\n```bash\nbash scripts\u002Fsingle_node\u002Fgrpo.sh\n```\n\n**训练 FLUX.1**:\n```bash\nbash scripts\u002Fsingle_node\u002Fgrpo_flux.sh\n```\n\n### 3.2 启用高级特性\n- **加速训练 (Flow-GRPO-Fast)**: 仅训练 1-2 个去噪步，大幅降低显存和时间消耗。配置参考 `config\u002Fgrpo.py` 中的 `*_fast_nocfg` 系列配置。\n- **防止过优化 (GRPO-Guard)**: 针对文本渲染等任务，使用 GRPO-Guard 机制保持图像质量。\n  ```bash\n  # 主节点\n  bash scripts\u002Fmulti_node\u002Fsd3_grpo_guard.sh 0\n  # 其他节点\n  bash scripts\u002Fmulti_node\u002Fsd3_grpo_guard.sh 1\n  ```\n\n### 3.3 关键参数建议\n为获得最佳训练效率，建议在配置文件中应用以下策略：\n- **No CFG**: 训练和测试均关闭 Classifier-Free Guidance，RL 过程会自动蒸馏该能力。\n- **CPS Sampling**: 设置 `noise_level = 0.8` 以启用系数保留采样，提升 GenEval 指标。\n- **Window Mechanism**: 仅对部分步骤进行训练（配合 Flow-GRPO-Fast）。","某电商设计团队正利用 SD3.5 模型批量生成带有精准品牌文案的商品海报，但在实际落地中遭遇了严重的“图文不符”与渲染模糊问题。\n\n### 没有 flow_grpo 时\n- **文字渲染不可控**：模型生成的海报中品牌标语经常缺笔少画或拼写错误，需人工反复重绘筛选，效率极低。\n- **对象计数不准**：在生成“三件套装”等特定数量商品时，模型常出现多画或少画物体的幻觉，难以满足严格的商品展示需求。\n- **训练资源浪费**：传统微调方法依赖 Classifier-Free Guidance (CFG) 推理，导致训练和采样速度慢，且难以直接对齐人类审美偏好（如 PickScore）。\n- **编辑一致性差**：在进行局部修图时，修改后的区域与原图风格割裂，缺乏语义连贯性。\n\n### 使用 flow_grpo 后\n- **文本精准呈现**：通过引入在线强化学习，flow_grpo 显著提升了文字渲染能力，生成的品牌文案清晰准确，几乎无需后期修正。\n- **逻辑计数可靠**：利用 GenEval 作为奖励信号，模型能精确控制生成物体的数量，完美还原“三件套装”等复杂指令。\n- **训练高效且免 CFG**：flow_grpo 支持无 CFG 训练与系数保持采样（CPS），在将训练步数大幅缩减的同时，实现了类似蒸馏的效果，推理速度显著提升。\n- **审美与编辑对齐**：基于人类偏好（PickScore）进行优化，生成的图像更符合设计师审美，且在图像编辑任务中能保持极高的前后一致性。\n\nflow_grpo 通过将在线强化学习融入流匹配模型，从根本上解决了生成式 AI 在商业场景中“看得见却用不了”的精度与效率瓶颈。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyifan123_flow_grpo_aada1412.png","yifan123","Jie Liu","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fyifan123_fa1dd3a2.png","Ph.D. student @ MMLab, CUHK","The Chinese University of Hong Kong",null,"jieliu.ljie@gmail.com","jieliu.site","https:\u002F\u002Fgithub.com\u002Fyifan123",[82,86],{"name":83,"color":84,"percentage":85},"Python","#3572A5",99.5,{"name":87,"color":88,"percentage":89},"Shell","#89e051",0.5,2165,146,"2026-04-06T07:26:48","MIT",4,"Linux","必需 NVIDIA GPU（文中提及使用 accelerate 进行多卡训练及 paddlepaddle-gpu），具体型号和显存未说明，需支持 CUDA","未说明",{"notes":99,"python":100,"dependencies":101},"1. 建议使用 conda 创建名为 flow_grpo 的虚拟环境。2. 奖励模型（Reward Models）依赖复杂且版本易冲突，官方建议参考 ddpo-pytorch 模式，为不同奖励模型（如 GenEval, DeQA, UnifiedReward）单独创建 Conda 环境或使用远程服务器部署。3. 多 GPU 训练前需预先下载基础模型（如 SD3.5, FLUX.1）以避免存储浪费。4. 部分功能（如 OCR）需额外安装 PaddleOCR 并预下载模型。5. UnifiedReward 需使用 sglang 部署服务。","3.10.16",[102,103,104,105,106,107,108,109,110],"torch","accelerate","transformers","diffusers","paddlepaddle-gpu==2.6.2","paddleocr==2.9.1","sglang","image-reward","CLIP",[15,14],"2026-03-27T02:49:30.150509","2026-04-07T02:05:26.045826",[115,120,125,130,135,139],{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},20743,"如何复现论文中的结果并加速训练过程？","为了复现结果并加速训练，建议采取以下措施：\n1. **关闭 KL 项**：虽然可能导致奖励黑客（reward hacking），但能显著加速训练。验证新优化算法时可先禁用 KL 项。\n2. **调整超参数**：`config.sample.num_image_per_prompt` 设置为 6 太小，会导致优势估计方差过大，推荐设置为 24。\n3. **推荐配置示例**（4 卡 GPU）：\n   - `config.sample.train_batch_size = 6`\n   - `config.sample.num_image_per_prompt = 24`\n   - `config.sample.num_batches_per_epoch = 48`\n4. **优化 KL 计算**：建议将 `config.train.beta` 从 0.004 改为 0.04，并在计算 KL 项时移除分母中的 `dt`（即将代码中的 `std_dev_t * torch.sqrt(-1 * dt)` 改为 `std_dev_t`），这有助于在奖励快速上升的同时减少图像质量下降。","https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fissues\u002F22",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},20744,"如何评估 GenEval、Pickscore 或图像质量等指标？需要单独的评估脚本吗？","不需要单独的评估脚本。\n1. 
**GenEval**：直接使用 **drawbench** 提示词作为测试提示词即可。由于仓库支持多奖励（multi-reward），可以在训练前运行评估以获取图像质量和偏好分数的所有结果。\n2. **Pickscore 和 OCR**：直接报告测试集准确率，这些指标已在训练期间的 WandB 日志中记录。\n3. **OOD 数据集**：如果需要分布外（OOD）实验的数据集，可以从提供的链接下载（如 `geneval_ood60_20.zip`）。","https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fissues\u002F17",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},20745,"在 Wan2.1 14B 模型上进行 OCR 实验时，8 卡 H20 环境下 Reward 不稳定或效果差的原因是什么？","根据社区实验反馈，主要原因及建议如下：\n1. **模型能力限制**：对于 Wan1.3B 版本，Reward 稳定上升但可视化效果差，原因是模型本身生成能力过弱。\n2. **Batch Size 问题**：对于 14B 版本，使用 FSDP 在 8 卡上训练时 Reward 不稳定，初步判断是因为 **batch_size 过小**。\n3. **解决方案**：尝试增大 `batch_size` 可能复现 1.3B 版本的 Reward 上升过程。算法本身没有问题，需确保配置中的 `num_image_per_prompt` 足够大以减少方差。","https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fissues\u002F103",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},20746,"DPO 训练中 w_diff 不下降但 w_l_diff 下降，这代表什么？如何判断优化方向是否有效？","这种现象可能意味着模型主要通过“远离 lose 图像风格”而非“靠近 win 图像风格”进行优化。\n1. **合理曲线趋势**：有效的优化应保证 `w_diff` 稳定下降（不升高），同时 `l_diff` 不过度升高以致主导训练，或者 `l_diff` 变化不大。\n2. **解决策略**：可以通过降低模型远离 lose 图片获取的训练信号来解决，例如修改损失权重为 `w_l_diff = w_diff - α * l_diff`，并根据训练集实际情况调整 α（α \u003C 1）。\n3. **数据分布建议**：Win-Lose 数据对最好都来自分布内（例如均由原模型生成或混合强模型），如果 Win 来自强模型而 Lose 来自训练模型且差异过大，可能导致 Loss 过快收敛到 0，Acc 接近 100%，但这可能超出 DPO 的有效优化范围。","https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fissues\u002F190",{"id":136,"question_zh":137,"answer_zh":138,"source_url":134},20747,"DPO 训练中 Beta 参数应该如何设置？默认值是否合适？","Beta 参数不宜设置过大。\n1. **问题分析**：官方代码默认 Beta 值为 4000（4k），这会导致参考模型（ref）的约束过大，限制模型学习能力。\n2. **推荐设置**：根据实验经验，Beta 值在 **100 左右** 比较合理，能够平衡约束与学习效果。",{"id":140,"question_zh":141,"answer_zh":142,"source_url":129},20748,"在使用 Flow-GRPO 进行视频迁移或多奖励任务时，有哪些关键配置注意事项？","1. **多奖励任务**：只需使用 drawbench prompts 作为测试 prompts，无需额外脚本即可同时评估图像质量和偏好分数。\n2. **视频迁移**：目前社区反馈在 14B 模型上需注意显存和 Batch Size 的平衡，若 Reward 不稳定请优先尝试增大 `num_image_per_prompt` 和 `train_batch_size`。\n3. **SFT 数据使用**：虽然支持在 RL 训练中使用 SFT 数据防止质量下降（`config.train.sft`），但在某些实验中该选项未被使用，可根据实际需求开启。",[]]