[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-OpenBMB--VisRAG":3,"tool-OpenBMB--VisRAG":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160015,2,"2026-04-18T11:30:52",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":72,"owner_avatar_url":73,"owner_bio":74,"owner_company":75,"owner_location":75,"owner_email":76,"owner_twitter":72,"owner_website":77,"owner_url":78,"languages":79,"stars":106,"forks":107,"last_commit_at":108,"license":109,"difficulty_score":110,"env_os":111,"env_gpu":112,"env_ram":113,"env_deps":114,"category_tags":126,"github_topics":128,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":137,"updated_at":138,"faqs":139,"releases":169},9153,"OpenBMB\u002FVisRAG","VisRAG","Parsing-free RAG supported by VLMs","VisRAG 是一款创新的视觉检索增强生成（RAG）框架，旨在让多模态大模型直接“看懂”文档图像并回答问题。与传统 RAG 必须先通过 OCR 或解析工具将文档转为文本不同，VisRAG 利用视觉语言模型直接将文档作为图像进行嵌入和检索。这一机制有效解决了传统流程中因格式转换导致的信息丢失问题，最大程度保留了原始文档中的图表、版式等关键视觉信息。\n\n其最新升级版 EVisRAG（VisRAG 2.0）进一步引入了“证据引导”的多图像推理能力。它能先对检索到的多张图片进行独立的语言化观察以提取证据，再综合这些线索进行逻辑推理，从而更精准地回答复杂问题。此外，该项目采用了独特的奖励范围化 GRPO 训练策略，通过细粒度的令牌级奖励，同步优化模型的视觉感知与逻辑推理能力。\n\nVisRAG 非常适合 AI 
研究人员、开发者以及需要处理大量扫描文档、复杂报表或图文混合资料的企业用户。对于希望构建高精度文档问答系统，且不愿受限于传统文本解析精度的团队来说，这是一个极具价值的开源选择。目前项目已开放模型权重、代码及完整的评估基准，支持在社区中快速复现与应用。","# VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation\n[![Github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisRAG-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2510.09733-ff0000.svg?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.09733)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2410.10594-ff0000.svg?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10594)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FEvisRAG_7B-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FEVisRAG-7B)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FEvisRAG_3B-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FEVisRAG-3B)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisRAG_Ret-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVisRAG-Ret)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisRAG_Collection-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fopenbmb\u002Fvisrag-6717bbfb471bb018a49f1c69)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisRAG_Pipeline-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ftcy6\u002FVisRAG_Pipeline)\n[![Google 
Colab](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisRAG_Pipeline-ffffff?style=for-the-badge&logo=googlecolab&logoColor=f9ab00)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F11KV9adDNXPfHiuFAfXNOvtYJKcyR8JZH?usp=sharing)\n\n\u003Cp align=\"center\">•\n \u003Ca href=\"#-introduction\"> 📖 Introduction \u003C\u002Fa> •\n \u003Ca href=\"#-news\">🎉 News\u003C\u002Fa> •\n \u003Ca href=\"#-visrag-pipeline\">✨ VisRAG Pipeline\u003C\u002Fa> •\n \u003Ca href=\"#%EF%B8%8F-setup\">⚙️ Setup\u003C\u002Fa> •\n \u003Ca href=\"#%EF%B8%8F-training\">⚡️ Training\u003C\u002Fa> \n\u003C\u002Fp>\n\u003Cp align=\"center\">•\n \u003Ca href=\"#-evaluation\">📃 Evaluation\u003C\u002Fa> •\n \u003Ca href=\"#-usage\">🔧 Usage\u003C\u002Fa> •\n \u003Ca href=\"#-license\">📄 License\u003C\u002Fa> •\n \u003Ca href=\"#-contact\">📧 Contact\u003C\u002Fa> •\n \u003Ca href=\"#-star-history\">📈 Star History\u003C\u002Fa>\n\u003C\u002Fp>\n\n# 📖 Introduction\n**EVisRAG (VisRAG 2.0)** is an evidence-guided Vision Retrieval-augmented Generation framework that equips VLMs for multi-image questions by first linguistically observing retrieved images to collect per-image evidence, then reasoning over those cues to answer. **EVisRAG** trains with Reward-Scoped GRPO, applying fine-grained token-level rewards to jointly optimize visual perception and reasoning.\n\n\u003Cp align=\"center\">\u003Cimg width=800 src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_readme_51e03562499d.png\"\u002F>\u003C\u002Fp>\n\n\n**VisRAG** is a novel vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. 
Compared to traditional text-based RAG, **VisRAG** maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process.\n\n\u003Cp align=\"center\">\u003Cimg width=800 src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_readme_0a9e32e38af2.png\"\u002F>\u003C\u002Fp>\n\n# 🎉 News\n\n* 20251207: Released all [benchmarks](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fopenbmb\u002Fvisrag) on HuggingFace.\n* 20251118: Both EVisRAG and VisRAG can be easily reproduced within [UltraRAG v2](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FUltraRAG).\n* 20251022: We upload all evaluation benchmarks in [VisRAG Collections](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fopenbmb\u002Fvisrag-6717bbfb471bb018a49f1c69) \n* 20251014: Released [EVisRAG-3B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FEVisRAG-3B) on HuggingFace.\n* 20251014: Released **EVisRAG (VisRAG 2.0)**, an end-to-end Vision-Language Model. Released our [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.09733) on arXiv. Released our [Model](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FEVisRAG-7B) on HuggingFace. Released our [Code](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG) on GitHub\n* 20241111: Released our [VisRAG Pipeline](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG\u002Ftree\u002Fmaster\u002Fvisrag_scripts\u002Fdemo\u002Fvisrag_pipeline) on GitHub, now supporting visual understanding across multiple PDF documents.\n* 20241104: Released our [VisRAG Pipeline](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ftcy6\u002FVisRAG_Pipeline) on Hugging Face Space.\n* 20241031: Released our [VisRAG Pipeline](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F11KV9adDNXPfHiuFAfXNOvtYJKcyR8JZH?usp=sharing) on Colab. 
Released codes for converting files to images, which could be found at `visrag_scripts\u002Ffile2img`.\n* 20241015: Released our train data and test data on Hugging Face which can be found in the [VisRAG](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fopenbmb\u002Fvisrag-6717bbfb471bb018a49f1c69) Collection on Hugging Face. It is referenced at the beginning of this page.\n* 20241014: Released our [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10594) on arXiv. Released our [Model](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVisRAG-Ret) on Hugging Face. Released our [Code](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG) on GitHub.\n\n# ✨ VisRAG Pipeline\n## EVisRAG\n\n**EVisRAG** is an end-to-end framework that equips VLMs with precise visual perception during reasoning in multi-image scenarios. We trained and released VLRMs with EVisRAG built on [Qwen2.5-VL-7B-Instruct](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-VL-7B-Instruct), and [Qwen2.5-VL-3B-Instruct](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-VL-3B-Instruct).\n\n## VisRAG-Ret\n\n**VisRAG-Ret** is a document embedding model built on [MiniCPM-V 2.0](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-V-2), a vision-language model that integrates [SigLIP](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) as the vision encoder and [MiniCPM-2B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-sft-bf16) as the language model.\n\n## VisRAG-Gen\n\nIn the paper, we use MiniCPM-V 2.0, MiniCPM-V 2.6, and GPT-4o as the generators. 
Actually, you can use any VLMs you like!\n\n# ⚙️ Setup\n## EVisRAG\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG.git\nconda create --name EVisRAG python==3.10\nconda activate EVisRAG\ncd EVisRAG\npip install -r EVisRAG_requirements.txt\n```\n## VisRAG\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG.git\nconda create --name VisRAG python==3.10.8\nconda activate VisRAG\nconda install nvidia\u002Flabel\u002Fcuda-11.8.0::cuda-toolkit\ncd VisRAG\npip install -r requirements.txt\npip install -e .\ncd timm_modified\npip install -e .\ncd ..\n```\nNote:\n1. `timm_modified` is an enhanced version of the `timm` library that supports gradient checkpointing, which we use in our training process to reduce memory usage.\n\n# ⚡️ Training\n## EVisRAG\n\nTo train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning abilities of VLMs.\n\n\u003Cp align=\"center\">\u003Cimg width=800 src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_readme_19e078f63324.png\"\u002F>\u003C\u002Fp>\n\n***Stage1: SFT*** (based on [LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory))\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory.git \nbash evisrag_scripts\u002Ffull_sft.sh\n```\n\n***Stage2: RS-GRPO*** (based on [Easy-R1](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FEasyR1))\n\n```bash\nbash evisrag_scripts\u002Frun_rsgrpo.sh\n```\n\nNotes:\n\n1. The training data is available on Hugging Face under `EVisRAG-Train`, which is referenced at the beginning of this page.\n2. We adopt a two-stage training strategy. In the first stage, please clone `LLaMA-Factory` and update the model path in the full_sft.sh script. 
In the second stage, we built our customized algorithm `RS-GRPO` based on `Easy-R1`, specifically designed for EVisRAG, whose implementation can be found in `src\u002FRS-GRPO`.\n\n## VisRAG-Ret\n\nOur training dataset of 362,110 Query-Document (Q-D) pairs for **VisRAG-Ret** comprises the train sets of openly available academic datasets (34%) and a synthetic dataset of pages from web-crawled PDF documents augmented with VLM-generated (GPT-4o) pseudo-queries (66%). \n\n```bash\nbash visrag_scripts\u002Ftrain_retriever\u002Ftrain.sh 2048 16 8 0.02 1 true false config\u002Fdeepspeed.json 1e-5 false wmean causal 1 true 2 false \u003Cmodel_dir> \u003Crepo_name_or_path>\n```\nNote:\n1. Our training data can be found in the `VisRAG` collection on Hugging Face, referenced at the beginning of this page. Please note that we have separated the `In-domain-data` and `Synthetic-data` due to their distinct differences. If you wish to train with the complete dataset, you’ll need to merge and shuffle them manually.\n2. The parameters listed above are those used in our paper and can be used to reproduce the results.\n3. `\u003Crepo_name_or_path>` can be any of the following: `openbmb\u002FVisRAG-Ret-Train-In-domain-data`, `openbmb\u002FVisRAG-Ret-Train-Synthetic-data`, the directory path of a repository downloaded from `Hugging Face`, or the directory containing your own training data.\n4. If you wish to train using your own datasets, remove the `--from_hf_repo` line from the `train.sh` script. Additionally, ensure that your dataset directory contains a `metadata.json` file, which must include a `length` field specifying the total number of samples in the dataset.\n5. 
Our training framework is a modified version of [OpenMatch](https:\u002F\u002Fgithub.com\u002FOpenMatch\u002FOpenMatch).\n\n## VisRAG-Gen\n\nThe generation part does not use any fine-tuning; we directly use off-the-shelf LLMs\u002FVLMs for generation.\n\n# 📃 Evaluation\n## EVisRAG\n```bash\nbash evisrag_scripts\u002Fpredict.sh\nbash evisrag_scripts\u002Feval.sh \n```\n\nNotes:\n\n1. The test data is available on Hugging Face under `EVisRAG-Test-xxx`, as referenced at the beginning of this page.\n2. To run evaluation, first execute the `predict.sh` script. The model outputs will be saved in the `preds` directory. Then, use the `eval.sh` script to evaluate the predictions. The metrics `EM`, `Accuracy`, and `F1` will be reported directly.\n\n## VisRAG-Ret\n\n```bash\nbash visrag_scripts\u002Feval_retriever\u002Feval.sh 512 2048 16 8 wmean causal ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA \u003Cckpt_path>\n```\n\nNote: \n1. Our test data can be found in the `VisRAG` Collection on Hugging Face, which is referenced at the beginning of this page.\n2. The parameters listed above are those used in our paper and can be used to reproduce the results.\n3. The evaluation script is configured to use datasets from Hugging Face by default. If you prefer to evaluate using locally downloaded dataset repositories, you can modify the `CORPUS_PATH`, `QUERY_PATH`, `QRELS_PATH` variables in the evaluation script to point to the local repository directory.\n\n## VisRAG-Gen\nThere are three settings in our generation: text-based generation, single-image-VLM-based generation, and multi-image-VLM-based generation. Under single-image-VLM-based generation, there are two additional settings: page concatenation and weighted selection. 
For detailed information about these settings, please refer to our paper.\n```bash\npython visrag_scripts\u002Fgenerate\u002Fgenerate.py \\\n--model_name \u003Cmodel_name> \\\n--model_name_or_path \u003Cmodel_path> \\\n--dataset_name \u003Cdataset_name> \\\n--dataset_name_or_path \u003Cdataset_path> \\\n--rank \u003Cprocess_rank> \\ \n--world_size \u003Cworld_size> \\\n--topk \u003Cnumber of docs retrieved for generation> \\\n--results_root_dir \u003Cretrieval_results_dir> \\\n--task_type \u003Ctask_type> \\\n--concatenate_type \u003Cimage_concatenate_type> \\\n--output_dir \u003Coutput_dir>\n```\nNote:\n1. `use_positive_sample` determines whether to use only the positive document for the query. Enable this to exclude retrieved documents and omit `topk` and `results_root_dir`. If disabled, you must specify `topk` (number of retrieved documents) and organize `results_root_dir` as `results_root_dir\u002Fdataset_name\u002F*.trec`.\n2. `concatenate_type` is only needed when `task_type` is set to `page_concatenation`. Omit this if not required.\n3. Always specify `model_name_or_path`, `dataset_name_or_path`, and `output_dir`.\n4. Use `--openai_api_key` only if GPT-based evaluation is needed.\n\n# 🔧 Usage\n## EVisRAG\n\nModel on Hugging Face: https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FEVisRAG-7B\n\n```python\nfrom transformers import AutoProcessor\nfrom vllm import LLM, SamplingParams\nfrom qwen_vl_utils import process_vision_info\n\ndef evidence_promot_grpo(query):\n    return f\"\"\"You are an AI Visual QA assistant. I will provide you with a question and several images. Please follow the four steps below:\n\nStep 1: Observe the Images\nFirst, analyze the question and consider what types of images may contain relevant information. Then, examine each image one by one, paying special attention to aspects related to the question. 
Identify whether each image contains any potentially relevant information.\nWrap your observations within \u003Cobserve>\u003C\u002Fobserve> tags.\n\nStep 2: Record Evidences from Images\nAfter reviewing all images, record the evidence you find for each image within \u003Cevidence>\u003C\u002Fevidence> tags.\nIf you are certain that an image contains no relevant information, record it as: [i]: no relevant information(where i denotes the index of the image).\nIf an image contains relevant evidence, record it as: [j]: [the evidence you find for the question](where j is the index of the image).\n\nStep 3: Reason Based on the Question and Evidences\nBased on the recorded evidences, reason about the answer to the question.\nInclude your step-by-step reasoning within \u003Cthink>\u003C\u002Fthink> tags.\n\nStep 4: Answer the Question\nProvide your final answer based only on the evidences you found in the images.\nWrap your answer within \u003Canswer>\u003C\u002Fanswer> tags.\nAvoid adding unnecessary contents in your final answer, like if the question is a yes\u002Fno question, simply answer \"yes\" or \"no\".\nIf none of the images contain sufficient information to answer the question, respond with \u003Canswer>insufficient to answer\u003C\u002Fanswer>.\n\nFormatting Requirements:\nUse the exact tags \u003Cobserve>, \u003Cevidence>, \u003Cthink>, and \u003Canswer> for structured output.\nIt is possible that none, one, or several images contain relevant evidence.\nIf you find no evidence or few evidences, and insufficient to help you answer the question, follow the instruction above for insufficient information.\n\nQuestion and images are provided below. 
Please follow the steps as instructed.\nQuestion: {query}\n\"\"\"\n\nmodel_path = \"xxx\"\nprocessor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, padding_side='left')\n\nimgs, query = [\"imgpath1\", \"imgpath2\", ..., \"imgpathX\"], \"What xxx?\"\ninput_prompt = evidence_promot_grpo(query)\n\ncontent = [{\"type\": \"text\", \"text\": input_prompt}]\nfor imgP in imgs:\n    content.append({\n        \"type\": \"image\",\n        \"image\": imgP\n    })\nmsg = [{\n          \"role\": \"user\",\n          \"content\": content,\n      }]\n\nllm = LLM(\n    model=model_path,\n    tensor_parallel_size=1,\n    dtype=\"bfloat16\",\n    limit_mm_per_prompt={\"image\":5, \"video\":0},\n)\n\nsampling_params = SamplingParams(\n    temperature=0.1,\n    repetition_penalty=1.05,\n    max_tokens=2048,\n)\n\nprompt = processor.apply_chat_template(\n    msg,\n    tokenize=False,\n    add_generation_prompt=True,\n)\n\nimage_inputs, _ = process_vision_info(msg)\n\nmsg_input = [{\n    \"prompt\": prompt,\n    \"multi_modal_data\": {\"image\": image_inputs},\n}]\n\noutput_texts = llm.generate(msg_input,\n    sampling_params=sampling_params,\n)\n\nprint(output_texts[0].outputs[0].text)\n```\n\n\n## VisRAG-Ret\n\nModel on Hugging Face: https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVisRAG-Ret\n\n```python\nfrom transformers import AutoModel, AutoTokenizer\nimport torch\nimport torch.nn.functional as F\nfrom PIL import Image\nimport os\n\ndef weighted_mean_pooling(hidden, attention_mask):\n    attention_mask_ = attention_mask * attention_mask.cumsum(dim=1)\n    s = torch.sum(hidden * attention_mask_.unsqueeze(-1).float(), dim=1)\n    d = attention_mask_.sum(dim=1, keepdim=True).float()\n    reps = s \u002F d\n    return reps\n\n@torch.no_grad()\ndef encode(text_or_image_list):\n    \n    if (isinstance(text_or_image_list[0], str)):\n        inputs = {\n            \"text\": text_or_image_list,\n            'image': [None] * len(text_or_image_list),\n            
'tokenizer': tokenizer\n        }\n    else:\n        inputs = {\n            \"text\": [''] * len(text_or_image_list),\n            'image': text_or_image_list,\n            'tokenizer': tokenizer\n        }\n    outputs = model(**inputs)\n    attention_mask = outputs.attention_mask\n    hidden = outputs.last_hidden_state\n\n    reps = weighted_mean_pooling(hidden, attention_mask)   \n    embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()\n    return embeddings\n\nmodel_name_or_path = \"openbmb\u002FVisRAG-Ret\"\ntokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)\nmodel = AutoModel.from_pretrained(model_name_or_path, torch_dtype=torch.bfloat16, trust_remote_code=True)\nmodel.eval()\n\nscript_dir = os.path.dirname(os.path.realpath(__file__))\nqueries = [\"What does a dog look like?\"]\npassages = [\n    Image.open(os.path.join(script_dir, 'test_image\u002Fcat.jpeg')).convert('RGB'),\n    Image.open(os.path.join(script_dir, 'test_image\u002Fdog.jpg')).convert('RGB'),\n]\n\nINSTRUCTION = \"Represent this query for retrieving relevant documents: \"\nqueries = [INSTRUCTION + query for query in queries]\n\nembeddings_query = encode(queries)\nembeddings_doc = encode(passages)\n\nscores = (embeddings_query @ embeddings_doc.T)\nprint(scores.tolist())\n```\n## VisRAG-Gen\nFor `VisRAG-Gen`, you can explore the `VisRAG Pipeline` on Google Colab which includes both `VisRAG-Ret` and `VisRAG-Gen` to try out this simple demonstration.\n\n\n# 📄 License\n\n* The code in this repo is released under the [Apache-2.0](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fblob\u002Fmain\u002FLICENSE) License. \n* The usage of **VisRAG-Ret** model weights must strictly follow [MiniCPM Model License.md](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fblob\u002Fmain\u002FMiniCPM%20Model%20License.md).\n* The models and weights of **VisRAG-Ret** are completely free for academic research. 
After filling out a [\"questionnaire\"](https:\u002F\u002Fmodelbest.feishu.cn\u002Fshare\u002Fbase\u002Fform\u002FshrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, **VisRAG-Ret** weights are also available for free commercial use.\n\n# 📧 Contact\n## EVisRAG\n- Yubo Sun: syb2000417@stu.pku.edu.cn\n- Chunyi Peng: hm.cypeng@gmail.com\n## VisRAG\n- Shi Yu: yus21@mails.tsinghua.edu.cn\n- Chaoyue Tang: tcy006@gmail.com\n\n# 📈 Star History\n\n\u003Ca href=\"https:\u002F\u002Fstar-history.com\u002F#openbmb\u002FVisRAG&Date\">\n \u003Cpicture>\n   \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_readme_c2e4ab0d714d.png&theme=dark\" \u002F>\n   \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_readme_c2e4ab0d714d.png\" \u002F>\n   \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_readme_c2e4ab0d714d.png\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>\n","# VisRAG 2.0：视觉检索增强生成中的证据引导多图像推理\n[![Github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisRAG-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2510.09733-ff0000.svg?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.09733)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2410.10594-ff0000.svg?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10594)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FEvisRAG_7B-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FEVisRAG-7B)\n[![Hugging 
Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FEvisRAG_3B-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FEVisRAG-3B)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisRAG_Ret-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVisRAG-Ret)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisRAG_Collection-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fopenbmb\u002Fvisrag-6717bbfb471bb018a49f1c69)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisRAG_Pipeline-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ftcy6\u002FVisRAG_Pipeline)\n[![Google Colab](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisRAG_Pipeline-ffffff?style=for-the-badge&logo=googlecolab&logoColor=f9ab00)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F11KV9adDNXPfHiuFAfXNOvtYJKcyR8JZH?usp=sharing)\n\n\u003Cp align=\"center\">•\n \u003Ca href=\"#-introduction\"> 📖 Introduction \u003C\u002Fa> •\n \u003Ca href=\"#-news\">🎉 News\u003C\u002Fa> •\n \u003Ca href=\"#-visrag-pipeline\">✨ VisRAG Pipeline\u003C\u002Fa> •\n \u003Ca href=\"#%EF%B8%8F-setup\">⚙️ Setup\u003C\u002Fa> •\n \u003Ca href=\"#%EF%B8%8F-training\">⚡️ Training\u003C\u002Fa> \n\u003C\u002Fp>\n\u003Cp align=\"center\">•\n \u003Ca href=\"#-evaluation\">📃 Evaluation\u003C\u002Fa> •\n \u003Ca href=\"#-usage\">🔧 Usage\u003C\u002Fa> •\n \u003Ca href=\"#-license\">📄 License\u003C\u002Fa> •\n \u003Ca href=\"#-contact\">📧 Contact\u003C\u002Fa> •\n \u003Ca href=\"#-star-history\">📈 Star History\u003C\u002Fa>\n\u003C\u002Fp>\n\n# 📖 Introduction\n**EVisRAG (VisRAG 2.0)** 是一种证据引导的视觉检索增强生成框架，它通过首先对检索到的图像进行语言观察以收集每张图像的证据，然后基于这些线索进行推理来回答问题，从而赋予视觉语言模型（VLM）处理多图像问题的能力。**EVisRAG** 
使用奖励范围内的GRPO进行训练，通过应用细粒度的token级别奖励来联合优化视觉感知和推理能力。\n\n\u003Cp align=\"center\">\u003Cimg width=800 src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_readme_51e03562499d.png\"\u002F>\u003C\u002Fp>\n\n\n**VisRAG** 是一种基于视觉语言模型（VLM）的新型RAG流水线。在该流水线中，不再先解析文档获取文本，而是直接将文档作为图像使用VLM进行嵌入，然后进行检索以增强VLM的生成能力。与传统的基于文本的RAG相比，**VisRAG** 最大化了原始文档中数据信息的保留和利用，消除了解析过程中引入的信息损失。\n\n\u003Cp align=\"center\">\u003Cimg width=800 src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_readme_0a9e32e38af2.png\"\u002F>\u003C\u002Fp>\n\n# 🎉 News\n\n* 20251207: 在HuggingFace上发布了所有[基准测试](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fopenbmb\u002Fvisrag)。\n* 20251118: EVisRAG和VisRAG都可以在[UltraRAG v2](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FUltraRAG)中轻松复现。\n* 20251022: 我们将所有评估基准上传到了[HuggingFace收藏集](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fopenbmb\u002Fvisrag-6717bbfb471bb018a49f1c69)。\n* 20251014: 在HuggingFace上发布了[EVisRAG-3B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FEVisRAG-3B)。\n* 20251014: 发布了**EVisRAG (VisRAG 2.0)**，一个端到端的视觉语言模型。在arXiv上发布了我们的[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.09733)。在HuggingFace上发布了我们的[模型](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FEVisRAG-7B)。在GitHub上发布了我们的[代码](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG)。\n* 20241111: 在GitHub上发布了我们的[VisRAG流水线](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG\u002Ftree\u002Fmaster\u002Fvisrag_scripts\u002Fdemo\u002Fvisrag_pipeline)，现在支持跨多个PDF文档的视觉理解。\n* 20241104: 在Hugging Face Space上发布了我们的[VisRAG流水线](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ftcy6\u002FVisRAG_Pipeline)。\n* 20241031: 在Colab上发布了我们的[VisRAG流水线](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F11KV9adDNXPfHiuFAfXNOvtYJKcyR8JZH?usp=sharing)。发布了用于将文件转换为图像的代码，可在`visrag_scripts\u002Ffile2img`中找到。\n* 20241015: 
在HuggingFace上发布了我们的训练数据和测试数据，可以在HuggingFace上的[VisRAG](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fopenbmb\u002Fvisrag-6717bbfb471bb018a49f1c69)收藏集中找到。本页开头已提及。\n* 20241014: 在arXiv上发布了我们的[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10594)。在HuggingFace上发布了我们的[模型](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVisRAG-Ret)。在GitHub上发布了我们的[代码](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG)。\n\n# ✨ VisRAG Pipeline\n## EVisRAG\n\n**EVisRAG** 是一个端到端的框架，在多图像场景下为VLM提供精确的视觉感知能力。我们基于[Qwen2.5-VL-7B-Instruct](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-VL-7B-Instruct)和[Qwen2.5-VL-3B-Instruct](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-VL-3B-Instruct)训练并发布了内置EVisRAG的VLRM。\n\n## VisRAG-Ret\n\n**VisRAG-Ret** 是一个基于[MiniCPM-V 2.0](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-V-2)的文档嵌入模型，该模型整合了[SigLIP](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384)作为视觉编码器，以及[MiniCPM-2B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-sft-bf16)作为语言模型。\n\n## VisRAG-Gen\n\n在论文中，我们使用MiniCPM-V 2.0、MiniCPM-V 2.6和GPT-4o作为生成器。实际上，你可以使用任何你喜欢的VLM！\n\n# ⚙️ Setup\n## EVisRAG\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG.git\nconda create --name EVisRAG python==3.10\nconda activate EVisRAG\ncd EVisRAG\npip install -r EVisRAG_requirements.txt\n```\n## VisRAG\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG.git\nconda create --name VisRAG python==3.10.8\nconda activate VisRAG\nconda install nvidia\u002Flabel\u002Fcuda-11.8.0::cuda-toolkit\ncd VisRAG\npip install -r requirements.txt\npip install -e .\ncd timm_modified\npip install -e .\ncd ..\n```\n注意：\n1. 
`timm_modified` 是`timm`库的一个增强版本，支持梯度检查点技术，我们在训练过程中使用它来减少内存占用。\n\n# ⚡️ Training\n\n## EVisRAG\n\n为了有效地训练EVisRAG，我们引入了奖励范围分组相对策略优化（RS-GRPO），该方法将细粒度的奖励与特定范围的标记绑定，从而联合优化视觉语言模型的视觉感知和推理能力。\n\n\u003Cp align=\"center\">\u003Cimg width=800 src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_readme_19e078f63324.png\"\u002F>\u003C\u002Fp>\n\n***阶段1：SFT***（基于[LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory)）\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory.git \nbash evisrag_scripts\u002Ffull_sft.sh\n```\n\n***阶段2：RS-GRPO***（基于[Easy-R1](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FEasyR1)）\n\n```bash\nbash evisrag_scripts\u002Frun_rsgrpo.sh\n```\n\n注：\n\n1. 训练数据可在Hugging Face上找到，名称为`EVisRAG-Train`，本页开头已列出。\n2. 我们采用两阶段训练策略。第一阶段，请克隆`LLaMA-Factory`并更新`full_sft.sh`脚本中的模型路径。第二阶段，我们在`Easy-R1`的基础上构建了专为EVisRAG设计的自定义算法`RS-GRPO`，其具体实现可在`src\u002FRS-GRPO`中找到。\n\n## VisRAG-Ret\n\n我们用于**VisRAG-Ret**的362,110对查询-文档（Q-D）训练数据集由公开可用的学术数据集的训练集（34%）以及一个合成数据集组成，该合成数据集由网络爬取的PDF文档页面构成，并辅以VLM生成的伪查询（GPT-4o）（66%）。\n\n```bash\nbash visrag_scripts\u002Ftrain_retriever\u002Ftrain.sh 2048 16 8 0.02 1 true false config\u002Fdeepspeed.json 1e-5 false wmean causal 1 true 2 false \u003Cmodel_dir> \u003Crepo_name_or_path>\n```\n注：\n1. 我们的训练数据可在Hugging Face上的`VisRAG`集合中找到，本页开头已列出。请注意，我们已将“领域内数据”和“合成数据”分开，因为它们存在显著差异。若希望使用完整数据集进行训练，需手动合并并打乱数据。\n2. 上述参数为我们论文中使用的参数，可用于复现实验结果。\n3. `\u003Crepo_name_or_path>`可以是以下任一选项：`openbmb\u002FVisRAG-Ret-Train-In-domain-data`、`openbmb\u002FVisRAG-Ret-Train-Synthetic-data`、从Hugging Face下载的仓库目录路径，或包含您自有训练数据的目录。\n4. 若希望使用您自己的数据集进行训练，请移除`train.sh`脚本中的`--from_hf_repo`一行。此外，确保您的数据集目录中包含一个`metadata.json`文件，其中必须包含一个`length`字段，用于指定数据集中样本的总数。\n5. 
我们的训练框架是在[OpenMatch](https:\u002F\u002Fgithub.com\u002FOpenMatch\u002FOpenMatch)的基础上修改而成。\n\n## VisRAG-Gen\n\n生成部分未进行任何微调，我们直接使用现成的LLM\u002FVLM进行生成。\n\n# 📃 评估\n## EVisRAG\n```bash\nbash evisrag_scripts\u002Fpredict.sh\nbash evisrag_scripts\u002Feval.sh \n```\n\n注：\n\n1. 测试数据可在Hugging Face上找到，名称为`EVisRAG-Test-xxx`，本页开头已列出。\n2. 运行评估时，首先执行`predict.sh`脚本，模型输出将保存在`preds`目录中。然后使用`eval.sh`脚本对预测结果进行评估，将直接报告`EM`、`Accuracy`和`F1`等指标。\n\n## VisRAG-Ret\n\n```bash\nbash visrag_scripts\u002Feval_retriever\u002Feval.sh 512 2048 16 8 wmean causal ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA \u003Cckpt_path>\n```\n\n注：\n1. 我们的测试数据可在Hugging Face上的`VisRAG`集合中找到，本页开头已列出。\n2. 上述参数为我们论文中使用的参数，可用于复现实验结果。\n3. 评估脚本默认配置为使用Hugging Face上的数据集。若您希望使用本地下载的数据集仓库进行评估，可修改评估脚本中的`CORPUS_PATH`、`QUERY_PATH`、`QRELS_PATH`变量，使其指向本地仓库目录。\n\n## VisRAG-Gen\n我们的生成方式分为三种：基于文本的生成、基于单张图像的VLM生成以及基于多张图像的VLM生成。在基于单张图像的VLM生成中，又有两种额外设置：页面拼接和加权选择。有关这些设置的详细信息，请参阅我们的论文。\n```bash\npython visrag_scripts\u002Fgenerate\u002Fgenerate.py \\\n--model_name \u003Cmodel_name> \\\n--model_name_or_path \u003Cmodel_path> \\\n--dataset_name \u003Cdataset_name> \\\n--dataset_name_or_path \u003Cdataset_path> \\\n--rank \u003Cprocess_rank> \\ \n--world_size \u003Cworld_size> \\\n--topk \u003C用于生成的检索文档数量> \\\n--results_root_dir \u003C检索结果目录> \\\n--task_type \u003C任务类型> \\\n--concatenate_type \u003C图像拼接类型> \\\n--output_dir \u003C输出目录>\n```\n\n注：\n1. `use_positive_sample`决定是否仅使用查询的正例文档。启用此选项可排除检索到的文档，并省略`topk`和`results_root_dir`。若禁用，则必须指定`topk`（检索文档的数量）并按`results_root_dir\u002Fdataset_name\u002F*.trec`的格式组织`results_root_dir`。\n2. `concatenate_type`仅在`task_type`设置为`page_concatenation`时需要。若无需此设置，则可忽略。\n3. 始终需指定`model_name_or_path`、`dataset_name_or_path`和`output_dir`。\n4. 
仅在需要基于GPT的评估时才使用`--openai_api_key`。\n\n# 🔧 使用\n\n## EVisRAG\n\nHugging Face 上的模型：https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FEVisRAG-7B\n\n```python\nfrom transformers import AutoProcessor\nfrom vllm import LLM, SamplingParams\nfrom qwen_vl_utils import process_vision_info\n\ndef evidence_prompt_grpo(query):\n    return f\"\"\"你是一位AI视觉问答助手。我会提供一个问题和几张图片，请按照以下四个步骤操作：\n\n第一步：观察图片\n首先分析问题，思考哪些类型的图片可能包含相关信息。然后逐一检查每张图片，特别关注与问题相关的内容。判断每张图片是否包含潜在的相关信息。\n将你的观察结果用 \u003Cobserve>\u003C\u002Fobserve> 标签包裹起来。\n\n第二步：记录图片中的证据\n在查看完所有图片后，将你在每张图片中找到的证据用 \u003Cevidence>\u003C\u002Fevidence> 标签记录下来。\n如果你确定某张图片不包含任何相关信息，则记录为：[i]: 无相关信息（其中 i 表示图片的索引）。\n如果一张图片包含相关证据，则记录为：[j]: [你为该问题找到的证据]（其中 j 是图片的索引）。\n\n第三步：根据问题和证据进行推理\n基于已记录的证据，推导出问题的答案。\n将你的逐步推理过程用 \u003Cthink>\u003C\u002Fthink> 标签包裹起来。\n\n第四步：回答问题\n仅根据你在图片中找到的证据给出最终答案。\n将你的答案用 \u003Canswer>\u003C\u002Fanswer> 标签包裹起来。\n避免在最终答案中添加不必要的内容，例如，如果问题是“是\u002F否”问题，只需回答“是”或“否”。\n如果所有图片都不足以回答问题，则回复 \u003Canswer>无法回答\u003C\u002Fanswer>。\n\n格式要求：\n请严格按照 \u003Cobserve>、\u003Cevidence>、\u003Cthink> 和 \u003Canswer> 标签进行结构化输出。\n可能没有任何图片、仅有一张图片或有多张图片包含相关证据。\n如果你没有找到任何证据，或者证据不足无法帮助你回答问题，请按照上述关于信息不足的指示操作。\n\n问题和图片如下所示。请按照指示步骤操作。\n问题：{query}\n\"\"\"\n\nmodel_path = \"xxx\"\nprocessor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, padding_side='left')\n\nimgs, query = [\"imgpath1\", \"imgpath2\", ..., \"imgpathX\"], \"What xxx?\"\ninput_prompt = evidence_prompt_grpo(query)\n\ncontent = [{\"type\": \"text\", \"text\": input_prompt}]\nfor imgP in imgs:\n    content.append({\n        \"type\": \"image\",\n        \"image\": imgP\n    })\nmsg = [{\n          \"role\": \"user\",\n          \"content\": content,\n      }]\n\nllm = LLM(\n    model=model_path,\n    tensor_parallel_size=1,\n    dtype=\"bfloat16\",\n    limit_mm_per_prompt={\"image\":5, \"video\":0},\n)\n\nsampling_params = SamplingParams(\n    temperature=0.1,\n    repetition_penalty=1.05,\n    max_tokens=2048,\n)\n\nprompt = processor.apply_chat_template(\n    
msg,\n    tokenize=False,\n    add_generation_prompt=True,\n)\n\nimage_inputs, _ = process_vision_info(msg)\n\nmsg_input = [{\n    \"prompt\": prompt,\n    \"multi_modal_data\": {\"image\": image_inputs},\n}]\n\noutput_texts = llm.generate(msg_input,\n    sampling_params=sampling_params,\n)\n\nprint(output_texts[0].outputs[0].text)\n```\n\n\n## VisRAG-Ret\n\nHugging Face 上的模型：https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVisRAG-Ret\n\n```python\nfrom transformers import AutoModel, AutoTokenizer\nimport torch\nimport torch.nn.functional as F\nfrom PIL import Image\nimport os\n\ndef weighted_mean_pooling(hidden, attention_mask):\n    attention_mask_ = attention_mask * attention_mask.cumsum(dim=1)\n    s = torch.sum(hidden * attention_mask_.unsqueeze(-1).float(), dim=1)\n    d = attention_mask_.sum(dim=1, keepdim=True).float()\n    reps = s \u002F d\n    return reps\n\n@torch.no_grad()\ndef encode(text_or_image_list):\n    \n    if (isinstance(text_or_image_list[0], str)):\n        inputs = {\n            \"text\": text_or_image_list,\n            'image': [None] * len(text_or_image_list),\n            'tokenizer': tokenizer\n        }\n    else:\n        inputs = {\n            \"text\": [''] * len(text_or_image_list),\n            'image': text_or_image_list,\n            'tokenizer': tokenizer\n        }\n    outputs = model(**inputs)\n    attention_mask = outputs.attention_mask\n    hidden = outputs.last_hidden_state\n\n    reps = weighted_mean_pooling(hidden, attention_mask)   \n    embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()\n    return embeddings\n\nmodel_name_or_path = \"openbmb\u002FVisRAG-Ret\"\ntokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)\nmodel = AutoModel.from_pretrained(model_name_or_path, torch_dtype=torch.bfloat16, trust_remote_code=True)\nmodel.eval()\n\nscript_dir = os.path.dirname(os.path.realpath(__file__))\nqueries = [\"狗长什么样？\"]\npassages = [\n    
Image.open(os.path.join(script_dir, 'test_image\u002Fcat.jpeg')).convert('RGB'),\n    Image.open(os.path.join(script_dir, 'test_image\u002Fdog.jpg')).convert('RGB'),\n]\n\nINSTRUCTION = \"将此查询表示为用于检索相关文档的形式：\"\nqueries = [INSTRUCTION + query for query in queries]\n\nembeddings_query = encode(queries)\nembeddings_doc = encode(passages)\n\nscores = (embeddings_query @ embeddings_doc.T)\nprint(scores.tolist())\n```\n## VisRAG-Gen\n对于 `VisRAG-Gen`，你可以在 Google Colab 上探索包含 `VisRAG-Ret` 和 `VisRAG-Gen` 的 `VisRAG 流程`，以体验这个简单的演示。\n\n# 📄 许可证\n\n* 本仓库中的代码遵循 [Apache-2.0](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fblob\u002Fmain\u002FLICENSE) 许可协议。\n* **VisRAG-Ret** 模型权重的使用必须严格遵守 [MiniCPM 模型许可证.md](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fblob\u002Fmain\u002FMiniCPM%20Model%20License.md)。\n* **VisRAG-Ret** 的模型和权重完全免费供学术研究使用。填写一份 [\"问卷\"](https:\u002F\u002Fmodelbest.feishu.cn\u002Fshare\u002Fbase\u002Fform\u002FshrcnpV5ZT9EJ6xYjh3Kx0J6v8g) 进行注册后，**VisRAG-Ret** 权重也可免费用于商业用途。\n\n# 📧 联系方式\n## EVisRAG\n- Yubo Sun: syb2000417@stu.pku.edu.cn\n- Chunyi Peng: hm.cypeng@gmail.com\n## VisRAG\n- Shi Yu: yus21@mails.tsinghua.edu.cn\n- Chaoyue Tang: tcy006@gmail.com\n\n# 📈 星标历史\n\n\u003Ca href=\"https:\u002F\u002Fstar-history.com\u002F#openbmb\u002FVisRAG&Date\">\n \u003Cpicture>\n   \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_readme_c2e4ab0d714d.png&theme=dark\" \u002F>\n   \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_readme_c2e4ab0d714d.png\" \u002F>\n   \u003Cimg alt=\"星标历史图表\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_readme_c2e4ab0d714d.png\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>","# VisRAG 快速上手指南\n\nVisRAG 是一个基于视觉语言模型（VLM）的检索增强生成（RAG）框架。与传统文本 RAG 不同，它直接将文档作为图像嵌入和检索，最大程度保留原始文档中的视觉信息。**EVisRAG (VisRAG 2.0)** 进一步引入了证据引导的多图像推理能力，通过奖励范围化的 GRPO 
算法联合优化视觉感知与推理。\n\n## 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu)\n*   **Python**: 3.10 (EVisRAG) 或 3.10.8 (VisRAG)\n*   **GPU**: NVIDIA GPU (需安装 CUDA Toolkit)\n    *   VisRAG 训练\u002F推理推荐 CUDA 11.8\n*   **依赖管理**: Conda (推荐)\n\n## 安装步骤\n\n根据您想使用的模块（EVisRAG 或原版 VisRAG），选择对应的安装流程。\n\n### 方案 A：安装 EVisRAG (VisRAG 2.0)\n适用于多图像推理和端到端视觉问答任务。\n\n```bash\n# 1. 克隆代码库\ngit clone https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG.git\ncd VisRAG\n\n# 2. 创建并激活 Conda 环境\nconda create --name EVisRAG python==3.10\nconda activate EVisRAG\n\n# 3. 安装依赖\npip install -r EVisRAG_requirements.txt\n```\n\n### 方案 B：安装 VisRAG (含检索器训练)\n适用于构建完整的文档图像检索与生成流水线。\n\n```bash\n# 1. 克隆代码库\ngit clone https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG.git\ncd VisRAG\n\n# 2. 创建并激活 Conda 环境 (注意版本为 3.10.8)\nconda create --name VisRAG python==3.10.8\nconda activate VisRAG\n\n# 3. 安装 CUDA Toolkit (以 11.8 为例)\nconda install nvidia\u002Flabel\u002Fcuda-11.8.0::cuda-toolkit\n\n# 4. 安装核心依赖\npip install -r requirements.txt\npip install -e .\n\n# 5. 安装修改版的 timm 库 (支持梯度检查点以节省显存)\ncd timm_modified\npip install -e .\ncd ..\n```\n\n> **提示**：国内用户若遇到 pip 下载缓慢，可添加 `-i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple` 参数加速安装。\n\n## 基本使用\n\n以下展示如何使用 **EVisRAG** 进行最简单的推理演示。该模型基于 Qwen2.5-VL，能够直接处理多图像输入并进行证据引导的回答。\n\n### Python 推理示例\n\n确保已安装 `transformers`, `vllm`, `qwen_vl_utils` 等依赖，并从 Hugging Face 下载模型 (`openbmb\u002FEVisRAG-7B`)。\n\n```python\nfrom transformers import AutoProcessor\nfrom vllm import LLM, SamplingParams\nfrom qwen_vl_utils import process_vision_info\n\ndef evidence_prompt_grpo(query):\n    return f\"\"\"You are an AI Visual QA assistant. I will provide you with a question and several images. \n    First, observe each image carefully to extract relevant evidence. 
\n    Then, reason based on the collected evidence to answer the question.\n    \n    Question: {query}\n    \"\"\"\n\n# 初始化模型和处理器\nmodel_path = \"openbmb\u002FEVisRAG-7B\" # 或本地路径\nprocessor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)\n# 多图输入需通过 limit_mm_per_prompt 放宽每条 prompt 的图片数量上限\nllm = LLM(model=model_path, trust_remote_code=True, gpu_memory_utilization=0.9, limit_mm_per_prompt={\"image\": 5})\n\n# 准备输入数据 (示例：一个问题 + 多张图片)\nmessages = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"text\", \"text\": evidence_prompt_grpo(\"图中有多少个红色的圆圈？\")},\n            {\"type\": \"image\", \"image\": \"path\u002Fto\u002Fimage1.jpg\"},\n            {\"type\": \"image\", \"image\": \"path\u002Fto\u002Fimage2.jpg\"},\n        ]\n    }\n]\n\n# 处理视觉信息\nprompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\nimage_inputs, _ = process_vision_info(messages)\n\n# 生成回答（多模态数据需与 prompt 一同打包传入 generate）\nsampling_params = SamplingParams(temperature=0.7, max_tokens=512)\noutputs = llm.generate(\n    [{\"prompt\": prompt, \"multi_modal_data\": {\"image\": image_inputs}}],\n    sampling_params=sampling_params,\n)\n\n# 输出结果\nprint(outputs[0].outputs[0].text)\n```\n\n### 运行评估脚本 (可选)\n\n如果您需要复现论文中的评估结果，可以使用官方提供的脚本：\n\n```bash\n# 1. 生成预测结果\nbash evisrag_scripts\u002Fpredict.sh\n\n# 2. 
执行评估 (输出 EM, Accuracy, F1 等指标)\nbash evisrag_scripts\u002Feval.sh\n```\n\n> **注意**：运行评估前，请确保测试数据已下载至指定目录（参考项目首页的 HuggingFace Collection 链接）。","某医疗科研团队需要从海量历史病历扫描件（含手写笔记、复杂图表和影像报告）中快速检索并总结特定疾病的临床特征。\n\n### 没有 VisRAG 时\n- **信息严重丢失**：传统 OCR 工具无法准确识别手写医嘱和模糊印章，导致关键诊断依据在文本化过程中被遗漏或误读。\n- **图表理解失效**：病历中的趋势图和影像截图被强行转为纯文字描述，失去了视觉上的空间关系和细节特征，模型无法基于图像推理。\n- **流程繁琐低效**：必须先花费大量时间清洗和解析文档，一旦解析失败需人工介入修正，严重拖慢科研数据的准备周期。\n- **上下文割裂**：分块提取的文本碎片破坏了病历原本的版面逻辑，导致回答多模态问题时缺乏完整的证据链支持。\n\n### 使用 VisRAG 后\n- **原生视觉保留**：VisRAG 直接将病历图片作为整体嵌入检索，完整保留了手写字迹、印章和原始排版，杜绝了 OCR 带来的信息损耗。\n- **多图联合推理**：模型能同时观察并对比多张检索到的影像报告和化验单，像医生一样通过视觉线索进行跨页面的证据引导式推理。\n- **免解析直连**：省去了耗时的文档解析与清洗步骤，上传扫描即可直接构建知识库，让非结构化数据秒级转化为可查询的智能资产。\n- **证据可追溯**：生成的结论直接关联到原始图片的具体区域，研究人员可直观核对视觉证据，大幅提升了医疗问答的可信度。\n\nVisRAG 通过“免解析、全视觉”的创新架构，让 AI 真正读懂了复杂文档中的图像细节，将非结构化档案变成了高价值的智能知识库。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VisRAG_51e03562.png","OpenBMB","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FOpenBMB_02e4bd39.png","OpenBMB (Open Lab for Big Model Base) aims to build foundation models and systems towards AGI.",null,"openbmb@gmail.com","https:\u002F\u002Fwww.openbmb.cn","https:\u002F\u002Fgithub.com\u002FOpenBMB",[80,84,88,92,96,100,103],{"name":81,"color":82,"percentage":83},"Python","#3572A5",87.8,{"name":85,"color":86,"percentage":87},"MDX","#fcb32c",11.9,{"name":89,"color":90,"percentage":91},"Shell","#89e051",0.2,{"name":93,"color":94,"percentage":95},"Dockerfile","#384d54",0.1,{"name":97,"color":98,"percentage":99},"Jinja","#a52a22",0,{"name":101,"color":102,"percentage":99},"Makefile","#427819",{"name":104,"color":105,"percentage":99},"JavaScript","#f1e05a",947,71,"2026-04-17T13:58:32","Apache-2.0",4,"Linux","需要 NVIDIA GPU，需安装 CUDA 11.8.0 Toolkit (visrag_scripts\u002Ftrain_retriever\u002Ftrain.sh 中隐含多卡训练需求，具体显存未说明，建议大显存以支持多图像推理和训练)","未说明",{"notes":115,"python":116,"dependencies":117},"1. VisRAG 组件需要手动安装修改版的 timm 库 (`timm_modified`) 以支持梯度检查点。2. 训练分为两个阶段：SFT 阶段依赖 LLaMA-Factory，RS-GRPO 阶段依赖 Easy-R1。3. 
检索器训练基于 OpenMatch 框架修改。4. 明确指定使用 conda 创建环境并安装 CUDA 11.8.0 toolkit。5. 模型基于 Qwen2.5-VL 和 MiniCPM-V 架构。","3.10.8 (VisRAG), 3.10 (EVisRAG)",[118,119,120,121,122,123,124,125],"torch (隐含)","transformers (隐含)","timm (修改版)","LLaMA-Factory (用于 SFT 训练)","Easy-R1 (用于 RS-GRPO 训练)","OpenMatch (用于检索器训练框架)","vllm (用于 EVisRAG 推理)","qwen_vl_utils",[127,14,35],"其他",[129,130,131,132,133,134,135,136],"rag","retrieval","retrieval-augmented-generation","vision-language-model","multi-modal","multi-modality","document-retrieval","document-understanding","2026-03-27T02:49:30.150509","2026-04-19T03:03:13.669687",[140,145,150,155,160,165],{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},41089,"如何使用自己的数据集为 VisRAG-Ret 生成训练数据？","项目已开源使用 GPT-4o 生成训练数据的代码，位于 `scripts\u002Fdata\u002Fbatch_api.py`。使用前需将 PDF 文件的每一页保存为 JSON 格式，确保每个 JSON 文件包含 `image_base64` 字段，并将这些文件放入 `input_dir` 目录。然后运行命令：`python batch_api.py input_dir output_dir`，脚本将为每一页生成查询并保存到 `output_dir`。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG\u002Fissues\u002F25",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},41090,"遇到 'TypeError: MiniCPMV.forward() missing 1 required positional argument: data' 错误如何解决？","该错误通常是因为加载了错误的模型版本。代码逻辑通过检查 `model_dir\u002Fconfig.json` 中的 `_name_or_path` 是否包含 `MiniCPM-V-2-0` 来加载定制模型。如果配置文件中的名称被修改为 `MiniCPM-V-2`（官方模型名），会导致加载远程原始代码而非本地定制代码。解决方法是拉取最新代码（维护者已增强逻辑鲁棒性），并确保模型配置名称正确匹配。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG\u002Fissues\u002F6",{"id":151,"question_zh":152,"answer_zh":153,"source_url":154},41091,"在基于 OCR 的 RAG 实验中，文本是如何分块（chunking）处理的？","在使用 OCR 提取文本进行 RAG 时，项目采用按页（page by page）提取文本的方式，即每一页作为一个独立的文本块进行处理，未进行跨页合并或其他复杂的分组策略。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG\u002Fissues\u002F46",{"id":156,"question_zh":157,"answer_zh":158,"source_url":159},41092,"PlotQA 数据集上的文本提取过程有什么特殊之处吗？为什么复现结果不一致？","项目中并未使用 GPT-4o 进行 PlotQA 的文本提取。主要实验采用了两种方法：1. **OCR 方法**：使用 PPOCR 并结合相邻合并策略（参考附录 B.1）；2. 
**Captioner 方法**：使用基于 MiniCPM-V 2.0 训练的专用描述生成模型。结果差异可能源于使用了不同的提取工具或策略，建议检查是否复现了上述特定配置。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG\u002Fissues\u002F59",{"id":161,"question_zh":162,"answer_zh":163,"source_url":164},41093,"训练数据集中的 context_ls 字段是否开源？如果没有该如何获取？","本项目的训练数据未使用 `context_ls` 字段，该字段与模型训练无关，移除后不影响实验复现与模型性能。若研究者确实需要该字段，可自行从原始公开数据集（如 ChartQA、InfographicVQA、MMLongBench）中提取并处理获得相应信息。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisRAG\u002Fissues\u002F79",{"id":166,"question_zh":167,"answer_zh":168,"source_url":154},41094,"在哪里可以找到 MiniCPM (OCR) 模型的检索结果文件（.trec）？","维护者已整理好 `MiniCPM (OCR)` 的检索 .trec 文件，包含两个版本：一个是同时在合成数据和域内数据上训练的模型结果，另一个是仅在合成数据上训练的模型结果。文件可通过提供的 Google Drive 链接下载获取。",[]]