[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-open-compass--VLMEvalKit":3,"tool-open-compass--VLMEvalKit":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",153609,2,"2026-04-13T11:34:59",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":10,"env_os":98,"env_gpu":99,"env_ram":100,"env_deps":101,"category_tags":110,"github_topics":112,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":132,"updated_at":133,"faqs":134,"releases":170},7175,"open-compass\u002FVLMEvalKit","VLMEvalKit","Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks","VLMEvalKit 是一款专为大型视觉 - 语言模型（LVLMs）打造的开源评估工具箱。它旨在解决多模态模型评估中数据准备繁琐、基准测试分散的痛点，让用户只需一条命令即可在 80 多个主流基准上对超过 220 种模型进行高效评测，无需在不同代码库间反复切换。\n\n这款工具特别适合 AI 研究人员、算法工程师及开发者使用，无论是想要快速验证新模型性能，还是希望系统对比不同架构的优劣，都能从中获益。VLMEvalKit 采用基于生成的评估范式，同时支持“精确匹配”和“大模型辅助答案提取”两种判分机制，确保评估结果既严谨又灵活。\n\n其技术亮点在于持续进化的适应性：不仅最新支持了针对长文本输出的 TSV 格式保存，避免数据截断，还专门优化了对具备“思考模式”模型的处理，能精准解析思维链内容。此外，它紧跟前沿，已率先集成 
Video-MME-v2、SeePhys 等最新权威基准，帮助社区实时追踪多模态技术在视频理解与物理推理等领域的最新进展。通过统一的接口与丰富的记录，VLMEvalKit 正成为推动多模态模型标准化评估的重要基础设施。","![LOGO](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopen-compass_VLMEvalKit_readme_f32f9dd11351.jpg)\n\n\u003Cb>A Toolkit for Evaluating Large Vision-Language Models. \u003C\u002Fb>\n\n[![][github-contributors-shield]][github-contributors-link] • [![][github-forks-shield]][github-forks-link] • [![][github-stars-shield]][github-stars-link] • [![][github-issues-shield]][github-issues-link] • [![][github-license-shield]][github-license-link]\n\nEnglish | [简体中文](\u002Fdocs\u002Fzh-CN\u002FREADME_zh-CN.md) | [日本語](\u002Fdocs\u002Fja\u002FREADME_ja.md)\n\n\u003Ca href=\"https:\u002F\u002Frank.opencompass.org.cn\u002Fleaderboard-multimodal\">🏆 OC Leaderboard \u003C\u002Fa> •\n\u003Ca href=\"#%EF%B8%8F-quickstart\">🏗️Quickstart \u003C\u002Fa> •\n\u003Ca href=\"#-datasets-models-and-evaluation-results\">📊Datasets & Models \u003C\u002Fa> •\n\u003Ca href=\"#%EF%B8%8F-development-guide\">🛠️Development \u003C\u002Fa>\n\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopencompass\u002Fopen_vlm_leaderboard\">🤗 HF Leaderboard\u003C\u002Fa> •\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FVLMEval\u002FOpenVLMRecords\">🤗 Evaluation Records\u003C\u002Fa> •\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopencompass\u002Fopenvlm_video_leaderboard\">🤗 HF Video Leaderboard\u003C\u002Fa> •\n\n\u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FevDT4GZmxN\">🔊 Discord\u003C\u002Fa> •\n\u003Ca href=\"https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2407.11691\">📝 Report\u003C\u002Fa> •\n\u003Ca href=\"#-the-goal-of-vlmevalkit\">🎯Goal \u003C\u002Fa> •\n\u003Ca href=\"#%EF%B8%8F-citation\">🖊️Citation \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n**VLMEvalKit** (the Python package name is **vlmeval**) is an **open-source evaluation toolkit** for **large vision-language models (LVLMs)**. 
It enables **one-command evaluation** of LVLMs on various benchmarks, without the heavy workload of data preparation under multiple repositories. In VLMEvalKit, we adopt **generation-based evaluation** for all LVLMs, and provide the evaluation results obtained with both **exact matching** and **LLM-based answer extraction**.\n\n## Recent Codebase Changes\n- **[2025-09-12]** **Major Update: Improved Handling for Models with Thinking Mode**\n\n    A new feature in [PR 1229](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F1175) that improves support for models with thinking mode. VLMEvalKit now allows for the use of a custom `split_thinking` function. **We strongly recommend this for models with thinking mode to ensure the accuracy of evaluation**.  To use this new functionality, please enable the Environment Variable: `SPLIT_THINK=True`. By default, the function will parse content within `\u003Cthink>...\u003C\u002Fthink>` tags and store it in the `thinking` key of the output. For more advanced customization, you can also create a `split_think` function for model. Please see the InternVL implementation for an example.\n- **[2025-09-12]** **Major Update: Improved Handling for Long Response(More than 16k\u002F32k)**\n\n    A new feature in [PR 1229](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F1175) that improves support for models with long response outputs. VLMEvalKit can now save prediction files in TSV format. 
**Since individual cells in an `.xlsx` file are limited to 32,767 characters, we strongly recommend using this feature for models that generate long responses (e.g., exceeding 16k or 32k tokens) to prevent data truncation.** To use this new functionality, please enable the Environment Variable: `PRED_FORMAT=tsv`.\n- **[2025-08-04]** In [PR 1175](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F1175), we refine the `can_infer_option` and `can_infer_text`, which increasingly route the evaluation to LLM choice extractors and empirically leads to slight performance improvement for MCQ benchmarks.\n\n## 🆕 News\n\n- **[2026-04-08]** Supported [**Video-MME-v2**](https:\u002F\u002Fgithub.com\u002FMME-Benchmarks\u002FVideo-MME-v2). Video-MME-v2 is an authoritative benchmark towards the next stage in video understanding evaluation. 🔥🔥🔥\n- **[2025-07-07]** Supported [**SeePhys**](https:\u002F\u002Fseephys.github.io\u002F), which is a ​full spectrum multimodal benchmark for evaluating physics reasoning across different knowledge levels. thanks to [**Quinn777**](https:\u002F\u002Fgithub.com\u002FQuinn777) 🔥🔥🔥\n- **[2025-07-02]** Supported [**OvisU1**](https:\u002F\u002Fhuggingface.co\u002FAIDC-AI\u002FOvis-U1-3B), thanks to [**liyang-7**](https:\u002F\u002Fgithub.com\u002Fliyang-7) 🔥🔥🔥\n- **[2025-06-16]** Supported [**PhyX**](https:\u002F\u002Fphyx-bench.github.io\u002F), a benchmark aiming to assess capacity for physics-grounded reasoning in visual scenarios. 🔥🔥🔥\n- **[2025-05-24]** To facilitate faster evaluations for large-scale or thinking models, **VLMEvalKit supports multi-node distributed inference** using **LMDeploy**  (supports *InternVL Series, QwenVL Series, LLaMa4*) or **VLLM**(supports *QwenVL Series, LLaMa4*). You can activate this feature by adding the ```use_lmdeploy``` or ```use_vllm``` flag to your custom model configuration in [config.py](vlmeval\u002Fconfig.py) . 
Leverage these tools to significantly speed up your evaluation workflows 🔥🔥🔥\n- **[2025-05-24]** Supported Models: **InternVL3 Series, Gemini-2.5-Pro, Kimi-VL, LLaMA4, NVILA, Qwen2.5-Omni, Phi4, SmolVLM2, Grok, SAIL-VL-1.5, WeThink-Qwen2.5VL-7B, Bailingmm, VLM-R1, Taichu-VLR**. Supported Benchmarks: **HLE-Bench, MMVP, MM-AlignBench, Creation-MMBench, MM-IFEval, OmniDocBench, OCR-Reasoning, EMMA, ChaXiv，MedXpertQA, Physics, MSEarthMCQ, MicroBench, MMSci, VGRP-Bench, wildDoc, TDBench, VisuLogic, CVBench, LEGO-Puzzles, Video-MMLU, QBench-Video, MME-CoT, VLM2Bench, VMCBench, MOAT, Spatial457 Benchmark**. Please refer to [**VLMEvalKit Features**](https:\u002F\u002Faicarrier.feishu.cn\u002Fwiki\u002FQp7wwSzQ9iK1Y6kNUJVcr6zTnPe?table=tblsdEpLieDoCxtb) for more details. Thanks to all contributors 🔥🔥🔥\n- **[2025-02-20]** Supported Models: **InternVL2.5 Series, Qwen2.5VL Series, QVQ-72B, Doubao-VL, Janus-Pro-7B, MiniCPM-o-2.6, InternVL2-MPO, LLaVA-CoT, Hunyuan-Standard-Vision, Ovis2, Valley, SAIL-VL, Ross, Long-VITA, EMU3, SmolVLM**. Supported Benchmarks: **MMMU-Pro, WeMath, 3DSRBench, LogicVista, VL-RewardBench, CC-OCR, CG-Bench, CMMMU, WorldSense**. Thanks to all contributors 🔥🔥🔥\n- **[2024-12-11]** Supported [**NaturalBench**](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBaiqiL\u002FNaturalBench), a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.\n- **[2024-12-02]** Supported [**VisOnlyQA**](https:\u002F\u002Fgithub.com\u002Fpsunlpgroup\u002FVisOnlyQA\u002F), a benchmark for evaluating the visual perception capabilities 🔥🔥🔥\n- **[2024-11-26]** Supported [**Ovis1.6-Gemma2-27B**](https:\u002F\u002Fhuggingface.co\u002FAIDC-AI\u002FOvis1.6-Gemma2-27B), thanks to [**runninglsy**](https:\u002F\u002Fgithub.com\u002Frunninglsy) 🔥🔥🔥\n- **[2024-11-25]** Create a new flag `VLMEVALKIT_USE_MODELSCOPE`. 
By setting this environment variable, you can download the video benchmarks supported from [**modelscope**](https:\u002F\u002Fwww.modelscope.cn) 🔥🔥🔥\n\n## 🏗️ QuickStart\n\nSee [[QuickStart](\u002Fdocs\u002Fen\u002FQuickstart.md) | [快速开始](\u002Fdocs\u002Fzh-CN\u002FQuickstart.md)] for a quick start guide.\n\n## 📊 Datasets, Models, and Evaluation Results\n\n### Evaluation Results\n\n**The performance numbers on our official multi-modal leaderboards can be downloaded from here!**\n\n[**OpenVLM Leaderboard**](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopencompass\u002Fopen_vlm_leaderboard): [**Download All DETAILED Results**](http:\u002F\u002Fopencompass.openxlab.space\u002Fassets\u002FOpenVLM.json).\n\nCheck **Supported Benchmarks** Tab in [**VLMEvalKit Features**](https:\u002F\u002Faicarrier.feishu.cn\u002Fwiki\u002FQp7wwSzQ9iK1Y6kNUJVcr6zTnPe?table=tblsdEpLieDoCxtb) to view all supported image & video benchmarks (70+).\n\nCheck **Supported LMMs** Tab in [**VLMEvalKit Features**](https:\u002F\u002Faicarrier.feishu.cn\u002Fwiki\u002FQp7wwSzQ9iK1Y6kNUJVcr6zTnPe?table=tblsdEpLieDoCxtb) to view all supported LMMs, including commercial APIs, open-source models, and more (200+).\n\n**Transformers Version Recommendation:**\n\nNote that some VLMs may not be able to run under certain transformer versions, we recommend the following settings to evaluate each VLM:\n\n- **Please use** `transformers==4.33.0` **for**: `Qwen series`, `Monkey series`, `InternLM-XComposer Series`, `mPLUG-Owl2`, `OpenFlamingo v2`, `IDEFICS series`, `VisualGLM`, `MMAlaya`, `ShareCaptioner`, `MiniGPT-4 series`, `InstructBLIP series`, `PandaGPT`, `VXVERSE`.\n- **Please use** `transformers==4.36.2` **for**: `Moondream1`.\n- **Please use** `transformers==4.37.0` **for**: `LLaVA series`, `ShareGPT4V series`, `TransCore-M`, `LLaVA (XTuner)`, `CogVLM Series`, `EMU2 Series`, `Yi-VL Series`, `MiniCPM-[V1\u002FV2]`, `OmniLMM-12B`, `DeepSeek-VL series`, `InternVL series`, `Cambrian Series`, `VILA Series`, 
`Llama-3-MixSenseV1_1`, `Parrot-7B`, `PLLaVA Series`.\n- **Please use** `transformers==4.40.0` **for**: `IDEFICS2`, `Bunny-Llama3`, `MiniCPM-Llama3-V2.5`, `360VL-70B`, `Phi-3-Vision`, `WeMM`.\n- **Please use** `transformers==4.42.0` **for**: `AKI`.\n- **Please use** `transformers==4.44.0` **for**: `Moondream2`, `H2OVL series`.\n- **Please use** `transformers==4.45.0` **for**: `Aria`.\n- **Please use** `transformers==4.48.0` (or `4.46.0`) **for**: `LLaVA-Next series` (e.g., `llava-hf\u002Fllava-v1.6-vicuna-7b-hf`).\n- **Please use** `transformers==latest` **for**: `PaliGemma-3B`, `Chameleon series`, `Video-LLaVA-7B-HF`, `Ovis series`, `Mantis series`, `MiniCPM-V2.6`, `OmChat-v2.0-13B-sinlge-beta`, `Idefics-3`, `GLM-4v-9B`, `VideoChat2-HD`, `RBDash_72b`, `Llama-3.2 series`, `Kosmos series`.\n- **Please use** `transformers==4.50.3` (or `4.46.1` or `4.51` or `4.53`) **for**: `Molmo series`.\n- **Please use** `transformers>=5.2.0` **for**: `Qwen3.5 series`.\n\n**Torchvision Version Recommendation:**\n\nNote that some VLMs may not be able to run under certain torchvision versions, we recommend the following settings to evaluate each VLM:\n\n- **Please use** `torchvision>=0.16` **for**: `Moondream series` and `Aria`\n\n**Flash-attn Version Recommendation:**\n\nNote that some VLMs may not be able to run under certain flash-attention versions, we recommend the following settings to evaluate each VLM:\n\n- **Please use** `pip install flash-attn --no-build-isolation` **for**: `Aria`\n\n```python\n# Demo\nfrom vlmeval.config import supported_VLM\nmodel = supported_VLM['idefics_9b_instruct']()\n# Forward Single Image\nret = model.generate(['assets\u002Fapple.jpg', 'What is in this image?'])\nprint(ret)  # The image features a red apple with a leaf on it.\n# Forward Multiple Images\nret = model.generate(['assets\u002Fapple.jpg', 'assets\u002Fapple.jpg', 'How many apples are there in the provided images? 
'])\nprint(ret)  # There are two apples in the provided images.\n```\n\n## 🛠️ Development Guide\n\nTo develop custom benchmarks, VLMs, or simply contribute other code to **VLMEvalKit**, please refer to [[Development_Guide](\u002Fdocs\u002Fen\u002FDevelopment.md) | [开发指南](\u002Fdocs\u002Fzh-CN\u002FDevelopment.md)].\n\n**Call for contributions**\n\nTo encourage contributions from the community and share the corresponding credit (in the next report update):\n\n- All contributions will be acknowledged in the report.\n- Contributors with 3 or more major contributions (implementing an MLLM, benchmark, or major feature) can join the author list of the [VLMEvalKit Technical Report](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2407.11691) on arXiv. Eligible contributors can create an issue or DM kennyutc in the [VLMEvalKit Discord Channel](https:\u002F\u002Fdiscord.com\u002Finvite\u002FevDT4GZmxN).\n\nHere is a [contributor list](\u002Fdocs\u002Fen\u002FContributors.md) we curated based on the records.\n\n## 🎯 The Goal of VLMEvalKit\n\n**The codebase is designed to:**\n\n1. Provide an **easy-to-use**, **open-source evaluation toolkit** to make it convenient for researchers & developers to evaluate existing LVLMs and make evaluation results **easy to reproduce**.\n2. Make it easy for VLM developers to evaluate their own models. To evaluate a VLM on multiple supported benchmarks, one just needs to **implement a single `generate_inner()` function**; all other workloads (data downloading, data preprocessing, prediction inference, metric calculation) are handled by the codebase.\n\n**The codebase is not designed to:**\n\n1. Reproduce the exact accuracy numbers reported in the original papers of all **3rd party benchmarks**. The reasons are two-fold:\n   1. VLMEvalKit uses **generation-based evaluation** for all VLMs (optionally with **LLM-based answer extraction**). Meanwhile, some benchmarks may use different approaches (*e.g.*, SEEDBench uses PPL-based evaluation). 
For those benchmarks, we compare both scores in the corresponding result. We encourage developers to support other evaluation paradigms in the codebase.\n   2. By default, we use the same prompt template for all VLMs to evaluate on a benchmark. Meanwhile, **some VLMs may have their specific prompt templates** (some may not be covered by the codebase at this time). We encourage VLM developers to implement their own prompt template in VLMEvalKit if it is not currently covered. That will help improve reproducibility.\n\n## 🖊️ Citation\n\nIf you find this work helpful, please consider giving this repo a **star🌟**. Thanks for your support!\n\n[![Stargazers repo roster for @open-compass\u002FVLMEvalKit](https:\u002F\u002Freporoster.com\u002Fstars\u002Fopen-compass\u002FVLMEvalKit)](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fstargazers)\n\nIf you use VLMEvalKit in your research or wish to refer to published open-source evaluation results, please use the following BibTeX entry and the BibTeX entry corresponding to the specific VLM \u002F benchmark you used.\n\n```bib\n@inproceedings{duan2024vlmevalkit,\n  title={Vlmevalkit: An open-source toolkit for evaluating large multi-modality models},\n  author={Duan, Haodong and Yang, Junming and Qiao, Yuxuan and Fang, Xinyu and Chen, Lin and Liu, Yuan and Dong, Xiaoyi and Zang, Yuhang and Zhang, Pan and Wang, Jiaqi and others},\n  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},\n  pages={11198--11201},\n  year={2024}\n}\n```\n\n\u003Cp align=\"right\">\u003Ca href=\"#top\">🔝Back to top\u003C\u002Fa>\u003C\u002Fp>\n\n[github-contributors-link]: https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fgraphs\u002Fcontributors\n[github-contributors-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002Fopen-compass\u002FVLMEvalKit?color=c4f042&labelColor=black&style=flat-square\n[github-forks-link]: 
https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fnetwork\u002Fmembers\n[github-forks-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fopen-compass\u002FVLMEvalKit?color=8ae8ff&labelColor=black&style=flat-square\n[github-issues-link]: https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fissues\n[github-issues-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Fopen-compass\u002FVLMEvalKit?color=ff80eb&labelColor=black&style=flat-square\n[github-license-link]: https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fblob\u002Fmain\u002FLICENSE\n[github-license-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fopen-compass\u002FVLMEvalKit?color=white&labelColor=black&style=flat-square\n[github-stars-link]: https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fstargazers\n[github-stars-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopen-compass\u002FVLMEvalKit?color=ffcb47&labelColor=black&style=flat-square\n","![LOGO](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopen-compass_VLMEvalKit_readme_f32f9dd11351.jpg)\n\n\u003Cb>大型视觉-语言模型评估工具包。\u003C\u002Fb>\n\n[![][github-contributors-shield]][github-contributors-link] • [![][github-forks-shield]][github-forks-link] • [![][github-stars-shield]][github-stars-link] • [![][github-issues-shield]][github-issues-link] • [![][github-license-shield]][github-license-link]\n\nEnglish | [简体中文](\u002Fdocs\u002Fzh-CN\u002FREADME_zh-CN.md) | [日本語](\u002Fdocs\u002Fja\u002FREADME_ja.md)\n\n\u003Ca href=\"https:\u002F\u002Frank.opencompass.org.cn\u002Fleaderboard-multimodal\">🏆 OC排行榜 \u003C\u002Fa> •\n\u003Ca href=\"#%EF%B8%8F-quickstart\">🏗️快速入门 \u003C\u002Fa> •\n\u003Ca href=\"#-datasets-models-and-evaluation-results\">📊数据集与模型 \u003C\u002Fa> •\n\u003Ca href=\"#%EF%B8%8F-development-guide\">🛠️开发指南 \u003C\u002Fa>\n\n\u003Ca 
href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopencompass\u002Fopen_vlm_leaderboard\">🤗 HF排行榜\u003C\u002Fa> •\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FVLMEval\u002FOpenVLMRecords\">🤗 评估记录\u003C\u002Fa> •\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopencompass\u002Fopenvlm_video_leaderboard\">🤗 HF视频排行榜\u003C\u002Fa> •\n\n\u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FevDT4GZmxN\">🔊 Discord\u003C\u002Fa> •\n\u003Ca href=\"https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2407.11691\">📝 报告\u003C\u002Fa> •\n\u003Ca href=\"#-the-goal-of-vlmevalkit\">🎯目标 \u003C\u002Fa> •\n\u003Ca href=\"#%EF%B8%8F-citation\">🖊️引用 \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n**VLMEvalKit**（Python软件包名为 **vlmeval**）是一个针对 **大型视觉-语言模型（LVLMs）** 的 **开源评估工具包**。它能够在多个基准测试上实现对 LVLMs 的 **一键式评估**，而无需在多个代码库中进行繁琐的数据准备工作。在 VLMEvalKit 中，我们对所有 LVLMs 采用 **基于生成的评估方法**，并提供通过 **精确匹配** 和 **基于 LLM 的答案提取** 所获得的评估结果。\n\n## 最近的代码库变更\n- **[2025-09-12]** **重大更新：改进了对具有思考模式模型的处理**\n\n    在 [PR 1229](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F1175) 中新增了一项功能，用于更好地支持具有思考模式的模型。VLMEvalKit 现在允许使用自定义的 `split_thinking` 函数。**我们强烈建议具有思考模式的模型使用此功能，以确保评估的准确性**。要使用此新功能，请启用环境变量：`SPLIT_THINK=True`。默认情况下，该函数会解析 `\u003Cthink>... 
\u003C\u002Fthink>` 标签内的内容，并将其存储在输出的 `thinking` 键中。如需更高级的自定义，您也可以为模型创建一个 `split_think` 函数。请参阅 InternVL 的实现作为示例。\n- **[2025-09-12]** **重大更新：改进了对长响应（超过16k\u002F32k）的处理**\n\n    在 [PR 1229](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F1175) 中新增了一项功能，用于更好地支持具有长响应输出的模型。VLMEvalKit 现在可以将预测文件保存为 TSV 格式。**由于 `.xlsx` 文件中的单个单元格最多只能容纳 32,767 个字符，因此我们强烈建议对于生成长响应的模型（例如超过 16k 或 32k 个标记）使用此功能，以防止数据截断。** 要使用此新功能，请启用环境变量：`PRED_FORMAT=tsv`。\n- **[2025-08-04]** 在 [PR 1175](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F1175) 中，我们优化了 `can_infer_option` 和 `can_infer_text`，这使得评估更加倾向于使用 LLM 选项提取器，并且在多项选择题基准测试中带来了轻微的性能提升。\n\n## 🆕 新闻\n\n- **[2026-04-08]** 支持了 [**Video-MME-v2**](https:\u002F\u002Fgithub.com\u002FMME-Benchmarks\u002FVideo-MME-v2)。Video-MME-v2 是面向视频理解评估下一阶段的权威基准测试。🔥🔥🔥\n- **[2025-07-07]** 支持了 [**SeePhys**](https:\u002F\u002Fseephys.github.io\u002F)，这是一个涵盖不同知识水平的物理推理评估的全谱多模态基准测试。感谢 [**Quinn777**](https:\u002F\u002Fgithub.com\u002FQuinn777) 🔥🔥🔥\n- **[2025-07-02]** 支持了 [**OvisU1**](https:\u002F\u002Fhuggingface.co\u002FAIDC-AI\u002FOvis-U1-3B)，感谢 [**liyang-7**](https:\u002F\u002Fgithub.com\u002Fliyang-7) 🔥🔥🔥\n- **[2025-06-16]** 支持了 [**PhyX**](https:\u002F\u002Fphyx-bench.github.io\u002F)，这是一个旨在评估视觉场景中基于物理学推理能力的基准测试。🔥🔥🔥\n- **[2025-05-24]** 为了便于对大规模或具有思考模式的模型进行更快的评估，**VLMEvalKit 支持使用 LMDeploy**（支持 *InternVL 系列、QwenVL 系列、LLaMa4*）或 **VLLM**（支持 *QwenVL 系列、LLaMa4*）进行多节点分布式推理。您可以通过在 [config.py](vlmeval\u002Fconfig.py) 中的自定义模型配置中添加 ```use_lmdeploy``` 或 ```use_vllm``` 标志来激活此功能。利用这些工具可以显著加快您的评估流程 🔥🔥🔥\n- **[2025-05-24]** 支持的模型：**InternVL3 
系列、Gemini-2.5-Pro、Kimi-VL、LLaMA4、NVILA、Qwen2.5-Omni、Phi4、SmolVLM2、Grok、SAIL-VL-1.5、WeThink-Qwen2.5VL-7B、Bailingmm、VLM-R1、Taichu-VLR**。支持的基准测试：**HLE-Bench、MMVP、MM-AlignBench、Creation-MMBench、MM-IFEval、OmniDocBench、OCR-Reasoning、EMMA、ChaXiv、MedXpertQA、Physics、MSEarthMCQ、MicroBench、MMSci、VGRP-Bench、wildDoc、TDBench、VisuLogic、CVBench、LEGO-Puzzles、Video-MMLU、QBench-Video、MME-CoT、VLM2Bench、VMCBench、MOAT、Spatial457 Benchmark**。更多详情请参阅 [**VLMEvalKit 功能**](https:\u002F\u002Faicarrier.feishu.cn\u002Fwiki\u002FQp7wwSzQ9iK1Y6kNUJVcr6zTnPe?table=tblsdEpLieDoCxtb)。感谢所有贡献者 🔥🔥🔥\n- **[2025-02-20]** 支持的模型：**InternVL2.5 系列、Qwen2.5VL 系列、QVQ-72B、Doubao-VL、Janus-Pro-7B、MiniCPM-o-2.6、InternVL2-MPO、LLaVA-CoT、Hunyuan-Standard-Vision、Ovis2、Valley、SAIL-VL、Ross、Long-VITA、EMU3、SmolVLM**。支持的基准测试：**MMMU-Pro、WeMath、3DSRBench、LogicVista、VL-RewardBench、CC-OCR、CG-Bench、CMMMU、WorldSense**。感谢所有贡献者 🔥🔥🔥\n- **[2024-12-11]** 支持了 [**NaturalBench**](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBaiqiL\u002FNaturalBench)，这是一个以视觉为中心的 VQA 基准测试（NeurIPS'24），它通过关于自然图像的简单问题来挑战视觉-语言模型。\n- **[2024-12-02]** 支持了 [**VisOnlyQA**](https:\u002F\u002Fgithub.com\u002Fpsunlpgroup\u002FVisOnlyQA\u002F)，这是一个用于评估视觉感知能力的基准测试 🔥🔥🔥\n- **[2024-11-26]** 支持了 [**Ovis1.6-Gemma2-27B**](https:\u002F\u002Fhuggingface.co\u002FAIDC-AI\u002FOvis1.6-Gemma2-27B)，感谢 [**runninglsy**](https:\u002F\u002Fgithub.com\u002Frunninglsy) 🔥🔥🔥\n- **[2024-11-25]** 创建了一个新的标志 `VLMEVALKIT_USE_MODELSCOPE`。通过设置此环境变量，您可以从 [**modelscope**](https:\u002F\u002Fwww.modelscope.cn) 下载支持的视频基准测试 🔥🔥🔥\n\n## 🏗️ 快速入门\n\n有关快速入门指南，请参阅 [[QuickStart](\u002Fdocs\u002Fen\u002FQuickstart.md) | [快速开始](\u002Fdocs\u002Fzh-CN\u002FQuickstart.md)]。\n\n## 📊 数据集、模型和评估结果\n\n### 评估结果\n\n**我们官方多模态排行榜上的性能数据可从此处下载！**\n\n[**OpenVLM 排行榜**](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopencompass\u002Fopen_vlm_leaderboard)：[**下载所有详细结果**](http:\u002F\u002Fopencompass.openxlab.space\u002Fassets\u002FOpenVLM.json)。\n\n请查看 [**VLMEvalKit 
功能**](https:\u002F\u002Faicarrier.feishu.cn\u002Fwiki\u002FQp7wwSzQ9iK1Y6kNUJVcr6zTnPe?table=tblsdEpLieDoCxtb) 中的“支持的基准”选项卡，以查看所有受支持的图像和视频基准（70+）。\n\n请查看 [**VLMEvalKit 功能**](https:\u002F\u002Faicarrier.feishu.cn\u002Fwiki\u002FQp7wwSzQ9iK1Y6kNUJVcr6zTnPe?table=tblsdEpLieDoCxtb) 中的“支持的 LMM”选项卡，以查看所有受支持的 LMM，包括商业 API、开源模型等（200+）。\n\n**Transformers 版本推荐：**\n\n请注意，某些 VLM 在特定版本的 Transformers 下可能无法运行。我们建议使用以下设置来评估每种 VLM：\n\n- **请使用** `transformers==4.33.0` **用于**：`Qwen 系列`、`Monkey 系列`、`InternLM-XComposer 系列`、`mPLUG-Owl2`、`OpenFlamingo v2`、`IDEFICS 系列`、`VisualGLM`、`MMAlaya`、`ShareCaptioner`、`MiniGPT-4 系列`、`InstructBLIP 系列`、`PandaGPT`、`VXVERSE`。\n- **请使用** `transformers==4.36.2` **用于**：`Moondream1`。\n- **请使用** `transformers==4.37.0` **用于**：`LLaVA 系列`、`ShareGPT4V 系列`、`TransCore-M`、`LLaVA (XTuner)`、`CogVLM 系列`、`EMU2 系列`、`Yi-VL 系列`、`MiniCPM-[V1\u002FV2]`、`OmniLMM-12B`、`DeepSeek-VL 系列`、`InternVL 系列`、`Cambrian 系列`、`VILA 系列`、`Llama-3-MixSenseV1_1`、`Parrot-7B`、`PLLaVA 系列`。\n- **请使用** `transformers==4.40.0` **用于**：`IDEFICS2`、`Bunny-Llama3`、`MiniCPM-Llama3-V2.5`、`360VL-70B`、`Phi-3-Vision`、`WeMM`。\n- **请使用** `transformers==4.42.0` **用于**：`AKI`。\n- **请使用** `transformers==4.44.0` **用于**：`Moondream2`、`H2OVL 系列`。\n- **请使用** `transformers==4.45.0` **用于**：`Aria`。\n- **请使用** `transformers==4.48.0`（或 `4.46.0`） **用于**：`LLaVA-Next 系列`（例如 `llava-hf\u002Fllava-v1.6-vicuna-7b-hf`）。\n- **请使用** `transformers==latest` **用于**：`PaliGemma-3B`、`Chameleon 系列`、`Video-LLaVA-7B-HF`、`Ovis 系列`、`Mantis 系列`、`MiniCPM-V2.6`、`OmChat-v2.0-13B-sinlge-beta`、`Idefics-3`、`GLM-4v-9B`、`VideoChat2-HD`、`RBDash_72b`、`Llama-3.2 系列`、`Kosmos 系列`。\n- **请使用** `transformers==4.50.3`（或 `4.46.1`、`4.51` 或 `4.53`） **用于**：`Molmo 系列`。\n- **请使用** `transformers>=5.2.0` **用于**：`Qwen3.5 系列`。\n\n**Torchvision 版本推荐：**\n\n请注意，某些 VLM 在特定版本的 torchvision 下可能无法运行。我们建议使用以下设置来评估每种 VLM：\n\n- **请使用** `torchvision>=0.16` **用于**：`Moondream 系列` 和 `Aria`。\n\n**Flash-attn 版本推荐：**\n\n请注意，某些 VLM 在特定版本的 flash-attention 下可能无法运行。我们建议使用以下设置来评估每种 VLM：\n\n- 
**请使用** `pip install flash-attn --no-build-isolation` **用于**：`Aria`。\n\n```python\n# 示例\nfrom vlmeval.config import supported_VLM\nmodel = supported_VLM['idefics_9b_instruct']()\n# 单张图片推理\nret = model.generate(['assets\u002Fapple.jpg', '这张图片里有什么？'])\nprint(ret)  # 图片中有一颗带有叶子的红苹果。\n# 多张图片推理\nret = model.generate(['assets\u002Fapple.jpg', 'assets\u002Fapple.jpg', '提供的图片中有几颗苹果？'])\nprint(ret)  # 提供的图片中有两颗苹果。\n```\n\n## 🛠️ 开发指南\n\n如需开发自定义基准、VLM，或仅为 **VLMEvalKit** 贡献其他代码，请参阅 [[Development_Guide](\u002Fdocs\u002Fen\u002FDevelopment.md) | [开发指南](\u002Fdocs\u002Fzh-CN\u002FDevelopment.md)]。\n\n**征集贡献**\n\n为鼓励社区贡献并分享相应荣誉（在下一次报告更新中）：\n\n- 所有贡献都将在报告中被致谢。\n- 拥有3项或以上重大贡献（实现一个 MLLM、基准或主要功能）的贡献者，可加入 ArXiv 上的 [VLMEvalKit 技术报告](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2407.11691) 的作者名单。符合条件的贡献者可在 [VLMEvalKit Discord 频道](https:\u002F\u002Fdiscord.com\u002Finvite\u002FevDT4GZmxN) 中创建议题或私信 kennyutc。\n\n我们根据记录整理了一份 [贡献者列表](\u002Fdocs\u002Fen\u002FContributors.md)。\n\n## 🎯 VLMEvalKit 的目标\n\n**该代码库旨在：**\n\n1. 提供一个**易于使用**、**开源的评估工具包**，方便研究人员和开发者评估现有的 LVLM，并使评估结果**易于复现**。\n2. 方便 VLM 开发者评估自己的模型。只需**实现一个 `generate_inner()` 函数**，即可在多个受支持的基准上评估 VLM；其余工作负载（数据下载、数据预处理、预测推理、指标计算）均由代码库自动处理。\n\n**该代码库并非旨在：**\n\n1. 复现所有**第三方基准**原始论文中报告的确切准确率。原因有两点：\n   1. VLMEvalKit 对所有 VLM 采用**基于生成的评估方法**（并可选地结合**基于 LLM 的答案提取**）。而部分基准可能采用不同的评估方式（例如 SEEDBench 使用 PPL 评估）。对于这些基准，我们在相应结果中同时比较两种分数。我们鼓励开发者在代码库中支持其他评估范式。\n   2. 
默认情况下，我们对所有 VLM 在同一基准上使用相同的提示模板进行评估。然而，**某些 VLM 可能有其特定的提示模板**（目前代码库可能尚未覆盖）。我们鼓励 VLM 开发者在 VLMEvalKit 中实现他们自己的提示模板，若当前尚未支持的话。这将有助于提高结果的可重复性。\n\n## 🖊️ 引用\n\n如果您觉得这项工作对您有帮助，请考虑给本仓库**加星🌟**。感谢您的支持！\n\n[![@open-compass\u002FVLMEvalKit 的星辰榜](https:\u002F\u002Freporoster.com\u002Fstars\u002Fopen-compass\u002FVLMEvalKit)](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fstargazers)\n\n如果您在研究中使用了 VLMEvalKit，或希望引用已发布的开源评估结果，请使用以下 BibTeX 条目，以及您所使用的特定多模态大模型或基准测试对应的 BibTeX 条目。\n\n```bibtex\n@inproceedings{duan2024vlmevalkit,\n  title={Vlmevalkit: 用于评估大型多模态模型的开源工具包},\n  author={Duan, Haodong and Yang, Junming and Qiao, Yuxuan and Fang, Xinyu and Chen, Lin and Liu, Yuan and Dong, Xiaoyi and Zang, Yuhang and Zhang, Pan and Wang, Jiaqi and others},\n  booktitle={第32届 ACM 国际多媒体会议论文集},\n  pages={11198--11201},\n  year={2024}\n}\n```\n\n\u003Cp align=\"right\">\u003Ca href=\"#top\">🔝返回顶部\u003C\u002Fa>\u003C\u002Fp>\n\n[github-contributors-link]: https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fgraphs\u002Fcontributors\n[github-contributors-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002Fopen-compass\u002FVLMEvalKit?color=c4f042&labelColor=black&style=flat-square\n[github-forks-link]: https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fnetwork\u002Fmembers\n[github-forks-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fopen-compass\u002FVLMEvalKit?color=8ae8ff&labelColor=black&style=flat-square\n[github-issues-link]: https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fissues\n[github-issues-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Fopen-compass\u002FVLMEvalKit?color=ff80eb&labelColor=black&style=flat-square\n[github-license-link]: https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fblob\u002Fmain\u002FLICENSE\n[github-license-shield]: 
https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fopen-compass\u002FVLMEvalKit?color=white&labelColor=black&style=flat-square\n[github-stars-link]: https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fstargazers\n[github-stars-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopen-compass\u002FVLMEvalKit?color=ffcb47&labelColor=black&style=flat-square","# VLMEvalKit 快速上手指南\n\nVLMEvalKit 是一个开源的大型视觉 - 语言模型（LVLM）评估工具包，支持一键式评估多种基准测试，无需繁琐的数据准备工作。\n\n## 1. 环境准备\n\n### 系统要求\n- **Python**: 建议 Python 3.8+\n- **操作系统**: Linux (推荐), macOS, Windows\n\n### 前置依赖与版本注意\n不同模型对 `transformers` 和 `torchvision` 版本有特定要求。在开始之前，请根据你要评估的模型选择正确的环境版本：\n\n*   **Qwen 系列 \u002F InternLM-XComposer \u002F MiniGPT-4 等**: `transformers==4.33.0`\n*   **LLaVA 系列 \u002F CogVLM \u002F DeepSeek-VL \u002F InternVL 系列**: `transformers==4.37.0`\n*   **LLaVA-Next 系列**: `transformers==4.48.0` (或 4.46.0)\n*   **最新模型 (Ovis, GLM-4v, Llama-3.2 等)**: `transformers==latest`\n*   **Qwen3.5 系列**: `transformers>=5.2.0`\n*   **Moondream \u002F Aria**: 需 `torchvision>=0.16`\n*   **Aria**: 需安装 `flash-attn` (`pip install flash-attn --no-build-isolation`)\n\n> **提示**：建议在虚拟环境中安装，以避免版本冲突。\n\n## 2. 安装步骤\n\n### 基础安装\n从 GitHub 克隆仓库并安装依赖：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit.git\ncd VLMEvalKit\npip install -e .\n```\n\n### 国内加速方案（推荐）\n如果你在中国大陆，建议使用镜像源加速下载，并配置 ModelScope 以快速获取视频基准数据：\n\n```bash\n# 使用清华源安装依赖\npip install -e . -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# 启用 ModelScope 下载视频基准数据 (可选)\nexport VLMEVALKIT_USE_MODELSCOPE=1\n```\n\n## 3. 基本使用\n\nVLMEvalKit 的核心优势是极简的 API。以下是一个最简单的 Python 调用示例，演示如何加载模型并进行单图或多图推理。\n\n### 代码示例\n\n```python\nfrom vlmeval.config import supported_VLM\n\n# 1. 初始化模型 (以 idefics_9b_instruct 为例)\n# 你可以在 supported_VLM 字典中查找其他已支持的模型名称\nmodel = supported_VLM['idefics_9b_instruct']()\n\n# 2. 
单张图片推理\nret = model.generate(['assets\u002Fapple.jpg', 'What is in this image?'])\nprint(ret)  \n# 输出示例：The image features a red apple with a leaf on it.\n\n# 3. 多张图片推理\nret = model.generate([\n    'assets\u002Fapple.jpg', \n    'assets\u002Fapple.jpg', \n    'How many apples are there in the provided images?'\n])\nprint(ret)  \n# 输出示例：There are two apples in the provided images.\n```\n\n### 高级功能提示\n*   **思维链模式支持**：对于具有思考模式（Thinking Mode）的模型，建议设置环境变量以正确解析 `\u003Cthink>` 标签：\n    ```bash\n    export SPLIT_THINK=True\n    ```\n*   **长文本输出支持**：若模型生成长回复（超过 16k\u002F32k tokens），建议将预测结果保存为 TSV 格式以防 Excel 截断：\n    ```bash\n    export PRED_FORMAT=tsv\n    ```\n*   **分布式加速**：对于大规模评估，可在配置文件 `vlmeval\u002Fconfig.py` 中添加 `use_lmdeploy` 或 `use_vllm` 标志来启用分布式推理加速。\n\n更多详细的基准测试列表、模型支持情况及开发指南，请参阅项目官方文档或飞书特性表。","某多模态算法团队正在研发新一代视觉语言模型，急需在发布前对 5 个候选模型进行全面的性能基准测试，以筛选出最优版本。\n\n### 没有 VLMEvalKit 时\n- **环境配置繁琐**：团队需手动克隆 80+ 个不同数据集的仓库，逐一处理数据格式转换，耗时数天且极易出错。\n- **评估标准不一**：针对不同模型需编写独立的推理脚本，难以统一采用“生成式评估”或\"LLM 辅助提取”等先进策略，导致结果不可比。\n- **长文本支持缺失**：面对输出超过 32k token 的复杂推理任务，传统 Excel 存储方式直接截断内容，导致关键逻辑链丢失，无法准确评估模型能力。\n- **新特性适配困难**：对于具备“思维链（Thinking Mode）”的新模型，缺乏自动解析 `\u003Cthink>` 标签的机制，人工清洗答案效率极低。\n\n### 使用 VLMEvalKit 后\n- **一键启动评测**：只需一条命令即可自动拉取并预处理 220+ 种模型所需的 80+ 个基准数据集，将准备时间从数天缩短至几分钟。\n- **统一评估范式**：内置标准化的生成式评估流程，自动切换精确匹配或 LLM 提取模式，确保所有候选模型在公平条件下产出可比数据。\n- **无损长文记录**：通过设置 `PRED_FORMAT=tsv`，完美支持超长响应保存，彻底解决因单元格字符限制导致的数据截断问题。\n- **智能思维解析**：开启 `SPLIT_THINK=True` 后，自动识别并分离模型的思考过程与最终答案，精准评估具备深度推理能力模型的真实水平。\n\nVLMEvalKit 
将原本碎片化、高门槛的多模态评测工作转化为标准化、自动化的流水线，让研发团队能专注于模型迭代而非数据杂务。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopen-compass_VLMEvalKit_f32f9dd1.jpg","open-compass","OpenCompass","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fopen-compass_6ef39538.png","",null,"opencompass@pjlab.org.cn","OpenCompassX","opencompass.org.cn","https:\u002F\u002Fgithub.com\u002Fopen-compass",[82,86,90],{"name":83,"color":84,"percentage":85},"Python","#3572A5",99.7,{"name":87,"color":88,"percentage":89},"Jupyter Notebook","#DA5B0B",0.3,{"name":91,"color":92,"percentage":93},"Shell","#89e051",0,4030,678,"2026-04-13T04:50:20","Apache-2.0","未说明","需要 NVIDIA GPU（隐含，因依赖 flash-attn 和 CUDA），具体显存需求取决于所选模型（如支持长文本或多节点推理），需安装 flash-attn（部分模型如 Aria 需特定安装方式）","未说明（建议根据模型大小配置，大规模或思维链模型推荐多节点分布式推理）",{"notes":102,"python":98,"dependencies":103},"核心依赖 transformers 的版本严格依赖于具体评估的模型（例如 Qwen 系列需 4.33.0，LLaVA 系列需 4.37.0，Qwen3.5 需>=5.2.0 等），请务必根据目标模型切换版本。支持通过设置环境变量 SPLIT_THINK=True 处理思维链模型，设置 PRED_FORMAT=tsv 防止长文本输出截断。支持使用 LMDeploy 或 VLLM 进行多节点分布式推理以加速评估。视频基准测试可通过设置 VLMEVALKIT_USE_MODELSCOPE 从 ModelScope 下载。",[104,105,106,107,108,109],"torch","torchvision>=0.16 (针对 Moondream, Aria)","transformers (版本依模型而定，范围 4.33.0 - 5.2.0+)","flash-attn (部分模型必需)","LMDeploy (可选，用于加速 InternVL, QwenVL, LLaMa4)","VLLM (可选，用于加速 QwenVL, LLaMa4)",[14,111,35,15,13,52],"其他",[113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131],"gpt-4v","large-language-models","llava","multi-modal","openai","vqa","llm","openai-api","qwen","gpt","computer-vision","pytorch","gpt4","chatgpt","clip","vit","evaluation","claude","gemini","2026-03-27T02:49:30.150509","2026-04-13T23:53:18.996720",[135,140,145,150,155,160,165],{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},32203,"如何配置本地数据集路径以避免重复下载或链接失效？","可以将 `LMUData` 环境变量设置为所有数据存储的根目录路径。例如，设置 `LMUData=.\u002Fdata`，然后在 `data\u002F` 目录下直接存放对应的 `.tsv` 文件（如 
`MMBench_DEV_EN.tsv`）。这样工具会优先读取本地文件，无需重新下载。","https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fissues\u002F412",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},32204,"如何使用 GPT API 进行多选题任务的评判？","默认的多选题评判方法是精确匹配（exact-matching）。若要使用 GPT API 评判，需配置相应的 Judge Model。目前为了榜单公平性，官方统一使用 `chatgpt-0125` 报告结果。用户也可以尝试使用 `GPT-4o` 作为评判模型，但需注意不同模型评判结果可能存在差异。","https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fissues\u002F305",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},32205,"运行大模型（如 InternVL）多卡推理时出现 OOM 错误或只使用单卡怎么办？","如果在代码中使用了 `device_map='auto'` 加载模型，请移除手动将模型移动到 CUDA 的代码（即删除 `self.model = self.model.cuda()` 或 `self.model.to(device)`）。让 HuggingFace Accelerate 或 Transformers 自动管理设备映射即可解决多卡负载不均和显存溢出问题。","https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fissues\u002F224",{"id":151,"question_zh":152,"answer_zh":153,"source_url":154},32206,"是否支持通过部署 API 服务的方式评测模型，或支持 NPU（如 Ascend）环境？","支持。可以将模型部署为 OpenAI 格式的 API 服务，然后在配置文件中设置 `api_base` 和 `key`。对于 Ascend NPU 环境，建议安装 `torch-npu` 并使用 `vllm_ascend` 搭建本地 LLM 服务，随后在配置中指定 base url 和 api_key 进行评测，避免直接使用 `pip install -e .` 可能遇到的兼容性问题。","https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fissues\u002F1158",{"id":156,"question_zh":157,"answer_zh":158,"source_url":159},32207,"安装 flash-attn2 时编译 wheel 速度极慢或卡住如何解决？","这通常是由于网络问题导致无法拉取预编译的 wheel 包。解决方案包括：1. 前往 flash-attention 官方仓库 Releases 页面手动下载对应版本的 wheel 文件上传至服务器安装；2. 配置服务器网络以访问 GitHub；3. 
添加 `--verbose` 参数查看更详细的报错信息以确认是否为网络问题。","https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fissues\u002F532",{"id":161,"question_zh":162,"answer_zh":163,"source_url":164},32208,"运行 3DSRBench 评测时报错 'No module named vlmeval.dataset.utils.sr3d' 怎么办？","该问题是由于缺少 `sr3d` 模块导致的，已在后续版本（PR #1428）中修复。请确保更新 VLMEvalKit 到最新版本。如果仍报错，检查 `vlmeval\u002Fdataset\u002Fimage_mcq.py` 中 `_3DSRBench` 类的实现，确保相关依赖已正确安装或代码逻辑已更新。","https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fissues\u002F1304",{"id":166,"question_zh":167,"answer_zh":168,"source_url":169},32209,"评测 MMBench-Video 视频数据集时报错 'string indices must be integers' 是什么原因？","该错误通常发生在 `build_prompt_nopack` 函数中，表明传入的 `line` 参数类型异常（变成了字符串而非字典）。这可能是特定模型（如某些非标准 Video LLM）未正确实现 `build_custom_prompt` 而回退到了默认处理逻辑。建议检查模型适配代码，或参考 InternVL2 等正常工作的模型，确保其走自定义的 prompt 构建流程。","https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fissues\u002F530",[171,175,180,184],{"id":172,"version":173,"summary_zh":76,"released_at":174},246272,"v0.3rc1","2025-06-21T17:45:35",{"id":176,"version":177,"summary_zh":178,"released_at":179},246273,"v0.2","起草一个新的发布版本，供内部评估使用。","2025-03-24T08:44:44",{"id":181,"version":182,"summary_zh":76,"released_at":183},246274,"v0.2rc1","2024-06-29T15:46:48",{"id":185,"version":186,"summary_zh":187,"released_at":188},246275,"v0.1","## 变更内容\n* [功能] 支持 multi_generate，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F1 中实现\n* [工具] 小幅更新 1205，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F3 中完成\n* TranCore-M 20231208，由 @PCIResearch 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F8 中发布\n* [结果] 更新 TransCore 结果，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F9 中完成\n* COREMM 评估基准，由 @youngfly11 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F6 中提出\n* 添加 MMVet，由 @llllIlllll 在 
https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F7 中完成\n* [文档] 更新 README 12.11，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F10 中完成\n* 更新 mmvet_eval，由 @llllIlllll 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F11 中完成\n* [功能] 添加 run.py 并简化评估流程，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F12 中实现\n* [文档] 优化 README，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F13 中完成\n* [修复] 修复 README，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F14 中完成\n* 添加数据集 MD5 校验以确保完整性，由 @FangXinyu-0913 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F15 中实现\n* [功能] 支持两种视觉 API，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F19 中实现\n* 添加 COCO 数据集，由 @FangXinyu-0913 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F16 中完成\n* [修复] 修复 1221，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F21 中完成\n* [功能] 提升 API 评估的鲁棒性，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F22 中实现\n* [重构] 重构自定义提示并修复 mPLUG-Owl2 准确率，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F23 中完成\n* [数据集] VQA 数据集，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F25 中提供\n* [修复] 修复 bug，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F26 中完成\n* 添加 MMMU 数据集，由 @llllIlllll 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F18 中完成\n* 添加 QwenVLPlus API，由 @llllIlllll 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F27 中实现\n* [结果] 更新 MMMU 准确率，由 @kennymckormick 在 
https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F30 中完成\n* [功能] 支持 `LLaVA_XTuner` 模型，由 @LZHgrla 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F17 中实现\n* [结果] 更新 XTuner 性能，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F31 中完成\n* [结果] 更新 COCO 字幕生成结果，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F35 中完成\n* 添加 ChartQA 数据集，由 @FangXinyu-0913 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F28 中完成\n* [功能]: 添加 ScienceQA，由 @YuanLiuuuuuu 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F24 中实现\n* [数据集] MathVista 数据集，由 @llllIlllll 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F29 中提供\n* [数据集] HallusionBench，由 @kennymckormick 在 https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F38 中推出\n* 添加 sharedcaptioner 和 cogvlm","2024-01-22T05:52:49"]