[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-BradyFU--Awesome-Multimodal-Large-Language-Models":3,"tool-BradyFU--Awesome-Multimodal-Large-Language-Models":65},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159636,2,"2026-04-17T23:33:34",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},8553,"spec-kit","github\u002Fspec-kit","Spec Kit 是一款专为提升软件开发效率而设计的开源工具包，旨在帮助团队快速落地“规格驱动开发”（Spec-Driven Development）模式。传统开发中，需求文档往往与代码实现脱节，导致沟通成本高且结果不可控；而 Spec Kit 通过将规格说明书转化为可执行的指令，让 AI 直接依据明确的业务场景生成高质量代码，从而减少从零开始的随意编码，确保产出结果的可预测性。\n\n该工具特别适合希望利用 AI 辅助编程的开发者、技术负责人及初创团队。无论是启动全新项目还是在现有工程中引入规范化流程，用户只需通过简单的命令行操作，即可初始化项目并集成主流的 AI 编程助手。其核心技术亮点在于“规格即代码”的理念，支持社区扩展与预设模板，允许用户根据特定技术栈定制开发流程。此外，Spec Kit 强调官方维护的安全性，提供稳定的版本管理，帮助开发者在享受 AI 红利的同时，依然牢牢掌握架构设计的主动权，真正实现从“凭感觉写代码”到“按规格建系统”的转变。",88749,"2026-04-17T09:48:14",[15,26,14,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":10,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[26,51,52,53,14,54,15,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,51,54],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":81,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":81,"stars":84,"forks":85,"last_commit_at":86,"license":81,"difficulty_score":62,"env_os":87,"env_gpu":88,"env_ram":88,"env_deps":89,"category_tags":92,"github_topics":93,"view_count":10,"oss_zip_url":81,"oss_zip_packed_at":81,"status":16,"created_at":107,"updated_at":108,"faqs":109,"releases":140},8691,"BradyFU\u002FAwesome-Multimodal-Large-Language-Models","Awesome-Multimodal-Large-Language-Models",":sparkles::sparkles:Latest Advances on Multimodal Large Language Models","Awesome-Multimodal-Large-Language-Models 是一个专注于多模态大语言模型（MLLMs）的开源资源汇总平台，由南京大学 MiG 团队维护。它系统性地整理了该领域最新的学术论文、综述报告、基准测试数据集以及开源项目代码，旨在解决研究人员和开发者在快速迭代的 AI 浪潮中难以高效获取高质量资料、缺乏统一评估标准等痛点。\n\n无论是希望深入了解行业前沿的研究学者，还是正在寻找可靠评测工具或基线模型的算法工程师，都能在这里找到极具价值的参考。其核心亮点在于不仅收录了关于多模态理解与生成的权威综述，还推出了具有影响力的 VITA 系列模型（支持实时视听交互及百万级上下文长度）和 MME 系列评测基准（涵盖视频分析、高分辨率真实场景等复杂任务）。这些成果为社区提供了从理论调研到实际验证的一站式解决方案，帮助用户更便捷地追踪技术趋势、复现先进算法并推动多模态智能的实际应用落地。","# Awesome-Multimodal-Large-Language-Models\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBradyFU_Awesome-Multimodal-Large-Language-Models_readme_161dcbf89c15.jpg\" width=\"100%\" height=\"100%\">\n\u003C\u002Fp>\n\n## ✨ Highlights of NJU-MiG\n\n> 🔥🔥 **Surveys of MLLMs**  |  **[💬 WeChat (MLLM微信交流群)](.\u002Fimages\u002Fwechat-group.png)**\n\n- 🌟 **MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs**  \narXiv 2025, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.15296.pdf), [Project](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FBenchmarks) \n\n- 🌟 **A Survey of Unified Multimodal Understanding and Generation: Advances and Challenges**  \narXiv 2025, [Paper](https:\u002F\u002Fwww.techrxiv.org\u002Fdoi\u002Fpdf\u002F10.36227\u002Ftechrxiv.176289261.16802577), [Project](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FUnified) \n\n- **A Survey on Multimodal Large Language Models**  \nNSR 2024, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13549.pdf), 
[Project](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models)\n\n\n---\n\n\n> 🔥🔥 **VITA Series Omni MLLMs** | **[💬 WeChat (VITA微信交流群)](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA\u002Fblob\u002Fmain\u002Fasset\u002Fwechat-group.jpg)**\n\n- **VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction**  \nNeurIPS 2025 Highlight, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.01957.pdf), [Project](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA)\n\n- **VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting**  \narXiv 2025, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.21817.pdf), [Project](https:\u002F\u002Flxysl.github.io\u002FVITA-E\u002F)\n\n- **VITA: Towards Open-Source Interactive Omni Multimodal LLM**  \narXiv 2024, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.05211.pdf), [Project](https:\u002F\u002Fvita-home.github.io\u002F)\n\n- **Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy**  \narXiv 2025, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.05177.pdf), [Project](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FLong-VITA)\n\n- **VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model**  \nNeurIPS 2025, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.03739.pdf), [Project](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA-Audio)\n\n\n---\n\n\n> 🔥🔥 **MME Series MLLM Benchmarks**\n\n- 🔥 **Video-MME-v2: Towards the Next Stage in Video Understanding Evaluation**\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBradyFU_Awesome-Multimodal-Large-Language-Models_readme_88797976d3c8.png\" width=\"100%\" height=\"100%\">\n\u003C\u002Fp>\n\n\u003Cfont size=7>\u003Cdiv align='center' > [[🍎 Project Page](https:\u002F\u002Fvideo-mme-v2.netlify.app\u002F)] [[📖 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2604.05015)] [[🤗 Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMME-Benchmarks\u002FVideo-MME-v2)] [[🏆 Leaderboard](https:\u002F\u002Fvideo-mme-v2.netlify.app\u002F#leaderboard)]  \u003C\u002Fdiv>\u003C\u002Ffont>\n\n- 🌟 **MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs**  \narXiv 2025, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.15296.pdf), [Project](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FBenchmarks)\n\n- **MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models**  \nNeurIPS 2025 DB Highlight, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13394.pdf), [Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flmms-lab\u002FMME), [Eval Tool](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Fblob\u002FEvaluation\u002Ftools\u002Feval_tool.zip), [✒️ Citation](.\u002Fimages\u002Fbib_mme.txt)\n\n- **Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis**  \nCVPR 2025, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.21075.pdf), [Project](https:\u002F\u002Fvideo-mme.github.io\u002F), [Dataset](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME?tab=readme-ov-file#-dataset)\n\n- **MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?**  \nICLR 2025, 
[Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.13257.pdf), [Project](https:\u002F\u002Fmme-realworld.github.io\u002F), [Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyifanzhang114\u002FMME-RealWorld)\n\n\n\n\n\n\n---\n\n\u003Cfont size=5>\u003Ccenter>\u003Cb> Table of Contents \u003C\u002Fb> \u003C\u002Fcenter>\u003C\u002Ffont>\n- [Awesome Papers](#awesome-papers)\n  - [Multimodal Instruction Tuning (& Latest Works)](#multimodal-instruction-tuning--latest-works)\n  - [Multimodal Hallucination](#multimodal-hallucination)\n  - [Multimodal In-Context Learning](#multimodal-in-context-learning)\n  - [Multimodal Chain-of-Thought](#multimodal-chain-of-thought)\n  - [LLM-Aided Visual Reasoning](#llm-aided-visual-reasoning)\n  - [Foundation Models](#foundation-models)\n  - [Evaluation](#evaluation)\n  - [Multimodal RLHF](#multimodal-rlhf)\n  - [Others](#others)\n- [Awesome Datasets](#awesome-datasets)\n  - [Datasets of Pre-Training for Alignment](#datasets-of-pre-training-for-alignment)\n  - [Datasets of Multimodal Instruction Tuning](#datasets-of-multimodal-instruction-tuning)\n  - [Datasets of In-Context Learning](#datasets-of-in-context-learning)\n  - [Datasets of Multimodal Chain-of-Thought](#datasets-of-multimodal-chain-of-thought)\n  - [Datasets of Multimodal RLHF](#datasets-of-multimodal-rlhf)\n  - [Benchmarks for Evaluation](#benchmarks-for-evaluation)\n  - [Others](#others-1)\n---\n\n# Awesome Papers\n\n## Multimodal Instruction Tuning (& Latest Works)\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMME-Benchmarks\u002FVideo-MME-v2.svg?style=social&label=Star) \u003Cbr> [**Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2604.05015) \u003Cbr> | arXiv | 2026-04-06 | [Github](https:\u002F\u002Fgithub.com\u002FMME-Benchmarks\u002FVideo-MME-v2) | [Demo](https:\u002F\u002Fvideo-mme-v2.netlify.app\u002F) |\n| [**Introducing Muse Spark: Scaling Towards Personal Superintelligence**](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fintroducing-muse-spark-msl\u002F) | Blog | 2026-04-08 | - | [Demo](https:\u002F\u002Fmeta.ai\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FVITA-QinYu.svg?style=social&label=Star) \u003Cbr> [**VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing**](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA-QinYu) \u003Cbr> | arXiv | 2026-04-03 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA-QinYu) | Local Demo |\n| [**Gemma 4: Byte for byte, the most capable open models**](https:\u002F\u002Fdeepmind.google\u002Fmodels\u002Fgemma\u002Fgemma-4\u002F) | Blog | 2026-04-02 | - | [Demo](https:\u002F\u002Faistudio.google.com\u002Fprompts\u002Fnew_chat?model=gemma-4-31b-it&utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=) |\n| [**Qwen3.5-Omni: Scaling Up, Toward Native Omni-Modal AGI**](https:\u002F\u002Fqwen.ai\u002Fblog?id=qwen3.5-omni) | Blog | 2026-03-30 | - | [Demo](https:\u002F\u002Fchat.qwen.ai\u002F?spm=a2ty_o06.30285417.0.0.6d26c921GDrWrb) |\n| [**Xiaomi MiMo-V2-Omni**](https:\u002F\u002Fmimo.xiaomi.com\u002Fmimo-v2-omni) | Blog | 2026-03-18 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL-U.svg?style=social&label=Star) \u003Cbr> [**InternVL-U: Democratizing Unified 
Multimodal Models for Understanding, Reasoning, Generation and Editing**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2603.09877) \u003Cbr> | arXiv | 2026-03-10 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL-U) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FOmni-Diffusion.svg?style=social&label=Star) \u003Cbr> [**Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2603.06577) \u003Cbr> | arXiv | 2026-03-06 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FOmni-Diffusion) | - |\n| [**Beyond Language Modeling: An Exploration of Multimodal Pretraining**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2603.03276) | arXiv | 2026-03-03 | - | - |\n| [**Gemini 3.1 Pro: A smarter model for your most complex tasks**](https:\u002F\u002Fblog.google\u002Finnovation-and-ai\u002Fmodels-and-research\u002Fgemini-models\u002Fgemini-3-1-pro\u002F) | Blog | 2026-02-19 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3.5.svg?style=social&label=Star) \u003Cbr> [**Qwen3.5: Towards Native Multimodal Agents**](https:\u002F\u002Fqwen.ai\u002Fblog?id=qwen3.5) \u003Cbr> | Blog | 2026-02-16 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3.5) | [Demo](https:\u002F\u002Fchat.qwen.ai\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenBMB\u002FMiniCPM-o.svg?style=social&label=Star) \u003Cbr> [**MiniCPM-o 4.5**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-o-4_5) \u003Cbr> | Blog | 2026-02-06 | [Github](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-o) | [Demo](https:\u002F\u002Fminicpm-omni.openbmb.cn\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-OCR-2.svg?style=social&label=Star) \u003Cbr> [**DeepSeek-OCR 2: Visual Causal Flow**](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-OCR-2\u002Fblob\u002Fmain\u002FDeepSeek_OCR2_paper.pdf) \u003Cbr> | DeepSeek | 2026-01-27 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-OCR-2) | - |\n| [**Seed1.8 Model Card: Towards Generalized Real-World Agency**](https:\u002F\u002Flf3-static.bytednsdoc.com\u002Fobj\u002Feden-cn\u002Flapzild-tss\u002FljhwZthlaukjlkulzlp\u002Fresearch\u002FSeed-1.8-Modelcard.pdf) | Bytedance Seed | 2025-12-18 | - | - |\n| [**Introducing GPT-5.2**](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-gpt-5-2\u002F) | OpenAI | 2025-12-11 | - | - |\n| [**Introducing Mistral 3**](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fmistral-3) | Blog | 2025-12-02 | [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fmistralai\u002Fmistral-large-3) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3-VL.svg?style=social&label=Star) \u003Cbr> [**Qwen3-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.21631) \u003Cbr> | arXiv | 2025-11-26 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen3-VL-Demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu3.5.svg?style=social&label=Star) \u003Cbr> [**Emu3.5: Native Multimodal Models are World Learners**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.26583) \u003Cbr> | arXiv | 2025-10-30 | 
[Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu3.5) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencent\u002FVITA.svg?style=social&label=Star) \u003Cbr> [**VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.21817.pdf) \u003Cbr> | arXiv | 2025-10-21 | [Github](https:\u002F\u002Fgithub.com\u002FTencent\u002FVITA\u002Ftree\u002FVITA-E) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-OCR.svg?style=social&label=Star) \u003Cbr> [**DeepSeek-OCR: Contexts Optical Compression**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.18234) \u003Cbr> | arXiv | 2025-10-21 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-OCR) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FOmniVinci.svg?style=social&label=Star) \u003Cbr> [**OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.15870) \u003Cbr> | arXiv | 2025-10-17 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FOmniVinci) | - |\n| [**NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.13721) | arXiv | 2025-10-16 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSenseTime-FVG\u002FInteractiveOmni.svg?style=social&label=Star) \u003Cbr> [**InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.13747) | arXiv | 2025-10-15 | [Github](https:\u002F\u002Fgithub.com\u002FSenseTime-FVG\u002FInteractiveOmni) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencent\u002FVITA.svg?style=social&label=Star) \u003Cbr> [**VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.09607) \u003Cbr> | arXiv | 2025-10-10 | [Github](https:\u002F\u002Fgithub.com\u002FTencent\u002FVITA\u002Ftree\u002FVITA-VLA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEvolvingLMMs-Lab\u002FLLaVA-OneVision-1.5.svg?style=social&label=Star) \u003Cbr> [**LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.23661) \u003Cbr> | arXiv | 2025-10-09 | [Github](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002FLLaVA-OneVision-1.5) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flmms-lab\u002FLLaVA-OneVision-1.5) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3-Omni.svg?style=social&label=Star) \u003Cbr> [**Qwen3-Omni Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.17765) \u003Cbr> | arXiv | 2025-09-22 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-Omni) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen3-Omni-Demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.18265) \u003Cbr> | arXiv | 2025-08-27 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | 
[Demo](https:\u002F\u002Fchat.intern-ai.org.cn\u002F) |\n| **MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and Video Understanding on Your Phone** | - | 2025-08-26 | [Github](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-o) | [Demo](http:\u002F\u002F101.126.42.235:30910\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfzhang114\u002FThyme.svg?style=social&label=Star) \u003Cbr> [**Thyme: Think Beyond Images**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.11630) \u003Cbr> | arXiv | 2025-08-18 | [Github](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme) | [Demo](https:\u002F\u002Fthyme-vl.github.io\u002F) |\n| [**Introducing GPT-5**](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-gpt-5\u002F) | OpenAI | 2025-08-07 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frednote-hilab\u002Fdots.vlm1.svg?style=social&label=Star) \u003Cbr> [**dots.vlm1**](https:\u002F\u002Fgithub.com\u002Frednote-hilab\u002Fdots.vlm1) \u003Cbr> | rednote-hilab | 2025-08-06 | [Github](https:\u002F\u002Fgithub.com\u002Frednote-hilab\u002Fdots.vlm1) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Frednote-hilab\u002Fdots-vlm1-demo) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FGLM-4.1V-Thinking.svg?style=social&label=Star) \u003Cbr> [**Step3: Cost-Effective Multimodal Intelligence**](https:\u002F\u002Fstepfun.ai\u002Fresearch\u002Fstep3) \u003Cbr> | StepFun | 2025-07-31 | [Github](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStep3) | [Demo](https:\u002F\u002Fstepfun.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FGLM-4.1V-Thinking.svg?style=social&label=Star) \u003Cbr> [**GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.01006) \u003Cbr> | arXiv | 2025-07-02 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FGLM-4.1V-Thinking) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTHUDM\u002FGLM-4.1V-9B-Thinking-API-Demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FlxtGH\u002FDenseWorld-1M.svg?style=social&label=Star) \u003Cbr> [**DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.24102) \u003Cbr> | arXiv | 2025-06-30 | [Github](https:\u002F\u002Fgithub.com\u002FlxtGH\u002FDenseWorld-1M) | - |\n| [**Qwen VLo: From \"Understanding\" the World to \"Depicting\" It**](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen-vlo\u002F) | Qwen | 2025-06-26 | - | [Demo](https:\u002F\u002Fchat.qwen.ai\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEvolvingLMMs-Lab\u002Fmultimodal-search-r1.svg?style=social&label=Star) \u003Cbr> [**MMSearch-R1: Incentivizing LMMs to Search**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.20670) \u003Cbr> | arXiv | 2025-06-25 | [Github](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Fmultimodal-search-r1) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002FShow-o.svg?style=social&label=Star) \u003Cbr> [**Show-o2: Improved Native Unified Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.15564) \u003Cbr> | arXiv | 2025-06-18 | [Github](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FShow-o) | - |\n| [**Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and 
Next Generation Agentic Capabilities**](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_v2_5_report.pdf) | Google | 2025-06-17 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fegolife-ai\u002FEgo-R1.svg?style=social&label=Star) \u003Cbr> [**Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.13654) \u003Cbr> | arXiv | 2025-06-16 | [Github](https:\u002F\u002Fgithub.com\u002Fegolife-ai\u002FEgo-R1) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FXiaomiMiMo\u002FMiMo-VL.svg?style=social&label=Star) \u003Cbr> [**MiMo-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.03569) \u003Cbr> | arXiv | 2025-06-04 | [Github](https:\u002F\u002Fgithub.com\u002FXiaomiMiMo\u002FMiMo-VL) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwusize\u002FOpenUni.svg?style=social&label=Star) \u003Cbr> [**OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.23661) \u003Cbr> | arXiv | 2025-05-29 | [Github](https:\u002F\u002Fgithub.com\u002Fwusize\u002FOpenUni) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance-seed\u002FBAGEL.svg?style=social&label=Star) \u003Cbr> [**Emerging Properties in Unified Multimodal Pretraining**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.14683) \u003Cbr> | arXiv | 2025-05-23 | [Github](https:\u002F\u002Fgithub.com\u002Fbytedance-seed\u002FBAGEL) | [Demo](https:\u002F\u002Fdemo.bagel-ai.org\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGen-Verse\u002FMMaDA.svg?style=social&label=Star) \u003Cbr> [**MMaDA: Multimodal Large Diffusion Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.15809) \u003Cbr> | arXiv | 2025-05-21 | [Github](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FMMaDA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FGen-Verse\u002FMMaDA) |\n| [**UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.14682) | arXiv | 2025-05-20 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJiuhaiChen\u002FBLIP3o.svg?style=social&label=Star) \u003Cbr> [**BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.09568) \u003Cbr> | arXiv | 2025-05-14 | [Github](https:\u002F\u002Fgithub.com\u002FJiuhaiChen\u002FBLIP3o) | Local Demo |\n| [**Seed1.5-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.07062) | arXiv | 2025-05-11 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHITsz-TMG\u002FAwesome-Large-Multimodal-Reasoning-Models.svg?style=social&label=Star) \u003Cbr> [**Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.04921) \u003Cbr> | arXiv | 2025-05-08 | [Github](https:\u002F\u002Fgithub.com\u002FHITsz-TMG\u002FAwesome-Large-Multimodal-Reasoning-Models) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FVITA-Audio.svg?style=social&label=Star) \u003Cbr> [**VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.03739) \u003Cbr> 
| arXiv | 2025-05-06 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA-Audio) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-R1V.svg?style=social&label=Star) \u003Cbr> [**Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.16656) \u003Cbr> | arXiv | 2025-04-23 | [Github](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-R1V) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FEAGLE.svg?style=social&label=Star) \u003Cbr> [**Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.15271) \u003Cbr> | arXiv | 2025-04-21 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FEAGLE) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fquicksviewer\u002Fquicksviewer.svg?style=social&label=Star) \u003Cbr> [**An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.15270) \u003Cbr> | arXiv | 2025-04-21 | [Github](https:\u002F\u002Fgithub.com\u002Fquicksviewer\u002Fquicksviewer) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10479) \u003Cbr> | arXiv | 2025-04-14 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Finternvl.opengvlab.com\u002F) |\n| [**Introducing GPT-4.1 in the API**](https:\u002F\u002Fopenai.com\u002Findex\u002Fgpt-4-1\u002F) | OpenAI | 2025-04-14 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMoonshotAI\u002FKimi-VL.svg?style=social&label=Star) \u003Cbr> [**Kimi-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.07491) \u003Cbr> | arXiv | 2025-04-10 | [Github](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmoonshotai\u002FKimi-VL-A3B-Thinking) |\n| [**The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation**](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fllama-4-multimodal-intelligence\u002F) | Meta | 2025-04-05 | [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fmeta-llama\u002Fllama-4-67f0c30d9fe03840bc9d0164) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2.5-Omni.svg?style=social&label=Star) \u003Cbr> [**Qwen2.5-Omni Technical Report**](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Omni\u002Fblob\u002Fmain\u002Fassets\u002FQwen2.5_Omni.pdf) \u003Cbr> | Qwen | 2025-03-26 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Omni) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen2.5-Omni-7B-Demo) |\n| [**Addendum to GPT-4o System Card: Native image generation**](https:\u002F\u002Fcdn.openai.com\u002F11998be9-5319-4302-bfbf-1167e093f1fb\u002FNative_Image_Generation_System_Card.pdf) | OpenAI | 2025-03-25 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FSparrow.svg?style=social&label=Star) \u003Cbr> [**Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.19951) \u003Cbr> | 
arXiv | 2025-03-17 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FSparrow) | - |\n| [**Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.01879) | arXiv | 2025-03-07 | - | - |\n| [**Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.01743) | arXiv | 2025-03-03 | [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FPhi-4-multimodal-instruct) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002Fphi-4-multimodal) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FLong-VITA.svg?style=social&label=Star) \u003Cbr> [**Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuray**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.05177) \u003Cbr> | arXiv | 2025-02-19 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FLong-VITA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2.5-VL.svg?style=social&label=Star) \u003Cbr> [**Qwen2.5-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.13923) \u003Cbr> | arXiv | 2025-02-19 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen2.5-VL) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaichuan-inc\u002FBaichuan-Omni-1.5.svg?style=social&label=Star) \u003Cbr> [**Baichuan-Omni-1.5 Technical Report**](https:\u002F\u002Fgithub.com\u002Fbaichuan-inc\u002FBaichuan-Omni-1.5\u002Fblob\u002Fmain\u002Fbaichuan_omni_1_5.pdf) \u003Cbr> | Tech Report | 2025-01-26 | [Github](https:\u002F\u002Fgithub.com\u002Fbaichuan-inc\u002FBaichuan-Omni-1.5) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FLlamaV-o1.svg?style=social&label=Star) \u003Cbr> [**LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.06186) \u003Cbr> | arXiv | 2025-01-10 | [Github](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FLlamaV-o1) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FVITA.svg?style=social&label=Star) \u003Cbr> [**VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.01957) \u003Cbr> | arXiv | 2025-01-03 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2-VL.svg?style=social&label=Star) \u003Cbr> [**QVQ: To See the World with Wisdom**](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqvq-72b-preview\u002F) \u003Cbr> | Qwen | 2024-12-25 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2-VL) | [Demo](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqvq-72b-preview\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-VL2.svg?style=social&label=Star) \u003Cbr> [**DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.10302) \u003Cbr> | arXiv | 2024-12-13 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-VL2) | - |\n| [**Apollo: An Exploration of Video Understanding in Large Multimodal 
Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.10360) | arXiv | 2024-12-13 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.09596) \u003Cbr> | arXiv | 2024-12-12 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer\u002Ftree\u002Fmain\u002FInternLM-XComposer-2.5-OmniLive) | Local Demo |\n| [**StreamChat: Chatting with Streaming Video**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.08646) | arXiv | 2024-12-11 | Coming soon | - |\n| [**CompCap: Improving Multimodal Large Language Models with Composite Captions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.05243) | arXiv | 2024-12-06 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgls0425\u002FLinVT.svg?style=social&label=Star) \u003Cbr> [**LinVT: Empower Your Image-level Large Language Model to Understand Videos**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.05185) \u003Cbr> | arXiv | 2024-12-06 | [Github](https:\u002F\u002Fgithub.com\u002Fgls0425\u002FLinVT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.05271) \u003Cbr> | arXiv | 2024-12-06 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Finternvl.opengvlab.com) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FVILA.svg?style=social&label=Star) \u003Cbr> [**NVILA: Efficient Frontier Visual Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.04468) \u003Cbr> | arXiv | 2024-12-05 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVILA) | [Demo](https:\u002F\u002Fvila.mit.edu) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Finst-it\u002Finst-it.svg?style=social&label=Star) \u003Cbr> [**Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03565) \u003Cbr> | arXiv | 2024-12-04 | [Github](https:\u002F\u002Fgithub.com\u002Finst-it\u002Finst-it) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTimeMarker-LLM\u002FTimeMarker.svg?style=social&label=Star) \u003Cbr> [**TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.18211) \u003Cbr> | arXiv | 2024-11-27 | [Github](https:\u002F\u002Fgithub.com\u002FTimeMarker-LLM\u002FTimeMarker\u002F) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIDEA-Research\u002FChatRex.svg?style=social&label=Star) \u003Cbr> [**ChatRex: Taming Multimodal LLM for Joint Perception and Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.18363) \u003Cbr> | arXiv | 2024-11-27 | [Github](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FChatRex) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FLongVU.svg?style=social&label=Star) \u003Cbr> [**LongVU: Spatiotemporal Adaptive Compression for Long Video-Language 
Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.17434) \u003Cbr> | arXiv | 2024-10-22 | [Github](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FLongVU) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVision-CAIR\u002FLongVU) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshikiw\u002FModality-Integration-Rate.svg?style=social&label=Star) \u003Cbr> [**Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.07167) \u003Cbr> | arXiv | 2024-10-09 | [Github](https:\u002F\u002Fgithub.com\u002Fshikiw\u002FModality-Integration-Rate) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frese1f\u002Faurora.svg?style=social&label=Star) \u003Cbr> [**AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.03051) \u003Cbr> | arXiv | 2024-10-04 | [Github](https:\u002F\u002Fgithub.com\u002Frese1f\u002Faurora) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Femova-ollm\u002FEMOVA.svg?style=social&label=Star) \u003Cbr> [**EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.18042) \u003Cbr> | CVPR | 2024-09-26 | [Github](https:\u002F\u002Fgithub.com\u002Femova-ollm\u002FEMOVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FEmova-ollm\u002FEMOVA-demo) | \n| [**Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.17146) | arXiv | 2024-09-25 | [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fallenai\u002FMolmoE-1B-0924) | [Demo](https:\u002F\u002Fmolmo.allenai.org) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2-VL.svg?style=social&label=Star) \u003Cbr> [**Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.12191) \u003Cbr> | arXiv | 2024-09-18 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen2-VL) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIDEA-FinAI\u002FChartMoE.svg?style=social&label=Star) \u003Cbr> [**ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.03277) \u003Cbr> | ICLR | 2024-09-05 | [Github](https:\u002F\u002Fgithub.com\u002FIDEA-FinAI\u002FChartMoE) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFreedomIntelligence\u002FLongLLaVA.svg?style=social&label=Star) \u003Cbr> [**LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.02889) \u003Cbr> | arXiv | 2024-09-04 | [Github](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FLongLLaVA) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FEagle.svg?style=social&label=Star) \u003Cbr> [**EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15998) \u003Cbr> | arXiv | 2024-08-28 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FEagle) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FNVEagle\u002FEagle-X5-13B-Chat) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshufangxun\u002FLLaVA-MoD.svg?style=social&label=Star) \u003Cbr> [**LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15881) \u003Cbr> | arXiv | 2024-08-28 | [Github](https:\u002F\u002Fgithub.com\u002Fshufangxun\u002FLLaVA-MoD) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-Owl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models**](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2408.04840) \u003Cbr> | arXiv | 2024-08-09 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-Owl) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FVITA.svg?style=social&label=Star) \u003Cbr> [**VITA: Towards Open-Source Interactive Omni Multimodal LLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.05211) \u003Cbr> | arXiv | 2024-08-09 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLaVA-VL\u002FLLaVA-NeXT.svg?style=social&label=Star) \u003Cbr> [**LLaVA-OneVision: Easy Visual Task Transfer**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.03326) \u003Cbr> | arXiv | 2024-08-06 | [Github](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-NeXT) | [Demo](https:\u002F\u002Fllava-onevision.lmms-lab.com) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenBMB\u002FMiniCPM-V.svg?style=social&label=Star) \u003Cbr> [**MiniCPM-V: A GPT-4V Level MLLM on Your Phone**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.01800) \u003Cbr> | arXiv | 2024-08-03 | [Github](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-V) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopenbmb\u002FMiniCPM-Llama3-V-2_5) |\n| [**VILA^2: VILA Augmented VILA**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.17453) | arXiv | 2024-07-24 | - | - |\n| [**SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.15841) | arXiv | 2024-07-22 | - | - |\n| [**EVLM: An Efficient Vision-Language Model for Visual Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.14177) | arXiv | 2024-07-19 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjiyt17\u002FIDA-VLM.svg?style=social&label=Star) \u003Cbr> [**IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.07577) \u003Cbr> | arXiv | 2024-07-10 | [Github](https:\u002F\u002Fgithub.com\u002Fjiyt17\u002FIDA-VLM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.03320) \u003Cbr> | arXiv | 2024-07-03 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer) | [Demo](https:\u002F\u002Fopenxlab.org.cn\u002Fapps\u002Fdetail\u002FWillowBreeze\u002FInternLM-XComposer) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FlxtGH\u002FOMG-Seg.svg?style=social&label=Star) \u003Cbr> [**OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and 
Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.19389) \u003Cbr> | arXiv | 2024-06-27 | [Github](https:\u002F\u002Fgithub.com\u002FlxtGH\u002FOMG-Seg) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZZZHANG-jx\u002FDocKylin.svg?style=social&label=Star) \u003Cbr> [**DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.19101) \u003Cbr> | AAAI | 2024-06-27 | [Github](https:\u002F\u002Fgithub.com\u002FZZZHANG-jx\u002FDocKylin) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcambrian-mllm\u002Fcambrian.svg?style=social&label=Star) \u003Cbr> [**Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.16860) \u003Cbr> | arXiv | 2024-06-24 | [Github](https:\u002F\u002Fgithub.com\u002Fcambrian-mllm\u002Fcambrian) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEvolvingLMMs-Lab\u002FLongVA.svg?style=social&label=Star) \u003Cbr> [**Long Context Transfer from Language to Vision**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.16852) \u003Cbr> | arXiv | 2024-06-24 | [Github](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002FLongVA) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FSALMONN.svg?style=social&label=Star) \u003Cbr> [**video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.15704) \u003Cbr> | ICML | 2024-06-22 | [Github](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FSALMONN) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByungKwanLee\u002FTroL.svg?style=social&label=Star) \u003Cbr> [**TroL: Traversal of Layers for Large Language and Vision Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.12246) \u003Cbr> | EMNLP | 2024-06-18 | [Github](https:\u002F\u002Fgithub.com\u002FByungKwanLee\u002FTroL) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEVE.svg?style=social&label=Star) \u003Cbr> [**Unveiling Encoder-Free Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.11832) \u003Cbr> | arXiv | 2024-06-17 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEVE) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002FVideoLLM-online.svg?style=social&label=Star) \u003Cbr> [**VideoLLM-online: Online Video Large Language Model for Streaming Video**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.11816) \u003Cbr> | CVPR | 2024-06-17 | [Github](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FVideoLLM-online) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwentaoyuan\u002FRoboPoint.svg?style=social&label=Star) \u003Cbr> [**RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.10721) \u003Cbr> | CoRL | 2024-06-15 | [Github](https:\u002F\u002Fgithub.com\u002Fwentaoyuan\u002FRoboPoint) | [Demo](https:\u002F\u002F007e03d34429a2517b.gradio.live\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwlin-at\u002FCaD-VI) \u003Cbr> [**Comparison Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09240) \u003Cbr> | arXiv | 2024-06-13 | 
[Github](https:\u002F\u002Fwlin-at.github.io\u002Fcad_vi) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfzhang114\u002FSliME.svg?style=social&label=Star) \u003Cbr> [**Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.08487) \u003Cbr> | arXiv | 2024-06-12 | [Github](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FSliME) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideoLLaMA2.svg?style=social&label=Star) \u003Cbr> [**VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.07476) \u003Cbr> | arXiv | 2024-06-11 | [Github](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIDC-AI\u002FParrot.svg?style=social&label=Star) \u003Cbr> [**Parrot: Multilingual Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.02539) \u003Cbr> | arXiv | 2024-06-04 | [Github](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FParrot) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIDC-AI\u002FOvis.svg?style=social&label=Star) \u003Cbr> [**Ovis: Structural Embedding Alignment for Multimodal Large Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.20797) \u003Cbr> | arXiv | 2024-05-31 | [Github](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FOvis\u002F) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgordonhu608\u002FMQT-LLaVA.svg?style=social&label=Star) \u003Cbr> [**Matryoshka Query Transformer for Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.19315) \u003Cbr> | arXiv | 2024-05-29 | [Github](https:\u002F\u002Fgithub.com\u002Fgordonhu608\u002FMQT-LLaVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fgordonhu\u002FMQT-LLaVA) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falibaba\u002Fconv-llava.svg?style=social&label=Star) \u003Cbr> [**ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.15738) \u003Cbr> | arXiv | 2024-05-24 | [Github](https:\u002F\u002Fgithub.com\u002Falibaba\u002Fconv-llava) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByungKwanLee\u002FMeteor.svg?style=social&label=Star) \u003Cbr> [**Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.15574) \u003Cbr> | arXiv | 2024-05-24 | [Github](https:\u002F\u002Fgithub.com\u002FByungKwanLee\u002FMeteor) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FBK-Lee\u002FMeteor) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYifanXu74\u002FLibra.svg?style=social&label=Star) \u003Cbr> [**Libra: Building Decoupled Vision System on Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.10140) \u003Cbr> | ICML | 2024-05-16 | [Github](https:\u002F\u002Fgithub.com\u002FYifanXu74\u002FLibra) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSHI-Labs\u002FCuMo.svg?style=social&label=Star) \u003Cbr> [**CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.05949) \u003Cbr> | arXiv | 2024-05-09 | 
[Github](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FCuMo) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.16821) \u003Cbr> | arXiv | 2024-04-25 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Finternvl.opengvlab.com) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgraphic-design-ai\u002Fgraphist.svg?style=social&label=Star) \u003Cbr> [**Graphic Design with Large Multimodal Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.14368) \u003Cbr> | arXiv | 2024-04-22 | [Github](https:\u002F\u002Fgithub.com\u002Fgraphic-design-ai\u002Fgraphist) | - |\n| [**BRAVE: Broadening the visual encoding of vision-language models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.07204) | ECCV | 2024-04-10 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.06512.pdf) \u003Cbr> | arXiv | 2024-04-09 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FWillow123\u002FInternLM-XComposer) |\n| [**Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.05719.pdf) | arXiv | 2024-04-08 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fboheumd\u002FMA-LMM.svg?style=social&label=Star) \u003Cbr> [**MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.05726.pdf) \u003Cbr> | CVPR | 2024-04-08 | [Github](https:\u002F\u002Fgithub.com\u002Fboheumd\u002FMA-LMM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FVitron.svg?style=social&label=Star) \u003Cbr> [**VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing**](https:\u002F\u002Fhaofei.vip\u002Fdownloads\u002Fpapers\u002FSkywork_Vitron_2024.pdf) \u003Cbr> | NeurIPS | 2024-04-04 | [Github](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FVitron) | Local Demo |\n| [**TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3654674) | ACM TKDD | 2024-03-28 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FLITA.svg?style=social&label=Star) \u003Cbr> [**LITA: Language Instructed Temporal-Localization Assistant**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.19046) | arXiv | 2024-03-27 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FLITA) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FMiniGemini.svg?style=social&label=Star) \u003Cbr> [**Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.18814.pdf) \u003Cbr> | arXiv | 2024-03-27 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FMiniGemini) | [Demo](http:\u002F\u002F103.170.5.190:7860) |\n| [**MM1: 
Methods, Analysis & Insights from Multimodal LLM Pre-training**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.09611.pdf) | arXiv | 2024-03-14 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByungKwanLee\u002FMoAI.svg?style=social&label=Star) \u003Cbr> [**MoAI: Mixture of All Intelligence for Large Language and Vision Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.07508.pdf) \u003Cbr> | arXiv | 2024-03-12 | [Github](https:\u002F\u002Fgithub.com\u002FByungKwanLee\u002FMoAI) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-VL.svg?style=social&label=Star) \u003Cbr> [**DeepSeek-VL: Towards Real-World Vision-Language Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.05525) \u003Cbr> | arXiv | 2024-03-08 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fdeepseek-ai\u002FDeepSeek-VL-7B) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuliang-Liu\u002FMonkey.svg?style=social&label=Star) \u003Cbr> [**TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.04473.pdf) \u003Cbr> | arXiv | 2024-03-07 | [Github](https:\u002F\u002Fgithub.com\u002FYuliang-Liu\u002FMonkey) | [Demo](http:\u002F\u002Fvlrlab-monkey.xyz:7684) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002Fall-seeing.svg?style=social&label=Star) \u003Cbr> [**The All-Seeing Project V2: Towards General Relation Comprehension of the Open World**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.19474.pdf) | arXiv | 2024-02-29 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002Fall-seeing) | - |\n| [**GROUNDHOG: Grounding Large Language Models to Holistic Segmentation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.16846.pdf) | CVPR | 2024-02-26 | Coming soon | Coming soon |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenMOSS\u002FAnyGPT.svg?style=social&label=Star) \u003Cbr> [**AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.12226.pdf) \u003Cbr> | arXiv | 2024-02-19 | [Github](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FAnyGPT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDCDmllm\u002FMomentor.svg?style=social&label=Star) \u003Cbr> [**Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11435.pdf) \u003Cbr> | arXiv | 2024-02-18 | [Github](https:\u002F\u002Fgithub.com\u002FDCDmllm\u002FMomentor) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFreedomIntelligence\u002FALLaVA.svg?style=social&label=Star) \u003Cbr> [**ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11684.pdf) \u003Cbr> | arXiv | 2024-02-18 | [Github](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FALLaVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002FFreedomIntelligence\u002FALLaVA-3B) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByungKwanLee\u002FCoLLaVO-Crayon-Large-Language-and-Vision-mOdel.svg?style=social&label=Star) \u003Cbr> [**CoLLaVO: Crayon Large Language and Vision mOdel**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11248.pdf) \u003Cbr> | arXiv | 2024-02-17 | 
[Github](https:\u002F\u002Fgithub.com\u002FByungKwanLee\u002FCoLLaVO-Crayon-Large-Language-and-Vision-mOdel) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTRI-ML\u002Fprismatic-vlms.svg?style=social&label=Star) \u003Cbr> [**Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.07865) \u003Cbr> | ICML | 2024-02-12 | [Github](https:\u002F\u002Fgithub.com\u002FTRI-ML\u002Fprismatic-vlms) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FCogCoM.svg?style=social&label=Star) \u003Cbr> [**CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.04236.pdf) \u003Cbr> | arXiv | 2024-02-06 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogCoM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMeituan-AutoML\u002FMobileVLM.svg?style=social&label=Star) \u003Cbr> [**MobileVLM V2: Faster and Stronger Baseline for Vision Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.03766.pdf) \u003Cbr> | arXiv | 2024-02-06 | [Github](https:\u002F\u002Fgithub.com\u002FMeituan-AutoML\u002FMobileVLM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWEIYanbin1999\u002FGITA.svg?style=social&label=Star) \u003Cbr> [**GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.02130) \u003Cbr> | NeurIPS | 2024-02-03 | [Github](https:\u002F\u002Fgithub.com\u002FWEIYanbin1999\u002FGITA\u002F) | - |\n| [**Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.17981.pdf) | arXiv | 2024-01-31 | [Coming soon]() | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaotian-liu\u002FLLaVA.svg?style=social&label=Star) \u003Cbr> [**LLaVA-NeXT: Improved reasoning, OCR, and world knowledge**](https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-01-30-llava-next\u002F) | Blog | 2024-01-30 | [Github](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) | [Demo](https:\u002F\u002Fllava.hliu.cc) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FMoE-LLaVA.svg?style=social&label=Star) \u003Cbr> [**MoE-LLaVA: Mixture of Experts for Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.15947.pdf) \u003Cbr> | arXiv | 2024-01-29 | [Github](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FMoE-LLaVA) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.16420.pdf) \u003Cbr> | arXiv | 2024-01-29 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer) | [Demo](https:\u002F\u002Fopenxlab.org.cn\u002Fapps\u002Fdetail\u002FWillowBreeze\u002FInternLM-XComposer) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F01-ai\u002FYi.svg?style=social&label=Star) \u003Cbr> [**Yi-VL**](https:\u002F\u002Fgithub.com\u002F01-ai\u002FYi\u002Ftree\u002Fmain\u002FVL) \u003Cbr> | - | 2024-01-23 | 
[Github](https:\u002F\u002Fgithub.com\u002F01-ai\u002FYi\u002Ftree\u002Fmain\u002FVL) | Local Demo |\n| [**SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.12168.pdf) | arXiv | 2024-01-22 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FChartAst.svg?style=social&label=Star) \u003Cbr> [**ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.02384) \u003Cbr> | ACL | 2024-01-04 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FChartAst) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMeituan-AutoML\u002FMobileVLM.svg?style=social&label=Star) \u003Cbr> [**MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.16886.pdf) \u003Cbr> | arXiv | 2023-12-28 | [Github](https:\u002F\u002Fgithub.com\u002FMeituan-AutoML\u002FMobileVLM) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14238.pdf) \u003Cbr> | CVPR | 2023-12-21 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Finternvl.opengvlab.com) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCircleRadon\u002FOsprey.svg?style=social&label=Star) \u003Cbr> [**Osprey: Pixel Understanding with Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10032.pdf) \u003Cbr> | CVPR | 2023-12-15 | [Github](https:\u002F\u002Fgithub.com\u002FCircleRadon\u002FOsprey) | [Demo](http:\u002F\u002F111.0.123.204:8000\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FCogVLM.svg?style=social&label=Star) \u003Cbr> [**CogAgent: A Visual Language Model for GUI Agents**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08914.pdf) \u003Cbr> | arXiv | 2023-12-14 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVLM) | [Coming soon]() |\n| [**Pixel Aligned Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.09237.pdf) | arXiv | 2023-12-14 | [Coming soon]() | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FVILA.svg?style=social&label=Star) \u003Cbr> [**VILA: On Pre-training for Visual Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07533) \u003Cbr> | CVPR | 2023-12-13 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVILA) | Local Demo |\n| [**See, Say, and Segment: Teaching LMMs to Overcome False Premises**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08366.pdf) | arXiv | 2023-12-13 | [Coming soon]() | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUcas-HaoranWei\u002FVary.svg?style=social&label=Star) \u003Cbr> [**Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.06109.pdf) \u003Cbr> | ECCV | 2023-12-11 | [Github](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FVary) | [Demo](http:\u002F\u002Fregion-31.seetacloud.com:22701\u002F) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkakaobrain\u002Fhoneybee.svg?style=social&label=Star) \u003Cbr> [**Honeybee: Locality-enhanced Projector for Multimodal LLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.06742.pdf) \u003Cbr> | CVPR | 2023-12-11 | [Github](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fhoneybee) | - |\n| [**Gemini: A Family of Highly Capable Multimodal Models**](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_1_report.pdf) | Google | 2023-12-06 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcsuhan\u002FOneLLM.svg?style=social&label=Star) \u003Cbr> [**OneLLM: One Framework to Align All Modalities with Language**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03700.pdf) \u003Cbr> | arXiv | 2023-12-06 | [Github](https:\u002F\u002Fgithub.com\u002Fcsuhan\u002FOneLLM) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fcsuhan\u002FOneLLM) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMeituan-AutoML\u002FLenna.svg?style=social&label=Star) \u003Cbr> [**Lenna: Language Enhanced Reasoning Detection Assistant**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02433.pdf) \u003Cbr> | arXiv | 2023-12-05 | [Github](https:\u002F\u002Fgithub.com\u002FMeituan-AutoML\u002FLenna) | - | \n| [**VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02310.pdf) | arXiv | 2023-12-04 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRenShuhuai-Andy\u002FTimeChat.svg?style=social&label=Star) \u003Cbr> [**TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02051.pdf) \u003Cbr> | arXiv | 2023-12-04 | [Github](https:\u002F\u002Fgithub.com\u002FRenShuhuai-Andy\u002FTimeChat) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmu-cai\u002Fvip-llava.svg?style=social&label=Star) \u003Cbr> [**Making Large Multimodal Models Understand Arbitrary Visual Prompts**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00784.pdf) \u003Cbr> | CVPR | 2023-12-01 | [Github](https:\u002F\u002Fgithub.com\u002Fmu-cai\u002Fvip-llava) | [Demo](https:\u002F\u002Fpages.cs.wisc.edu\u002F~mucai\u002Fvip-llava.html) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvlm-driver\u002FDolphins.svg?style=social&label=Star) \u003Cbr> [**Dolphins: Multimodal Language Model for Driving**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00438.pdf) \u003Cbr> | arXiv | 2023-12-01 | [Github](https:\u002F\u002Fgithub.com\u002Fvlm-driver\u002FDolphins) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpen3DA\u002FLL3DA.svg?style=social&label=Star) \u003Cbr> [**LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18651.pdf) \u003Cbr> | arXiv | 2023-11-30 | [Github](https:\u002F\u002Fgithub.com\u002FOpen3DA\u002FLL3DA) | [Coming soon]() |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuangb23\u002FVTimeLLM.svg?style=social&label=Star) \u003Cbr> [**VTimeLLM: Empower LLM to Grasp Video Moments**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18445.pdf) \u003Cbr> | arXiv | 2023-11-30 | [Github](https:\u002F\u002Fgithub.com\u002Fhuangb23\u002FVTimeLLM\u002F) | Local Demo |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-DocOwl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18248.pdf) \u003Cbr> | arXiv | 2023-11-30 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-DocOwl\u002Ftree\u002Fmain\u002FPaperOwl) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLaMA-VID.svg?style=social&label=Star) \u003Cbr> [**LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17043.pdf) \u003Cbr> | arXiv | 2023-11-28 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID) | [Coming soon]() |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLMGA.svg?style=social&label=Star) \u003Cbr> [**LLMGA: Multimodal Large Language Model based Generation Assistant**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16500.pdf) \u003Cbr> | arXiv | 2023-11-27 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLMGA) | [Demo](https:\u002F\u002Fbaa55ef8590b623f18.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftingxueronghua\u002FChartLlama-code.svg?style=social&label=Star) \u003Cbr> [**ChartLlama: A Multimodal LLM for Chart Understanding and Generation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16483.pdf) \u003Cbr> | arXiv | 2023-11-27 | [Github](https:\u002F\u002Fgithub.com\u002Ftingxueronghua\u002FChartLlama-code) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**ShareGPT4V: Improving Large Multi-Modal Models with Better Captions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.12793.pdf) \u003Cbr> | arXiv | 2023-11-21 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer\u002Ftree\u002Fmain\u002Fprojects\u002FShareGPT4V) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLin-Chen\u002FShareGPT4V-7B) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frshaojimmy\u002FJiuTian.svg?style=social&label=Star) \u003Cbr> [**LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.11860.pdf) \u003Cbr> | arXiv | 2023-11-20 | [Github](https:\u002F\u002Fgithub.com\u002Frshaojimmy\u002FJiuTian) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fembodied-generalist\u002Fembodied-generalist.svg?style=social&label=Star) \u003Cbr> [**An Embodied Generalist Agent in 3D World**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.12871.pdf) \u003Cbr> | arXiv | 2023-11-18 | [Github](https:\u002F\u002Fgithub.com\u002Fembodied-generalist\u002Fembodied-generalist) | [Demo](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=mlnjz4eSjB4) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FVideo-LLaVA.svg?style=social&label=Star) \u003Cbr> [**Video-LLaVA: Learning United Visual Representation by Alignment Before Projection**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.10122.pdf) \u003Cbr> | arXiv | 2023-11-16 | [Github](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FVideo-LLaVA) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FChat-UniVi.svg?style=social&label=Star) \u003Cbr> [**Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.08046) \u003Cbr> | CVPR | 2023-11-14 | [Github](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FChat-UniVi) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX2FD\u002FLVIS-INSTRUCT4V.svg?style=social&label=Star) \u003Cbr> [**To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.07574.pdf) \u003Cbr> | arXiv | 2023-11-13 | [Github](https:\u002F\u002Fgithub.com\u002FX2FD\u002FLVIS-INSTRUCT4V) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlpha-VLLM\u002FLLaMA2-Accessory.svg?style=social&label=Star) \u003Cbr> [**SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.07575.pdf) \u003Cbr> | arXiv | 2023-11-13 | [Github](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLLaMA2-Accessory) | [Demo](http:\u002F\u002Fimagebind-llm.opengvlab.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuliang-Liu\u002FMonkey.svg?style=social&label=Star) \u003Cbr> [**Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.06607.pdf) \u003Cbr> | CVPR | 2023-11-11 | [Github](https:\u002F\u002Fgithub.com\u002FYuliang-Liu\u002FMonkey) | [Demo](http:\u002F\u002F27.17.184.224:7681\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLaVA-VL\u002FLLaVA-Plus-Codebase.svg?style=social&label=Star) \u003Cbr> [**LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.05437.pdf) \u003Cbr> | arXiv | 2023-11-09 | [Github](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Plus-Codebase) | [Demo](https:\u002F\u002Fllavaplus.ngrok.io\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNExT-ChatV\u002FNExT-Chat.svg?style=social&label=Star) \u003Cbr> [**NExT-Chat: An LMM for Chat, Detection and Segmentation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04498.pdf) \u003Cbr> | arXiv | 2023-11-08 | [Github](https:\u002F\u002Fgithub.com\u002FNExT-ChatV\u002FNExT-Chat) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-Owl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04257.pdf) \u003Cbr> | arXiv | 2023-11-07 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-Owl\u002Ftree\u002Fmain\u002FmPLUG-Owl2) | [Demo](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fdamo\u002FmPLUG-Owl2\u002Fsummary) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLuodian\u002FOtter.svg?style=social&label=Star) \u003Cbr> [**OtterHD: A High-Resolution Multi-modality Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04219.pdf) \u003Cbr> | arXiv | 2023-11-07 | [Github](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter) | - |\n| [**CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative 
Decoding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03354.pdf) | arXiv | 2023-11-06 | [Coming soon]() | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FgroundingLMM.svg?style=social&label=Star) \u003Cbr> [**GLaMM: Pixel Grounding Large Multimodal Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03356.pdf) \u003Cbr> | CVPR | 2023-11-06 | [Github](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FgroundingLMM) | [Demo](https:\u002F\u002Fglamm.mbzuai-oryx.ngrok.app\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FComVint.svg?style=social&label=Star) \u003Cbr> [**What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.01487.pdf) \u003Cbr> | arXiv | 2023-11-02 | [Github](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FComVint) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FMiniGPT-4.svg?style=social&label=Star) \u003Cbr> [**MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.09478.pdf) \u003Cbr> | arXiv | 2023-10-14 | [Github](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FSALMONN.svg?style=social&label=Star) \u003Cbr> [**SALMONN: Towards Generic Hearing Abilities for Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.13289) \u003Cbr> | ICLR | 2023-10-20 | [Github](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FSALMONN) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fapple\u002Fml-ferret.svg?style=social&label=Star) \u003Cbr> [**Ferret: Refer and Ground Anything Anywhere at Any Granularity**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.07704.pdf) \u003Cbr> | arXiv | 2023-10-11 | [Github](https:\u002F\u002Fgithub.com\u002Fapple\u002Fml-ferret) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FCogVLM.svg?style=social&label=Star) \u003Cbr> [**CogVLM: Visual Expert For Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03079.pdf) \u003Cbr> | arXiv | 2023-10-09 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVLM) | [Demo](http:\u002F\u002F36.103.203.44:7861\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaotian-liu\u002FLLaVA.svg?style=social&label=Star) \u003Cbr> [**Improved Baselines with Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.03744.pdf) \u003Cbr> | arXiv | 2023-10-05 | [Github](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) | [Demo](https:\u002F\u002Fllava.hliu.cc\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FLanguageBind.svg?style=social&label=Star) \u003Cbr> [**LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01852.pdf) \u003Cbr> | ICLR | 2023-10-03 | [Github](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FLanguageBind) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FLanguageBind) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSY-Xuan\u002FPink.svg?style=social&label=Star) \u003Cbr> [**Pink: Unveiling 
the Power of Referential Comprehension for Multi-modal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00582.pdf) | arXiv | 2023-10-01 | [Github](https:\u002F\u002Fgithub.com\u002FSY-Xuan\u002FPink) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthunlp\u002FMuffin.svg?style=social&label=Star) \u003Cbr> [**Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00653.pdf) \u003Cbr> | arXiv | 2023-10-01 | [Github](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FMuffin) | Local Demo | \n| [**AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.16058.pdf) | arXiv | 2023-09-27 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.15112.pdf) \u003Cbr> | arXiv | 2023-09-26 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRunpeiDong\u002FDreamLLM.svg?style=social&label=Star) \u003Cbr> [**DreamLLM: Synergistic Multimodal Comprehension and Creation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.11499.pdf) \u003Cbr> | ICLR | 2023-09-20 | [Github](https:\u002F\u002Fgithub.com\u002FRunpeiDong\u002FDreamLLM) | [Coming soon]() |\n| [**An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.09958.pdf) | arXiv | 2023-09-18 | [Coming soon]() | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSihengLi99\u002FTextBind.svg?style=social&label=Star) \u003Cbr> [**TextBind: Multi-turn Interleaved Multimodal Instruction-following**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.08637.pdf) \u003Cbr> | arXiv | 2023-09-14 | [Github](https:\u002F\u002Fgithub.com\u002FSihengLi99\u002FTextBind) | [Demo](https:\u002F\u002Failabnlp.tencent.com\u002Fresearch_demos\u002Ftextbind\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNExT-GPT\u002FNExT-GPT.svg?style=social&label=Star) \u003Cbr> [**NExT-GPT: Any-to-Any Multimodal LLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.05519.pdf) \u003Cbr> | arXiv | 2023-09-11 | [Github](https:\u002F\u002Fgithub.com\u002FNExT-GPT\u002FNExT-GPT) | [Demo](https:\u002F\u002Ffc7a82a1c76b336b6f.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUCSC-VLAA\u002FSight-Beyond-Text.svg?style=social&label=Star) \u003Cbr> [**Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.07120.pdf) \u003Cbr> | arXiv | 2023-09-13 | [Github](https:\u002F\u002Fgithub.com\u002FUCSC-VLAA\u002FSight-Beyond-Text) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FLLaMA-Adapter.svg?style=social&label=Star) \u003Cbr> [**ImageBind-LLM: Multi-modality Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.03905.pdf) \u003Cbr> | arXiv | 2023-09-07 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FLLaMA-Adapter) | [Demo](http:\u002F\u002Fimagebind-llm.opengvlab.com\u002F) |\n| [**Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction 
Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.02591.pdf) | arXiv | 2023-09-05 | - | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenRobotLab\u002FPointLLM.svg?style=social&label=Star) \u003Cbr> [**PointLLM: Empowering Large Language Models to Understand Point Clouds**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16911.pdf) \u003Cbr> | arXiv | 2023-08-31 | [Github](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FPointLLM) | [Demo](http:\u002F\u002F101.230.144.196\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHYPJUDY\u002FSparkles.svg?style=social&label=Star) \u003Cbr> [**✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16463.pdf) \u003Cbr> | arXiv | 2023-08-31 | [Github](https:\u002F\u002Fgithub.com\u002FHYPJUDY\u002FSparkles) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopendatalab\u002FMLLM-DataEngine.svg?style=social&label=Star) \u003Cbr> [**MLLM-DataEngine: An Iterative Refinement Approach for MLLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.13566.pdf) \u003Cbr> | arXiv | 2023-08-25 | [Github](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMLLM-DataEngine) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPVIT-official\u002FPVIT.svg?style=social&label=Star) \u003Cbr> [**Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.13437.pdf) \u003Cbr> | arXiv | 2023-08-25 | [Github](https:\u002F\u002Fgithub.com\u002FPVIT-official\u002FPVIT) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FPVIT\u002Fpvit) |  \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen-VL.svg?style=social&label=Star) \u003Cbr> [**Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.12966.pdf) \u003Cbr> | arXiv | 2023-08-24 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-VL) | [Demo](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fqwen\u002FQwen-VL-Chat-Demo\u002Fsummary) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenBMB\u002FVisCPM.svg?style=social&label=Star) \u003Cbr> [**Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.12038.pdf) \u003Cbr> | ICLR | 2023-08-23 | [Github](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisCPM) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopenbmb\u002Fviscpm-chat) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ficoz69\u002FStableLLAVA.svg?style=social&label=Star) \u003Cbr> [**StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.10253.pdf) \u003Cbr> | arXiv | 2023-08-20 | [Github](https:\u002F\u002Fgithub.com\u002Ficoz69\u002FStableLLAVA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmlpc-ucsd\u002FBLIVA.svg?style=social&label=Star) \u003Cbr> [**BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.09936.pdf) \u003Cbr> | arXiv | 2023-08-19 | [Github](https:\u002F\u002Fgithub.com\u002Fmlpc-ucsd\u002FBLIVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmlpc-lab\u002FBLIVA) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDCDmllm\u002FCheetah.svg?style=social&label=Star) \u003Cbr> [**Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.04152.pdf) \u003Cbr> | arXiv | 2023-08-08 | [Github](https:\u002F\u002Fgithub.com\u002FDCDmllm\u002FCheetah) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FAll-Seeing.svg?style=social&label=Star) \u003Cbr> [**The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.01907.pdf) \u003Cbr> | ICLR | 2023-08-03 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAll-Seeing) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenGVLab\u002Fall-seeing) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLISA.svg?style=social&label=Star) \u003Cbr> [**LISA: Reasoning Segmentation via Large Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.00692.pdf) \u003Cbr> | arXiv | 2023-08-01 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLISA) | [Demo](http:\u002F\u002F103.170.5.190:7860) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frese1f\u002FMovieChat.svg?style=social&label=Star) \u003Cbr> [**MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.16449.pdf) \u003Cbr> | arXiv | 2023-07-31 | [Github](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUMass-Foundation-Model\u002F3D-LLM.svg?style=social&label=Star) \u003Cbr> [**3D-LLM: Injecting the 3D World into Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.12981.pdf) \u003Cbr> | arXiv | 2023-07-24 | [Github](https:\u002F\u002Fgithub.com\u002FUMass-Foundation-Model\u002F3D-LLM) | - | \n| [**ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.09474.pdf) \u003Cbr> | arXiv | 2023-07-18 | - | [Demo](https:\u002F\u002Fchatspot.streamlit.app\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmagic-research\u002Fbubogpt.svg?style=social&label=Star) \u003Cbr> [**BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.08581.pdf) \u003Cbr> | arXiv | 2023-07-17 | [Github](https:\u002F\u002Fgithub.com\u002Fmagic-research\u002Fbubogpt) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmagicr\u002FBuboGPT) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBAAI-DCAI\u002FVisual-Instruction-Tuning.svg?style=social&label=Star) \u003Cbr> [**SVIT: Scaling up Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.04087.pdf) \u003Cbr> | arXiv | 2023-07-09 | [Github](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FVisual-Instruction-Tuning) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjshilong\u002FGPT4RoI.svg?style=social&label=Star) \u003Cbr> [**GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.03601.pdf) \u003Cbr> | arXiv | 2023-07-07 | [Github](https:\u002F\u002Fgithub.com\u002Fjshilong\u002FGPT4RoI) | [Demo](http:\u002F\u002F139.196.83.164:7000\u002F) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002Flynx-llm.svg?style=social&label=Star) \u003Cbr> [**What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02469.pdf) \u003Cbr> | arXiv | 2023-07-05 | [Github](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Flynx-llm)  | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-DocOwl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02499.pdf) \u003Cbr> | arXiv | 2023-07-04 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-DocOwl) | [Demo](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fdamo\u002FmPLUG-DocOwl\u002Fsummary) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChenDelong1999\u002Fpolite_flamingo.svg?style=social&label=Star) \u003Cbr> [**Visual Instruction Tuning with Polite Flamingo**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.01003.pdf) \u003Cbr >| arXiv | 2023-07-03 | [Github](https:\u002F\u002Fgithub.com\u002FChenDelong1999\u002Fpolite_flamingo) | [Demo](http:\u002F\u002Fclever_flamingo.xiaoice.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSALT-NLP\u002FLLaVAR.svg?style=social&label=Star) \u003Cbr> [**LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.17107.pdf) \u003Cbr> | arXiv | 2023-06-29 | [Github](https:\u002F\u002Fgithub.com\u002FSALT-NLP\u002FLLaVAR) | [Demo](https:\u002F\u002Feba470c07c805702b8.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshikras\u002Fshikra.svg?style=social&label=Star) \u003Cbr> [**Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15195.pdf) \u003Cbr> | arXiv | 2023-06-27 | [Github](https:\u002F\u002Fgithub.com\u002Fshikras\u002Fshikra) | [Demo](http:\u002F\u002Fdemo.zhaozhang.net:7860\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenMotionLab\u002FMotionGPT.svg?style=social&label=Star) \u003Cbr> [**MotionGPT: Human Motion as a Foreign Language**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14795.pdf) \u003Cbr> | arXiv | 2023-06-26 | [Github](https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flyuchenyang\u002FMacaw-LLM.svg?style=social&label=Star) \u003Cbr> [**Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.09093.pdf) \u003Cbr> | arXiv | 2023-06-15 | [Github](https:\u002F\u002Fgithub.com\u002Flyuchenyang\u002FMacaw-LLM) | [Coming soon]() |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenLAMM\u002FLAMM.svg?style=social&label=Star) \u003Cbr> [**LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.06687.pdf) \u003Cbr> | arXiv | 2023-06-11 | [Github](https:\u002F\u002Fgithub.com\u002FOpenLAMM\u002FLAMM) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopenlamm\u002FLAMM) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FVideo-ChatGPT.svg?style=social&label=Star) \u003Cbr> [**Video-ChatGPT: 
Towards Detailed Video Understanding via Large Vision and Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05424.pdf) \u003Cbr> | arXiv | 2023-06-08 | [Github](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT) | [Demo](https:\u002F\u002Fwww.ival-mbzuai.com\u002Fvideo-chatgpt) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLuodian\u002FOtter.svg?style=social&label=Star) \u003Cbr> [**MIMIC-IT: Multi-Modal In-Context Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05425.pdf) \u003Cbr> | arXiv | 2023-06-08 | [Github](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter) | [Demo](https:\u002F\u002Fotter.cliangyu.com\u002F) |\n| [**M\u003Csup>3\u003C\u002Fsup>IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.04387.pdf) | arXiv | 2023-06-07 | - | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideo-LLaMA.svg?style=social&label=Star) \u003Cbr> [**Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.02858.pdf) \u003Cbr> | arXiv | 2023-06-05 | [Github](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideo-LLaMA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FDAMO-NLP-SG\u002FVideo-LLaMA) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLaVA-Med.svg?style=social&label=Star) \u003Cbr> [**LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.00890.pdf) \u003Cbr> | arXiv | 2023-06-01 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLaVA-Med) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FStevenGrove\u002FGPT4Tools.svg?style=social&label=Star) \u003Cbr> [**GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.18752.pdf) \u003Cbr> | arXiv | 2023-05-30 | [Github](https:\u002F\u002Fgithub.com\u002FStevenGrove\u002FGPT4Tools) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fstevengrove\u002FGPT4Tools) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyxuansu\u002FPandaGPT.svg?style=social&label=Star) \u003Cbr> [**PandaGPT: One Model To Instruction-Follow Them All**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.16355.pdf) \u003Cbr> | arXiv | 2023-05-25 | [Github](https:\u002F\u002Fgithub.com\u002Fyxuansu\u002FPandaGPT) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FGMFTBY\u002FPandaGPT) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjoez17\u002FChatBridge.svg?style=social&label=Star) \u003Cbr> [**ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.16103.pdf) \u003Cbr> | arXiv | 2023-05-25 | [Github](https:\u002F\u002Fgithub.com\u002Fjoez17\u002FChatBridge) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fluogen1996\u002FLaVIN.svg?style=social&label=Star) \u003Cbr> [**Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.15023.pdf) \u003Cbr> | arXiv | 2023-05-24 | [Github](https:\u002F\u002Fgithub.com\u002Fluogen1996\u002FLaVIN) | Local Demo |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOptimalScale\u002FDetGPT.svg?style=social&label=Star) \u003Cbr> [**DetGPT: Detect What You Need via Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14167.pdf) \u003Cbr> | arXiv | 2023-05-23 | [Github](https:\u002F\u002Fgithub.com\u002FOptimalScale\u002FDetGPT) | [Demo](https:\u002F\u002Fd3c431c0c77b1d9010.gradio.live\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FPengi.svg?style=social&label=Star) \u003Cbr> [**Pengi: An Audio Language Model for Audio Tasks**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.11834.pdf) \u003Cbr> | NeurIPS | 2023-05-19 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPengi) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FVisionLLM.svg?style=social&label=Star) \u003Cbr> [**VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.11175.pdf) \u003Cbr> | arXiv | 2023-05-18 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FVisionLLM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuanGongND\u002Fltu.svg?style=social&label=Star) \u003Cbr> [**Listen, Think, and Understand**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.10790.pdf) \u003Cbr> | arXiv | 2023-05-18 | [Github](https:\u002F\u002Fgithub.com\u002FYuanGongND\u002Fltu) | [Demo](https:\u002F\u002Fgithub.com\u002FYuanGongND\u002Fltu) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FVisualGLM-6B.svg?style=social&label=Star) \u003Cbr> **VisualGLM-6B** \u003Cbr> | - | 2023-05-17 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FVisualGLM-6B) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxiaoman-zhang\u002FPMC-VQA.svg?style=social&label=Star) \u003Cbr> [**PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.10415.pdf) \u003Cbr> | arXiv | 2023-05-17 | [Github](https:\u002F\u002Fgithub.com\u002Fxiaoman-zhang\u002FPMC-VQA) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsalesforce\u002FLAVIS.svg?style=social&label=Star) \u003Cbr> [**InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.06500.pdf) \u003Cbr> | arXiv | 2023-05-11 | [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Finstructblip) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FAsk-Anything.svg?style=social&label=Star) \u003Cbr> [**VideoChat: Chat-Centric Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.06355.pdf) \u003Cbr> | arXiv | 2023-05-10 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything) | [Demo](https:\u002F\u002Fask.opengvlab.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopen-mmlab\u002FMultimodal-GPT.svg?style=social&label=Star) \u003Cbr> [**MultiModal-GPT: A Vision and Language Model for Dialogue with Humans**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04790.pdf) \u003Cbr> | arXiv | 2023-05-08 | [Github](https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002FMultimodal-GPT) | [Demo](https:\u002F\u002Fmmgpt.openmmlab.org.cn\u002F) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fphellonchen\u002FX-LLM.svg?style=social&label=Star) \u003Cbr> [**X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04160.pdf) \u003Cbr> | arXiv | 2023-05-07 | [Github](https:\u002F\u002Fgithub.com\u002Fphellonchen\u002FX-LLM) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYunxinLi\u002FLingCloud.svg?style=social&label=Star) \u003Cbr> [**LMEye: An Interactive Perception Network for Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.03701.pdf) \u003Cbr> | arXiv | 2023-05-05 | [Github](https:\u002F\u002Fgithub.com\u002FYunxinLi\u002FLingCloud) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FLLaMA-Adapter.svg?style=social&label=Star) \u003Cbr> [**LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.15010.pdf) \u003Cbr> | arXiv | 2023-04-28 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FLLaMA-Adapter) | [Demo](http:\u002F\u002Fllama-adapter.opengvlab.com\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-Owl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.14178.pdf) \u003Cbr> | arXiv | 2023-04-27 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-Owl) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FMAGAer13\u002FmPLUG-Owl) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FMiniGPT-4.svg?style=social&label=Star) \u003Cbr> [**MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.10592.pdf) \u003Cbr> | arXiv | 2023-04-20 | [Github](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaotian-liu\u002FLLaVA.svg?style=social&label=Star) \u003Cbr> [**Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.08485.pdf) \u003Cbr> | NeurIPS | 2023-04-17 | [GitHub](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) | [Demo](https:\u002F\u002Fllava.hliu.cc\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FLLaMA-Adapter.svg?style=social&label=Star) \u003Cbr> [**LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.16199.pdf) \u003Cbr> | ICLR | 2023-03-28 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FLLaMA-Adapter) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fcsuhan\u002FLLaMA-Adapter) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVT-NLP\u002FMultiInstruct.svg?style=social&label=Star) \u003Cbr> [**MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10773.pdf) \u003Cbr> | ACL | 2022-12-21 | [Github](https:\u002F\u002Fgithub.com\u002FVT-NLP\u002FMultiInstruct) | - | \n\n## Multimodal Hallucination\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F1zhou-Wang\u002FMemVR.svg?style=social&label=Star) \u003Cbr> [**Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.03577) \u003Cbr> | arXiv | 2024-10-04 | [Github](https:\u002F\u002Fgithub.com\u002F1zhou-Wang\u002FMemVR) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fnickjiang2378\u002Fvl-interp.svg?style=social&label=Star) \u003Cbr> [**Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.02762) \u003Cbr> | arXiv | 2024-10-03 | [Github](https:\u002F\u002Fgithub.com\u002Fnickjiang2378\u002Fvl-interp\u002F) | - |\n| [**FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.13612) | arXiv | 2024-09-20 | [Link](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002FFIHA-45BB) | - | \n| [**Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.00555) | arXiv | 2024-08-01 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLALBJ\u002FPAI.svg?style=social&label=Star) \u003Cbr> [**Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.21771) \u003Cbr> | ECCV | 2024-07-31 | [Github](https:\u002F\u002Fgithub.com\u002FLALBJ\u002FPAI) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmrwu-mac\u002FR-Bench.svg?style=social&label=Star) \u003Cbr> [**Evaluating and Analyzing Relationship Hallucinations in LVLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.16449) \u003Cbr> | ICML | 2024-06-24 | [Github](https:\u002F\u002Fgithub.com\u002Fmrwu-mac\u002FR-Bench) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLackel\u002FAGLA.svg?style=social&label=Star) \u003Cbr> [**AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.12718) \u003Cbr> | arXiv | 2024-06-18 | [Github](https:\u002F\u002Fgithub.com\u002FLackel\u002FAGLA) | - |\n| [**CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.01920) | arXiv | 2024-06-04 | [Coming soon]() | - |\n| [**Mitigating Object Hallucination via Data Augmented Contrastive Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.18654) | arXiv | 2024-05-28 | [Coming soon]() | - |\n| [**VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.15683) | arXiv | 2024-05-24 | [Coming soon]() | - |\n| [**Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.14233.pdf) | arXiv | 2024-04-22 | - | - |\n| [**Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.18715.pdf) | arXiv | 2024-03-27 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIVY-LVLM\u002FCounterfactual-Inception.svg?style=social&label=Star) \u003Cbr> [**What if...?: 
Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.13513.pdf) \u003Cbr> | arXiv | 2024-03-20 | [Github](https:\u002F\u002Fgithub.com\u002FIVY-LVLM\u002FCounterfactual-Inception) | - |\n| [**Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.08730.pdf) | arXiv | 2024-03-13 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfzhang114\u002FLLaVA-Align.svg?style=social&label=Star) \u003Cbr> [**Debiasing Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.05262) \u003Cbr> | arXiv | 2024-03-08 | [Github](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FLLaVA-Align) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBillChan226\u002FHALC.svg?style=social&label=Star) \u003Cbr> [**HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.00425.pdf) \u003Cbr> | arXiv | 2024-03-01 | [Github](https:\u002F\u002Fgithub.com\u002FBillChan226\u002FHALC) | - |\n| [**IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.18476.pdf) | arXiv | 2024-02-28 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyuezih\u002Fless-is-more.svg?style=social&label=Star) \u003Cbr> [**Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.14545.pdf) \u003Cbr> | arXiv | 2024-02-22 | [Github](https:\u002F\u002Fgithub.com\u002Fyuezih\u002Fless-is-more) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHyperwjf\u002FLogicCheckGPT.svg?style=social&label=Star) \u003Cbr> [**Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11622.pdf) \u003Cbr> | arXiv | 2024-02-18 | [Github](https:\u002F\u002Fgithub.com\u002FHyperwjf\u002FLogicCheckGPT) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMasaiahHan\u002FCorrelationQA.svg?style=social&label=Star) \u003Cbr> [**The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.03757.pdf) \u003Cbr> | arXiv | 2024-02-06 | [Github](https:\u002F\u002Fgithub.com\u002FMasaiahHan\u002FCorrelationQA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenKG-ORG\u002FEasyDetect.svg?style=social&label=Star) \u003Cbr> [**Unified Hallucination Detection for Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.03190.pdf) \u003Cbr> | arXiv | 2024-02-05 | [Github](https:\u002F\u002Fgithub.com\u002FOpenKG-ORG\u002FEasyDetect) | - |\n| [**A Survey on Hallucination in Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.00253.pdf) | arXiv | 2024-02-01 | - | - |\n| [**Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.09861.pdf) | arXiv | 2024-01-18 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-HalOwl.svg?style=social&label=Star) \u003Cbr> [**Hallucination Augmented Contrastive Learning for Multimodal Large Language 
Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.06968.pdf) \u003Cbr> | arXiv | 2023-12-12 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-HalOwl\u002Ftree\u002Fmain\u002Fhacl) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fassafbk\u002Fmocha_code.svg?style=social&label=Star) \u003Cbr> [**MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03631.pdf) \u003Cbr> | arXiv | 2023-12-06 | [Github](https:\u002F\u002Fgithub.com\u002Fassafbk\u002Fmocha_code) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAnonymousanoy\u002FFOHE.svg?style=social&label=Star) \u003Cbr> [**Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.01701.pdf) \u003Cbr> | arXiv | 2023-12-04 | [Github](https:\u002F\u002Fgithub.com\u002FAnonymousanoy\u002FFOHE) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRLHF-V\u002FRLHF-V.svg?style=social&label=Star) \u003Cbr> [**RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00849.pdf) \u003Cbr> | arXiv | 2023-12-01 | [Github](https:\u002F\u002Fgithub.com\u002FRLHF-V\u002FRLHF-V) | [Demo](http:\u002F\u002F120.92.209.146:8081\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshikiw\u002FOPERA.svg?style=social&label=Star) \u003Cbr> [**OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17911.pdf) \u003Cbr> | CVPR | 2023-11-29 | [Github](https:\u002F\u002Fgithub.com\u002Fshikiw\u002FOPERA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVCD.svg?style=social&label=Star) \u003Cbr> [**Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16922.pdf) \u003Cbr> | CVPR | 2023-11-28 | [Github](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVCD) | - |\n| [**Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16839.pdf) | arXiv | 2023-11-28 | [Github](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FHA-DPO) | [Coming soon]() |\n| [**Mitigating Hallucination in Visual Language Models with Visual Supervision**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16479.pdf) | arXiv | 2023-11-27 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuqifan1117\u002FHalluciDoctor.svg?style=social&label=Star) \u003Cbr> [**HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.13614.pdf) \u003Cbr> | arXiv | 2023-11-22 | [Github](https:\u002F\u002Fgithub.com\u002FYuqifan1117\u002FHalluciDoctor) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjunyangwang0410\u002FAMBER.svg?style=social&label=Star) \u003Cbr> [**An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.07397.pdf) \u003Cbr> | arXiv | 2023-11-13 | [Github](https:\u002F\u002Fgithub.com\u002Fjunyangwang0410\u002FAMBER) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbcdnlp\u002FFAITHSCORE.svg?style=social&label=Star) \u003Cbr> [**FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.01477.pdf) \u003Cbr> | arXiv | 2023-11-02 | [Github](https:\u002F\u002Fgithub.com\u002Fbcdnlp\u002FFAITHSCORE) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FWoodpecker.svg?style=social&label=Star) \u003Cbr> [**Woodpecker: Hallucination Correction for Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.16045.pdf) \u003Cbr> | arXiv | 2023-10-24 | [Github](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FWoodpecker) | [Demo](https:\u002F\u002Fdeb6a97bae6fab67ae.gradio.live\u002F) |\n| [**Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.05338.pdf) | arXiv | 2023-10-09 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbronyayang\u002FHallE_Switch.svg?style=social&label=Star) \u003Cbr> [**HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01779.pdf) \u003Cbr> | arXiv | 2023-10-03 | [Github](https:\u002F\u002Fgithub.com\u002Fbronyayang\u002FHallE_Switch) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYiyangZhou\u002FLURE.svg?style=social&label=Star) \u003Cbr> [**Analyzing and Mitigating Object Hallucination in Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00754.pdf) \u003Cbr> | ICLR | 2023-10-01 | [Github](https:\u002F\u002Fgithub.com\u002FYiyangZhou\u002FLURE) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fllava-rlhf\u002FLLaVA-RLHF.svg?style=social&label=Star) \u003Cbr> [**Aligning Large Multimodal Models with Factually Augmented RLHF**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.14525.pdf) \u003Cbr> | arXiv | 2023-09-25 | [Github](https:\u002F\u002Fgithub.com\u002Fllava-rlhf\u002FLLaVA-RLHF) | [Demo](http:\u002F\u002Fpitt.lti.cs.cmu.edu:7890\u002F) |\n| [**Evaluation and Mitigation of Agnosia in Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.04041.pdf) | arXiv | 2023-09-07 | - | - |\n| [**CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.02301.pdf) | arXiv | 2023-09-05 | - | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjunyangwang0410\u002FHaELM.svg?style=social&label=Star) \u003Cbr> [**Evaluation and Analysis of Hallucination in Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.15126.pdf) \u003Cbr> | arXiv | 2023-08-29 | [Github](https:\u002F\u002Fgithub.com\u002Fjunyangwang0410\u002FHaELM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopendatalab\u002FVIGC.svg?style=social&label=Star) \u003Cbr> [**VIGC: Visual Instruction Generation and Correction**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.12714.pdf) \u003Cbr> | arXiv | 2023-08-24 | [Github](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FVIGC) | [Demo](https:\u002F\u002Fopendatalab.github.io\u002FVIGC) | \n| [**Detecting and Preventing Hallucinations in Large Vision Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.06394.pdf) | arXiv | 
2023-08-11 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFuxiaoLiu\u002FLRV-Instruction.svg?style=social&label=Star) \u003Cbr> [**Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14565.pdf) \u003Cbr> | ICLR | 2023-06-26 | [Github](https:\u002F\u002Fgithub.com\u002FFuxiaoLiu\u002FLRV-Instruction) | [Demo](https:\u002F\u002F7b6590ed039a06475d.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FPOPE.svg?style=social&label=Star) \u003Cbr> [**Evaluating Object Hallucination in Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.10355.pdf) \u003Cbr> | EMNLP | 2023-05-17 | [Github](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FPOPE) | - |\n\n## Multimodal In-Context Learning\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| [**Visual In-Context Learning for Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11574.pdf) | arXiv | 2024-02-18 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuanJianhao508\u002FRAG-Driver.svg?style=social&label=Star) \u003Cbr> [**RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.10828) \u003Cbr> | RSS | 2024-02-16 | [Github](https:\u002F\u002Fgithub.com\u002FYuanJianhao508\u002FRAG-Driver) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUW-Madison-Lee-Lab\u002FCoBSAT.svg?style=social&label=Star) \u003Cbr> [**Can MLLMs Perform Text-to-Image In-Context Learning?**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.01293.pdf) \u003Cbr> | arXiv | 2024-02-02 | [Github](https:\u002F\u002Fgithub.com\u002FUW-Madison-Lee-Lab\u002FCoBSAT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu.svg?style=social&label=Star) \u003Cbr> [**Generative Multimodal Models are In-Context Learners**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.13286) \u003Cbr> | CVPR | 2023-12-20 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu\u002Ftree\u002Fmain\u002FEmu2) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FBAAI\u002FEmu2) |\n| [**Hijacking Context in Large Multi-modal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07553.pdf) | arXiv | 2023-12-07 | - | - |\n| [**Towards More Unified In-context Visual Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02520.pdf) | arXiv | 2023-12-05 | - | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHaozheZhao\u002FMIC.svg?style=social&label=Star) \u003Cbr> [**MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.07915.pdf) \u003Cbr> | arXiv | 2023-09-14 | [Github](https:\u002F\u002Fgithub.com\u002FHaozheZhao\u002FMIC) | [Demo](https:\u002F\u002F8904cdd23621858859.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fisekai-portal\u002FLink-Context-Learning.svg?style=social&label=Star) \u003Cbr> [**Link-Context Learning for Multimodal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.07891.pdf) \u003Cbr> | arXiv | 2023-08-15 | [Github](https:\u002F\u002Fgithub.com\u002Fisekai-portal\u002FLink-Context-Learning) | 
[Demo](http:\u002F\u002F117.144.81.99:20488\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmlfoundations\u002Fopen_flamingo.svg?style=social&label=Star) \u003Cbr> [**OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.01390.pdf) \u003Cbr> | arXiv | 2023-08-02 | [Github](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_flamingo) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopenflamingo\u002FOpenFlamingo) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnap-stanford\u002Fmed-flamingo.svg?style=social&label=Star) \u003Cbr> [**Med-Flamingo: a Multimodal Medical Few-shot Learner**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.15189.pdf) \u003Cbr> | arXiv | 2023-07-27 | [Github](https:\u002F\u002Fgithub.com\u002Fsnap-stanford\u002Fmed-flamingo) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu.svg?style=social&label=Star) \u003Cbr> [**Generative Pretraining in Multimodality**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.05222.pdf) \u003Cbr> | ICLR | 2023-07-11 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu\u002Ftree\u002Fmain\u002FEmu1) | [Demo](http:\u002F\u002F218.91.113.230:9002\u002F) |\n| [**AVIS: Autonomous Visual Information Seeking with Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.08129.pdf) | arXiv | 2023-06-13 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLuodian\u002FOtter.svg?style=social&label=Star) \u003Cbr> [**MIMIC-IT: Multi-Modal In-Context Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05425.pdf) \u003Cbr> | arXiv | 2023-06-08 | [Github](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter) | [Demo](https:\u002F\u002Fotter.cliangyu.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyongliang-wu\u002FExploreCfg.svg?style=social&label=Star) \u003Cbr> [**Exploring Diverse In-Context Configurations for Image Captioning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14800.pdf) \u003Cbr> | NeurIPS | 2023-05-24 | [Github](https:\u002F\u002Fgithub.com\u002Fyongliang-wu\u002FExploreCfg) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flupantech\u002Fchameleon-llm.svg?style=social&label=Star) \u003Cbr> [**Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.09842.pdf) \u003Cbr> | arXiv | 2023-04-19 | [Github](https:\u002F\u002Fgithub.com\u002Flupantech\u002Fchameleon-llm) | [Demo](https:\u002F\u002Fchameleon-llm.github.io\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FJARVIS.svg?style=social&label=Star) \u003Cbr> [**HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.17580.pdf) \u003Cbr> | arXiv | 2023-03-30 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FJARVIS) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FHuggingGPT) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMM-REACT.svg?style=social&label=Star) \u003Cbr> [**MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.11381.pdf) \u003Cbr> | arXiv | 2023-03-20 | 
[Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMM-REACT) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft-cognitive-service\u002Fmm-react) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMAEHCM\u002FICL-D3IE.svg?style=social&label=Star) \u003Cbr> [**ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05063.pdf) \u003Cbr> | ICCV | 2023-03-09 | [Github](https:\u002F\u002Fgithub.com\u002FMAEHCM\u002FICL-D3IE) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMILVLG\u002Fprophet.svg?style=social&label=Star) \u003Cbr> [**Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.01903.pdf) \u003Cbr> | CVPR | 2023-03-03 | [Github](https:\u002F\u002Fgithub.com\u002FMILVLG\u002Fprophet) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Fvisprog.svg?style=social&label=Star) \u003Cbr> [**Visual Programming: Compositional visual reasoning without training**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FGupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf) \u003Cbr> | CVPR | 2022-11-18 | [Github](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fvisprog) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FPICa.svg?style=social&label=Star) \u003Cbr> [**An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA**](https:\u002F\u002Fojs.aaai.org\u002Findex.php\u002FAAAI\u002Farticle\u002Fdownload\u002F20215\u002F19974) \u003Cbr> | AAAI | 2022-06-28 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPICa) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmlfoundations\u002Fopen_flamingo.svg?style=social&label=Star) \u003Cbr> [**Flamingo: a Visual Language Model for Few-Shot Learning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.14198.pdf) \u003Cbr> | NeurIPS | 2022-04-29 | [Github](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_flamingo) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fdhansmair\u002Fflamingo-mini-cap) | \n| [**Multimodal Few-Shot Learning with Frozen Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.13884.pdf) | NeurIPS | 2021-06-25 | - | - |\n\n\n## Multimodal Chain-of-Thought\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdongyh20\u002FInsight-V.svg?style=social&label=Star) \u003Cbr> [**Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.14432) \u003Cbr> | arXiv | 2024-11-21 | [Github](https:\u002F\u002Fgithub.com\u002Fdongyh20\u002FInsight-V) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fggg0919\u002Fcantor.svg?style=social&label=Star) \u003Cbr> [**Cantor: Inspiring Multimodal Chain-of-Thought of MLLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.16033.pdf) \u003Cbr> | arXiv | 2024-04-24 | [Github](https:\u002F\u002Fgithub.com\u002Fggg0919\u002Fcantor) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepcs233\u002FVisual-CoT.svg?style=social&label=Star) \u003Cbr> [**Visual 
CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.16999.pdf) \u003Cbr> | arXiv | 2024-03-25 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepcs233\u002FVisual-CoT) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fchancharikmitra\u002FCCoT.svg?style=social&label=Star) \u003Cbr> [**Compositional Chain-of-Thought Prompting for Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17076) \u003Cbr> | CVPR | 2023-11-27 | [Github](https:\u002F\u002Fgithub.com\u002Fchancharikmitra\u002FCCoT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSooLab\u002FDDCOT.svg?style=social&label=Star) \u003Cbr> [**DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.16436.pdf) \u003Cbr> | NeurIPS | 2023-10-25 | [Github](https:\u002F\u002Fgithub.com\u002FSooLab\u002FDDCOT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshikras\u002Fshikra.svg?style=social&label=Star) \u003Cbr> [**Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15195.pdf) \u003Cbr> | arXiv | 2023-06-27 | [Github](https:\u002F\u002Fgithub.com\u002Fshikras\u002Fshikra) | [Demo](http:\u002F\u002Fdemo.zhaozhang.net:7860\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FzeroQiaoba\u002FExplainable-Multimodal-Emotion-Reasoning.svg?style=social&label=Star) \u003Cbr> [**Explainable Multimodal Emotion Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15401.pdf) \u003Cbr> | arXiv | 2023-06-27 | [Github](https:\u002F\u002Fgithub.com\u002FzeroQiaoba\u002FExplainable-Multimodal-Emotion-Reasoning) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEmbodiedGPT\u002FEmbodiedGPT_Pytorch.svg?style=social&label=Star) \u003Cbr> [**EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.15021.pdf) \u003Cbr> | arXiv | 2023-05-24 | [Github](https:\u002F\u002Fgithub.com\u002FEmbodiedGPT\u002FEmbodiedGPT_Pytorch) | - | \n| [**Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13903.pdf) | arXiv | 2023-05-23 | - | - |\n| [**T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.03453.pdf) | arXiv | 2023-05-05 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fttengwang\u002FCaption-Anything.svg?style=social&label=Star) \u003Cbr> [**Caption Anything: Interactive Image Description with Diverse Multimodal Controls**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.02677.pdf) \u003Cbr> | arXiv | 2023-05-04 | [Github](https:\u002F\u002Fgithub.com\u002Fttengwang\u002FCaption-Anything) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTencentARC\u002FCaption-Anything) |\n| [**Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.02317.pdf) | arXiv | 2023-05-03 | [Coming soon](https:\u002F\u002Fgithub.com\u002Fdannyrose30\u002FVCOT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flupantech\u002Fchameleon-llm.svg?style=social&label=Star) \u003Cbr> [**Chameleon: 
Plug-and-Play Compositional Reasoning with Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.09842.pdf) \u003Cbr> | arXiv | 2023-04-19 | [Github](https:\u002F\u002Fgithub.com\u002Flupantech\u002Fchameleon-llm) | [Demo](https:\u002F\u002Fchameleon-llm.github.io\u002F) | \n| [**Chain of Thought Prompt Tuning in Vision Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.07919.pdf) | arXiv | 2023-04-16 | [Coming soon]() | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMM-REACT.svg?style=social&label=Star) \u003Cbr> [**MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.11381.pdf) \u003Cbr> | arXiv | 2023-03-20 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMM-REACT) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft-cognitive-service\u002Fmm-react) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FTaskMatrix.svg?style=social&label=Star) \u003Cbr> [**Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.04671.pdf) \u003Cbr> | arXiv | 2023-03-08 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FTaskMatrix) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002Fvisual_chatgpt) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Famazon-science\u002Fmm-cot.svg?style=social&label=Star) \u003Cbr> [**Multimodal Chain-of-Thought Reasoning in Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00923.pdf) \u003Cbr> | arXiv | 2023-02-02 | [Github](https:\u002F\u002Fgithub.com\u002Famazon-science\u002Fmm-cot) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Fvisprog.svg?style=social&label=Star) \u003Cbr> [**Visual Programming: Compositional visual reasoning without training**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FGupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf) \u003Cbr> | CVPR | 2022-11-18 | [Github](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fvisprog) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flupantech\u002FScienceQA.svg?style=social&label=Star) \u003Cbr> [**Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2022\u002Ffile\u002F11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf) \u003Cbr> | NeurIPS | 2022-09-20 | [Github](https:\u002F\u002Fgithub.com\u002Flupantech\u002FScienceQA) | - |\n\n\n## LLM-Aided Visual Reasoning\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyhy-2000\u002FVideoDeepResearch.svg?style=social&label=Star) \u003Cbr> [**VideoDeepResearch: Long Video Understanding With Agentic Tool Using**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.10821) \u003Cbr> | arXiv | 2025-06-12 | [Github](https:\u002F\u002Fgithub.com\u002Fyhy-2000\u002FVideoDeepResearch) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLaVi-Lab\u002FVisual-Table.svg?style=social&label=Star) \u003Cbr> [**Beyond Embeddings: The Promise of Visual Table in Multi-Modal 
Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.18252.pdf) \u003Cbr> | arXiv | 2024-03-27 | [Github](https:\u002F\u002Fgithub.com\u002FLaVi-Lab\u002FVisual-Table) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpenghao-wu\u002Fvstar.svg?style=social&label=Star) \u003Cbr> [**V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14135.pdf) \u003Cbr> | arXiv | 2023-12-21 | [Github](https:\u002F\u002Fgithub.com\u002Fpenghao-wu\u002Fvstar) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLaVA-VL\u002FLLaVA-Interactive-Demo.svg?style=social&label=Star) \u003Cbr> [**LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.00571.pdf) \u003Cbr> | arXiv | 2023-11-01 | [Github](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Interactive-Demo) | [Demo](https:\u002F\u002F6dd3-20-163-117-69.ngrok-free.app\u002F) |\n| [**MM-VID: Advancing Video Understanding with GPT-4V(vision)**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.19773.pdf) | arXiv | 2023-10-30 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FControlLLM.svg?style=social&label=Star) \u003Cbr> [**ControlLLM: Augment Language Models with Tools by Searching on Graphs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.17796.pdf) \u003Cbr> | arXiv | 2023-10-26 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FControlLLM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FWoodpecker.svg?style=social&label=Star) \u003Cbr> [**Woodpecker: Hallucination Correction for Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.16045.pdf) \u003Cbr> | arXiv | 2023-10-24 | [Github](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FWoodpecker) | [Demo](https:\u002F\u002Fdeb6a97bae6fab67ae.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmindagent\u002Fmindagent.svg?style=social&label=Star) \u003Cbr> [**MindAgent: Emergent Gaming Interaction**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.09971.pdf) \u003Cbr> | arXiv | 2023-09-18 | [Github](https:\u002F\u002Fgithub.com\u002Fmindagent\u002Fmindagent) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FContextualAI\u002Flens.svg?style=social&label=Star) \u003Cbr> [**Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.16410.pdf) \u003Cbr> | arXiv | 2023-06-28 | [Github](https:\u002F\u002Fgithub.com\u002FContextualAI\u002Flens) | [Demo](https:\u002F\u002Flens.contextual.ai\u002F) |\n| [**Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.11732.pdf) | arXiv | 2023-06-15 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002Fassistgpt.svg?style=social&label=Star) \u003Cbr> [**AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.08640.pdf) \u003Cbr> | arXiv | 2023-06-14 | [Github](https:\u002F\u002Fgithub.com\u002Fshowlab\u002Fassistgpt) | - |\n| [**AVIS: Autonomous Visual Information Seeking with Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.08129.pdf) | arXiv | 2023-06-13 | - | - 
|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FStevenGrove\u002FGPT4Tools.svg?style=social&label=Star) \u003Cbr> [**GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.18752.pdf) \u003Cbr> | arXiv | 2023-05-30 | [Github](https:\u002F\u002Fgithub.com\u002FStevenGrove\u002FGPT4Tools) | [Demo](https:\u002F\u002Fc60eb7e9400930f31b.gradio.live\u002F) | \n| [**Mindstorms in Natural Language-Based Societies of Mind**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.17066.pdf) | arXiv | 2023-05-26 | - | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fweixi-feng\u002FLayoutGPT.svg?style=social&label=Star) \u003Cbr> [**LayoutGPT: Compositional Visual Planning and Generation with Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.15393.pdf) \u003Cbr> | arXiv | 2023-05-24 | [Github](https:\u002F\u002Fgithub.com\u002Fweixi-feng\u002FLayoutGPT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHxyou\u002FIdealGPT.svg?style=social&label=Star) \u003Cbr> [**IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14985.pdf) \u003Cbr> | arXiv | 2023-05-24 | [Github](https:\u002F\u002Fgithub.com\u002FHxyou\u002FIdealGPT) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmatrix-alpha\u002FAccountable-Textual-Visual-Chat.svg?style=social&label=Star) \u003Cbr> [**Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05983.pdf) \u003Cbr> | arXiv | 2023-05-10 | [Github](https:\u002F\u002Fgithub.com\u002Fmatrix-alpha\u002FAccountable-Textual-Visual-Chat) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fttengwang\u002FCaption-Anything.svg?style=social&label=Star) \u003Cbr> [**Caption Anything: Interactive Image Description with Diverse Multimodal Controls**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.02677.pdf) \u003Cbr> | arXiv | 2023-05-04 | [Github](https:\u002F\u002Fgithub.com\u002Fttengwang\u002FCaption-Anything) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTencentARC\u002FCaption-Anything) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flupantech\u002Fchameleon-llm.svg?style=social&label=Star) \u003Cbr> [**Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.09842.pdf) \u003Cbr> | arXiv | 2023-04-19 | [Github](https:\u002F\u002Fgithub.com\u002Flupantech\u002Fchameleon-llm) | [Demo](https:\u002F\u002Fchameleon-llm.github.io\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FJARVIS.svg?style=social&label=Star) \u003Cbr> [**HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.17580.pdf) \u003Cbr> | arXiv | 2023-03-30 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FJARVIS) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FHuggingGPT) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMM-REACT.svg?style=social&label=Star) \u003Cbr> [**MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.11381.pdf) \u003Cbr> | arXiv | 2023-03-20 | 
[Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMM-REACT) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft-cognitive-service\u002Fmm-react) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcvlab-columbia\u002Fviper.svg?style=social&label=Star) \u003Cbr> [**ViperGPT: Visual Inference via Python Execution for Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.08128.pdf) \u003Cbr> | arXiv | 2023-03-14 | [Github](https:\u002F\u002Fgithub.com\u002Fcvlab-columbia\u002Fviper) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FChatCaptioner.svg?style=social&label=Star) \u003Cbr> [**ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.06594.pdf) \u003Cbr> | arXiv | 2023-03-12 | [Github](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FChatCaptioner) | Local Demo |\n| [**ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05063.pdf) | ICCV | 2023-03-09 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FTaskMatrix.svg?style=social&label=Star) \u003Cbr> [**Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.04671.pdf) \u003Cbr> | arXiv | 2023-03-08 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FTaskMatrix) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002Fvisual_chatgpt) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZrrSkywalker\u002FCaFo.svg?style=social&label=Star) \u003Cbr> [**Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.02151.pdf) \u003Cbr> | CVPR | 2023-03-03 | [Github](https:\u002F\u002Fgithub.com\u002FZrrSkywalker\u002FCaFo) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsalesforce\u002FLAVIS.svg?style=social&label=Star) \u003Cbr> [**From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10846.pdf) \u003Cbr> | CVPR | 2022-12-21 | [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fimg2llm-vqa) | [Demo](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsalesforce\u002FLAVIS\u002Fblob\u002Fmain\u002Fprojects\u002Fimg2llm-vqa\u002Fimg2llm_vqa.ipynb) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvishaal27\u002FSuS-X.svg?style=social&label=Star) \u003Cbr> [**SuS-X: Training-Free Name-Only Transfer of Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.16198.pdf) \u003Cbr> | arXiv | 2022-11-28 | [Github](https:\u002F\u002Fgithub.com\u002Fvishaal27\u002FSuS-X) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangyangyang127\u002FPointCLIP_V2.svg?style=social&label=Star) \u003Cbr> [**PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11682.pdf) \u003Cbr> | CVPR | 2022-11-21 | [Github](https:\u002F\u002Fgithub.com\u002Fyangyangyang127\u002FPointCLIP_V2) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Fvisprog.svg?style=social&label=Star) \u003Cbr> [**Visual Programming: 
Compositional visual reasoning without training**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FGupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf) \u003Cbr> | CVPR | 2022-11-18 | [Github](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fvisprog) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle-research\u002Fgoogle-research.svg?style=social&label=Star) \u003Cbr> [**Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00598.pdf) \u003Cbr> | arXiv | 2022-04-01 | [Github](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002Ftree\u002Fmaster\u002Fsocraticmodels) | - |\n\n\n## Foundation Models\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| [**Introducing GPT-5**](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-gpt-5\u002F) | OpenAI | 2025-08-07 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideoLLaMA3.svg?style=social&label=Star) \u003Cbr> [**VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.13106) \u003Cbr> | arXiv | 2025-01-22 | [Github](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flixin4ever\u002FVideoLLaMA3) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu3.svg?style=social&label=Star) \u003Cbr> [**Emu3: Next-Token Prediction is All You Need**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.18869) \u003Cbr> | arXiv | 2024-09-27 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu3) | Local Demo |\n| [**Llama 3.2: Revolutionizing edge AI and vision with open, customizable models**](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fllama-3-2-connect-2024-vision-edge-mobile-devices\u002F) | Meta | 2024-09-25 | - | [Demo](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-3.2-11B-Vision-Instruct) | \n| [**Pixtral-12B**](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fpixtral-12b\u002F) | Mistral | 2024-09-17 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsalesforce\u002FLAVIS.svg?style=social&label=Star) \u003Cbr> [**xGen-MM (BLIP-3): A Family of Open Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.08872) \u003Cbr> | arXiv | 2024-08-16 | [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fxgen-mm) | - |\n| [**The Llama 3 Herd of Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.21783) | arXiv | 2024-07-31 | - | - |\n| [**Chameleon: Mixed-Modal Early-Fusion Foundation Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.09818) | arXiv | 2024-05-16 | - | - |\n| [**Hello GPT-4o**](https:\u002F\u002Fopenai.com\u002Findex\u002Fhello-gpt-4o\u002F) | OpenAI | 2024-05-13 | - | - | \n| [**The Claude 3 Model Family: Opus, Sonnet, Haiku**](https:\u002F\u002Fwww-cdn.anthropic.com\u002Fde8ba9b01c9ab7cbabf5c33b80b7bbc618857627\u002FModel_Card_Claude_3.pdf) | Anthropic | 2024-03-04 | - | - |\n| [**Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context**](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_v1_5_report.pdf) | Google | 2024-02-15 | - | - |\n| [**Gemini: A Family of Highly 
Capable Multimodal Models**](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_1_report.pdf) | Google | 2023-12-06 | - | - |\n| [**Fuyu-8B: A Multimodal Architecture for AI Agents**](https:\u002F\u002Fwww.adept.ai\u002Fblog\u002Ffuyu-8b) | Blog | 2023-10-17 | [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fadept\u002Ffuyu-8b) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fadept\u002Ffuyu-8b) \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmshukor\u002FUnIVAL.svg?style=social&label=Star) \u003Cbr> [**Unified Model for Image, Video, Audio and Language Tasks**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.16184.pdf) \u003Cbr> | arXiv | 2023-07-30 | [Github](https:\u002F\u002Fgithub.com\u002Fmshukor\u002FUnIVAL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmshukor\u002FUnIVAL) |\n| [**PaLI-3 Vision Language Models: Smaller, Faster, Stronger**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.09199.pdf) | arXiv | 2023-10-13 | - | - |\n| [**GPT-4V(ision) System Card**](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002FGPTV_System_Card.pdf) | OpenAI | 2023-09-25 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjy0205\u002FLaVIT.svg?style=social&label=Star) \u003Cbr> [**Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.04669.pdf) \u003Cbr> | arXiv | 2023-09-09 | [Github](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FLaVIT) | - |\n| [**Multimodal Foundation Models: From Specialists to General-Purpose Assistants**](https:\u002F\u002Fbrowse.arxiv.org\u002Fpdf\u002F2309.10020.pdf) | arXiv | 2023-09-18 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyiren-jian\u002FBLIText.svg?style=social&label=Star) \u003Cbr> [**Bootstrapping Vision-Language Learning with Decoupled Language Pre-training**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.07063.pdf) \u003Cbr> | NeurIPS | 2023-07-13 | [Github](https:\u002F\u002Fgithub.com\u002Fyiren-jian\u002FBLIText) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu.svg?style=social&label=Star) \u003Cbr> [**Generative Pretraining in Multimodality**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.05222.pdf) \u003Cbr> | arXiv | 2023-07-11 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu) | [Demo](http:\u002F\u002F218.91.113.230:9002\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social&label=Star) \u003Cbr> [**Kosmos-2: Grounding Multimodal Large Language Models to the World**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14824.pdf) \u003Cbr> | arXiv | 2023-06-26 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Fkosmos-2) | [Demo](https:\u002F\u002Faka.ms\u002Fkosmos-2-demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVPGTrans\u002FVPGTrans.svg?style=social&label=Star) \u003Cbr> [**Transfer Visual Prompt Generator across LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.01278.pdf) \u003Cbr> | arXiv | 2023-05-02 | [Github](https:\u002F\u002Fgithub.com\u002FVPGTrans\u002FVPGTrans) | [Demo](https:\u002F\u002F3fc7715dbc44234a7f.gradio.live\u002F) | \n| [**GPT-4 Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.08774.pdf) | arXiv | 2023-03-15 | - | - |\n| [**PaLM-E: An Embodied Multimodal Language 
Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.03378.pdf) | arXiv | 2023-03-06 | - | [Demo](https:\u002F\u002Fpalm-e.github.io\u002F#demo) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002Fprismer.svg?style=social&label=Star) \u003Cbr> [**Prismer: A Vision-Language Model with An Ensemble of Experts**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.02506.pdf) \u003Cbr> | arXiv | 2023-03-04 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fprismer) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Florenmt\u002Fprismer) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social&label=Star) \u003Cbr> [**Language Is Not All You Need: Aligning Perception with Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.14045.pdf) \u003Cbr> | arXiv | 2023-02-27 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsalesforce\u002FLAVIS.svg?style=social&label=Star) \u003Cbr> [**BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12597.pdf) \u003Cbr> | arXiv | 2023-01-30 | [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fblip2) | [Demo](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsalesforce\u002FLAVIS\u002Fblob\u002Fmain\u002Fexamples\u002Fblip2_instructed_generation.ipynb) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvimalabs\u002FVIMA.svg?style=social&label=Star) \u003Cbr> [**VIMA: General Robot Manipulation with Multimodal Prompts**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03094.pdf) \u003Cbr> | ICML | 2022-10-06 | [Github](https:\u002F\u002Fgithub.com\u002Fvimalabs\u002FVIMA) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMineDojo\u002FMineDojo.svg?style=social&label=Star) \u003Cbr> [**MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08853.pdf) \u003Cbr> | NeurIPS | 2022-06-17 | [Github](https:\u002F\u002Fgithub.com\u002FMineDojo\u002FMineDojo) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshizhediao\u002FDaVinci.svg?style=social&label=Star) \u003Cbr> [**Write and Paint: Generative Vision-Language Models are Unified Modal Learners**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07699.pdf) \u003Cbr> | ICLR | 2022-06-15 | [Github](https:\u002F\u002Fgithub.com\u002Fshizhediao\u002FDaVinci) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social&label=Star) \u003Cbr> [**Language Models are General-Purpose Interfaces**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06336.pdf) \u003Cbr> | arXiv | 2022-06-13 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm) | - |\n\n## Evaluation\n|  Title  |   Venue  |   Date   |   Page   |\n|:--------|:--------:|:--------:|:--------:|\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvision-x-nyu\u002Fthinking-in-space?style=social&label=Star) \u003Cbr> [**Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.14171) \u003Cbr> | arXiv | 2024-12-18 | [Github](https:\u002F\u002Fgithub.com\u002Fvision-x-nyu\u002Fthinking-in-space) |\n| 
![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flerogo\u002FMMGenBench?style=social&label=Star) \u003Cbr> [**MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.14062) \u003Cbr> | arXiv | 2024-11-21 | [Github](https:\u002F\u002Fgithub.com\u002Flerogo\u002FMMGenBench) | \n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmultimodal-art-projection\u002FOmniBench?style=social&label=Star) \u003Cbr> [**OmniBench: Towards The Future of Universal Omni-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.15272) \u003Cbr> | arXiv | 2024-09-23 | [Github](https:\u002F\u002Fgithub.com\u002Fmultimodal-art-projection\u002FOmniBench) | \n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfzhang114\u002FMME-RealWorld?style=social&label=Star) \u003Cbr> [**MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.13257) \u003Cbr> | arXiv | 2024-08-23 | [Github](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FMME-RealWorld) | \n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fguoyang9\u002FUNK-VQA?style=social&label=Star) \u003Cbr> [**UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.10942) \u003Cbr> | TPAMI | 2023-10-17 | [Github](https:\u002F\u002Fgithub.com\u002Fguoyang9\u002FUNK-VQA) |\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fchenllliang\u002FMMEvalPro?style=social&label=Star) \u003Cbr> [**MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.00468) \u003Cbr> | arXiv | 2024-06-29 | [Github](https:\u002F\u002Fgithub.com\u002Fchenllliang\u002FMMEvalPro) |\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMBZUAI-LLM\u002Fweb2code?style=social&label=Star) \u003Cbr> [**Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.20098) \u003Cbr> | arXiv | 2024-06-28 | [Github](https:\u002F\u002Fgithub.com\u002FMBZUAI-LLM\u002Fweb2code) |\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fprinceton-nlp\u002FCharXiv?style=social&label=Star) \u003Cbr> [**CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.18521) \u003Cbr> | arXiv | 2024-06-26 | [Github](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FCharXiv) |\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChartMimic\u002FChartMimic?style=social&label=Star) \u003Cbr> [**ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.09961) \u003Cbr> | arXiv | 2024-04-15 | [Github](https:\u002F\u002Fgithub.com\u002FChartMimic\u002FChartMimic) | \n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FVideo-MME?style=social&label=Star) \u003Cbr> [**Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.21075) \u003Cbr> | arXiv | 2024-05-31 | [Github](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME) | \n| 
![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsail-sg\u002FMMCBench?style=social&label=Star) \u003Cbr> [**Benchmarking Large Multimodal Models against Common Corruptions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.11943.pdf) \u003Cbr> | NAACL | 2024-01-22 | [Github](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002FMMCBench) |\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftsb0601\u002FMMVP?style=social&label=Star) \u003Cbr> [**Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.06209.pdf) \u003Cbr> | arXiv | 2024-01-11 | [Github](https:\u002F\u002Fgithub.com\u002Ftsb0601\u002FMMVP) | \n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models?style=social&label=Star) \u003Cbr> [**A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.12436.pdf) \u003Cbr> | arXiv | 2023-12-19 | [Github](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models) | \n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIFEG\u002FBenchLMM?style=social&label=Star) \u003Cbr> [**BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02896.pdf) \u003Cbr> | arXiv | 2023-12-05 | [Github](https:\u002F\u002Fgithub.com\u002FAIFEG\u002FBenchLMM) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUCSC-VLAA\u002Fvllm-safety-benchmark.svg?style=social&label=Star) \u003Cbr> [**How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16101.pdf) \u003Cbr> | arXiv | 2023-11-27 | [Github](https:\u002F\u002Fgithub.com\u002FUCSC-VLAA\u002Fvllm-safety-benchmark) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjonathan-roberts1\u002Fcharting-new-territories.svg?style=social&label=Star) \u003Cbr> [**Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.14656.pdf) \u003Cbr> | arXiv | 2023-11-24 | [Github](https:\u002F\u002Fgithub.com\u002Fjonathan-roberts1\u002Fcharting-new-territories) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFreedomIntelligence\u002FMLLM-Bench?style=social&label=Star) \u003Cbr> [**MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.13951) \u003Cbr> | arXiv | 2023-11-23 | [Github](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FMLLM-Bench) |\n| [**VLM-Eval: A General Evaluation on Video Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.11865.pdf) | arXiv | 2023-11-20 | [Coming soon]() |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgzcch\u002FBingo.svg?style=social&label=Star) \u003Cbr> [**Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03287.pdf) \u003Cbr> | arXiv | 2023-11-06 | [Github](https:\u002F\u002Fgithub.com\u002Fgzcch\u002FBingo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPJLab-ADG\u002FGPT4V-AD-Exploration.svg?style=social&label=Star) \u003Cbr> [**On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous 
Driving**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.05332.pdf) \u003Cbr> | arXiv | 2023-11-09 | [Github](https:\u002F\u002Fgithub.com\u002FPJLab-ADG\u002FGPT4V-AD-Exploration) |\n| [**Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.02782.pdf) | arXiv | 2023-11-05 | - |\n| [**A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.20381.pdf) | arXiv | 2023-10-31 | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falbertwy\u002FGPT-4V-Evaluation.svg?style=social&label=Star) \u003Cbr> [**An Early Evaluation of GPT-4V(ision)**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.16534.pdf) \u003Cbr> | arXiv | 2023-10-25 | [Github](https:\u002F\u002Fgithub.com\u002Falbertwy\u002FGPT-4V-Evaluation) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSCUT-DLVCLab\u002FGPT-4V_OCR.svg?style=social&label=Star) \u003Cbr> [**Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.16809.pdf) \u003Cbr> | arXiv | 2023-10-25 | [Github](https:\u002F\u002Fgithub.com\u002FSCUT-DLVCLab\u002FGPT-4V_OCR) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftianyi-lab\u002FHallusionBench.svg?style=social&label=Star) \u003Cbr> [**HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.14566.pdf) \u003Cbr> | CVPR | 2023-10-23 | [Github](https:\u002F\u002Fgithub.com\u002Ftianyi-lab\u002FHallusionBench) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flupantech\u002FMathVista.svg?style=social&label=Star) \u003Cbr> [**MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.02255.pdf) \u003Cbr> | ICLR | 2023-10-03 | [Github](https:\u002F\u002Fgithub.com\u002Flupantech\u002FMathVista) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fys-zong\u002FFoolyourVLLMs.svg?style=social&label=Star) \u003Cbr> [**Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01651.pdf) \u003Cbr> | arXiv | 2023-10-02 | [Github](https:\u002F\u002Fgithub.com\u002Fys-zong\u002FFoolyourVLLMs) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmshukor\u002FEvALign-ICL.svg?style=social&label=Star) \u003Cbr> [**Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00647.pdf) \u003Cbr> | arXiv | 2023-10-01 | [Github](https:\u002F\u002Fgithub.com\u002Fmshukor\u002FEvALign-ICL) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzjunlp\u002FEasyEdit.svg?style=social&label=Star) \u003Cbr> [**Can We Edit Multimodal Large Language Models?**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.08475.pdf) \u003Cbr> | arXiv | 2023-10-12 | [Github](https:\u002F\u002Fgithub.com\u002Fzjunlp\u002FEasyEdit) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliaoning97\u002FREVO-LION.svg?style=social&label=Star) \u003Cbr> [**REVO-LION: Evaluating and Refining 
Vision-Language Instruction Tuning Datasets**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.06594.pdf) \u003Cbr> | arXiv | 2023-10-10 | [Github](https:\u002F\u002Fgithub.com\u002Fliaoning97\u002FREVO-LION) |\n| [**The Dawn of LMMs: Preliminary Explorations with GPT-4V(vision)**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.17421.pdf) | arXiv | 2023-09-29 | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOFA-Sys\u002FTouchStone.svg?style=social&label=Star) \u003Cbr> [**TouchStone: Evaluating Vision-Language Models by Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16890.pdf) \u003Cbr>| arXiv | 2023-08-31 | [Github](https:\u002F\u002Fgithub.com\u002FOFA-Sys\u002FTouchStone) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHYPJUDY\u002FSparkles.svg?style=social&label=Star) \u003Cbr> [**✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16463.pdf) \u003Cbr> | arXiv | 2023-08-31 | [Github](https:\u002F\u002Fgithub.com\u002FHYPJUDY\u002FSparkles#sparkleseval) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffindalexli\u002FSciGraphQA.svg?style=social&label=Star) \u003Cbr> [**SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.03349.pdf) \u003Cbr> | arXiv | 2023-08-07 | [Github](https:\u002F\u002Fgithub.com\u002Ffindalexli\u002FSciGraphQA) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FMulti-Modality-Arena.svg?style=social&label=Star) \u003Cbr> [**Tiny LVLM-eHub: Early Multimodal Experiments with Bard**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.03729.pdf) \u003Cbr> | arXiv | 2023-08-07 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FMulti-Modality-Arena) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyuweihao\u002FMM-Vet.svg?style=social&label=Star) \u003Cbr> [**MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.02490.pdf) \u003Cbr> | arXiv | 2023-08-04 | [Github](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMM-Vet) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FSEED-Bench.svg?style=social&label=Star) \u003Cbr> [**SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.16125.pdf) \u003Cbr> | CVPR | 2023-07-30 | [Github](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FSEED-Bench) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopen-compass\u002FMMBench.svg?style=social&label=Star) \u003Cbr> [**MMBench: Is Your Multi-modal Model an All-around Player?**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.06281.pdf) \u003Cbr> | arXiv | 2023-07-12 | [Github](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FMMBench) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models.svg?style=social&label=Star) \u003Cbr> [**MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13394.pdf) \u003Cbr> | arXiv | 2023-06-23 | [Github](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FEvaluation) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FMulti-Modality-Arena.svg?style=social&label=Star) \u003Cbr> [**LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.09265.pdf) \u003Cbr> | arXiv | 2023-06-15 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FMulti-Modality-Arena) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenLAMM\u002FLAMM.svg?style=social&label=Star) \u003Cbr> [**LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.06687.pdf) \u003Cbr> | arXiv | 2023-06-11 | [Github](https:\u002F\u002Fgithub.com\u002FOpenLAMM\u002FLAMM#lamm-benchmark) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FM3Exam.svg?style=social&label=Star) \u003Cbr> [**M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05179.pdf) \u003Cbr> | arXiv | 2023-06-08 | [Github](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FM3Exam) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuliang-Liu\u002FMultimodalOCR.svg?style=social&label=Star) \u003Cbr> [**On The Hidden Mystery of OCR in Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.07895.pdf) \u003Cbr> | arXiv | 2023-05-13 | [Github](https:\u002F\u002Fgithub.com\u002FYuliang-Liu\u002FMultimodalOCR) | \n\n## Multimodal RLHF\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfzhang114\u002Fr1_reward.svg?style=social&label=Star) \u003Cbr> [**R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.02835) \u003Cbr> | arXiv | 2025-05-09 | [Github](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002Fr1_reward) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models.svg?style=social&label=Star) \u003Cbr> [**Aligning Multimodal LLM with Human Preference: A Survey**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.14504) \u003Cbr> | arXiv | 2025-03-23 | [Github](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FAlignment) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKwai-YuanQi\u002FMM-RLHF.svg?style=social&label=Star) \u003Cbr> [**MM-RLHF: The Next Step Forward in Multimodal LLM Alignment**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.10391) \u003Cbr> | arXiv | 2025-02-14 | [Github](https:\u002F\u002Fgithub.com\u002FKwai-YuanQi\u002FMM-RLHF) | - |\n| [**Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.06682) | arXiv | 2024-10-09 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvlf-silkie\u002FVLFeedback.svg?style=social&label=Star) \u003Cbr> [**Silkie: Preference Distillation for Large Visual Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10665.pdf) \u003Cbr> | arXiv | 2023-12-17 | [Github](https:\u002F\u002Fgithub.com\u002Fvlf-silkie\u002FVLFeedback) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRLHF-V\u002FRLHF-V.svg?style=social&label=Star) \u003Cbr> [**RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00849.pdf) \u003Cbr> | arXiv | 2023-12-01 | [Github](https:\u002F\u002Fgithub.com\u002FRLHF-V\u002FRLHF-V) | [Demo](http:\u002F\u002F120.92.209.146:8081\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fllava-rlhf\u002FLLaVA-RLHF.svg?style=social&label=Star) \u003Cbr> [**Aligning Large Multimodal Models with Factually Augmented RLHF**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.14525.pdf) \u003Cbr> | arXiv | 2023-09-25 | [Github](https:\u002F\u002Fgithub.com\u002Fllava-rlhf\u002FLLaVA-RLHF) | [Demo](http:\u002F\u002Fpitt.lti.cs.cmu.edu:7890\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwangclnlp\u002FVision-LLM-Alignment.svg?style=social&label=Star) \u003Cbr> [**RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.12109) \u003Cbr> | arXiv | 2024-08-22 | [Github](https:\u002F\u002Fgithub.com\u002Fwangclnlp\u002FVision-LLM-Alignment) | - |\n\n## Others\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftingyu215\u002FTS-LLaVA.svg?style=social&label=Star) \u003Cbr> [**TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.11066) \u003Cbr> | arXiv | 2024-11-17 | [Github](https:\u002F\u002Fgithub.com\u002Ftingyu215\u002FTS-LLaVA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fys-zong\u002FVLGuard.svg?style=social&label=Star) \u003Cbr> [**Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.02207.pdf) \u003Cbr> | arXiv | 2024-02-03 | [Github](https:\u002F\u002Fgithub.com\u002Fys-zong\u002FVLGuard) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSHI-Labs\u002FVCoder.svg?style=social&label=Star) \u003Cbr> [**VCoder: Versatile Vision Encoders for Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14233.pdf) \u003Cbr> | arXiv | 2023-12-21 | [Github](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FVCoder) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FPrompt-Highlighter.svg?style=social&label=Star) \u003Cbr> [**Prompt Highlighter: Interactive Control for Multi-Modal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.04302.pdf) \u003Cbr> | arXiv | 2023-12-07 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FPrompt-Highlighter) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIlab-CVC\u002FSEED.svg?style=social&label=Star) \u003Cbr> [**Planting a SEED of Vision in Large Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.08041.pdf) \u003Cbr> | arXiv | 2023-07-16 | [Github](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FSEED) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuawei-noah\u002FEfficient-Computing.svg?style=social&label=Star) \u003Cbr> [**Can Large Pre-trained Models Help Vision Models on Perception 
Tasks?**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.00693.pdf) \u003Cbr> | arXiv | 2023-06-01 | [Github](https:\u002F\u002Fgithub.com\u002Fhuawei-noah\u002FEfficient-Computing\u002Ftree\u002Fmaster\u002FGPT4Image\u002F) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyuhangzang\u002FContextDET.svg?style=social&label=Star) \u003Cbr> [**Contextual Object Detection with Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.18279.pdf) \u003Cbr> | arXiv | 2023-05-29 | [Github](https:\u002F\u002Fgithub.com\u002Fyuhangzang\u002FContextDET) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fyuhangzang\u002FContextDet-Demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkohjingyu\u002Fgill.svg?style=social&label=Star) \u003Cbr> [**Generating Images with Multimodal Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.17216.pdf) \u003Cbr> | arXiv | 2023-05-26 | [Github](https:\u002F\u002Fgithub.com\u002Fkohjingyu\u002Fgill) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyunqing-me\u002FAttackVLM.svg?style=social&label=Star) \u003Cbr> [**On Evaluating Adversarial Robustness of Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.16934.pdf) \u003Cbr> | arXiv | 2023-05-26 | [Github](https:\u002F\u002Fgithub.com\u002Fyunqing-me\u002FAttackVLM) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkohjingyu\u002Ffromage.svg?style=social&label=Star) \u003Cbr> [**Grounding Language Models to Images for Multimodal Inputs and Outputs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13823.pdf) \u003Cbr> | ICML | 2023-01-31 | [Github](https:\u002F\u002Fgithub.com\u002Fkohjingyu\u002Ffromage) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fjykoh\u002Ffromage) |\n\n---\n\n# Awesome Datasets\n\n## Datasets of Pre-Training for Alignment\n| Name | Paper | Type | Modalities |\n|:-----|:-----:|:----:|:----------:|\n| **ShareGPT4Video** | [ShareGPT4Video: Improving Video Understanding and Generation with Better Captions](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.04325v1) | Caption | Video-Text |\n| **COYO-700M** | [COYO-700M: Image-Text Pair Dataset](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fcoyo-dataset\u002F) | Caption | Image-Text |\n| **ShareGPT4V** | [ShareGPT4V: Improving Large Multi-Modal Models with Better Captions](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.12793.pdf) | Caption | Image-Text |\n| **AS-1B** | [The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.01907.pdf) | Hybrid | Image-Text |\n| **InternVid** | [InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.06942.pdf) | Caption | Video-Text |\n| **MS-COCO** | [Microsoft COCO: Common Objects in Context](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1405.0312.pdf) | Caption | Image-Text |\n| **SBU Captions** | [Im2Text: Describing Images Using 1 Million Captioned Photographs](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2011\u002Ffile\u002F5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf) | Caption | Image-Text |\n| **Conceptual Captions** | [Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning](https:\u002F\u002Faclanthology.org\u002FP18-1238.pdf) | Caption | Image-Text |\n| **LAION-400M** | [LAION-400M: Open 
Dataset of CLIP-Filtered 400 Million Image-Text Pairs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.02114.pdf) | Caption | Image-Text |\n| **VG Captions** |[Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations](https:\u002F\u002Flink.springer.com\u002Fcontent\u002Fpdf\u002F10.1007\u002Fs11263-016-0981-7.pdf) | Caption | Image-Text |\n| **Flickr30k** | [Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_iccv_2015\u002Fpapers\u002FPlummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf) | Caption | Image-Text |\n| **AI-Caps** | [AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1711.06475.pdf) | Caption | Image-Text | \n| **Wukong Captions** | [Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2022\u002Ffile\u002Fa90b9a09a6ee43d6631cf42e225d73b4-Paper-Datasets_and_Benchmarks.pdf) | Caption | Image-Text |\n| **GRIT** | [Kosmos-2: Grounding Multimodal Large Language Models to the World](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14824.pdf) | Caption | Image-Text-Bounding-Box |\n| **Youku-mPLUG** | [Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.04362.pdf) | Caption | Video-Text |\n| **MSR-VTT** | [MSR-VTT: A Large Video Description Dataset for Bridging Video and Language](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_cvpr_2016\u002Fpapers\u002FXu_MSR-VTT_A_Large_CVPR_2016_paper.pdf) | Caption | Video-Text |\n| **Webvid10M** | [Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.00650.pdf) | Caption | Video-Text |\n| **WavCaps** | [WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.17395.pdf) | Caption | Audio-Text |\n| **AISHELL-1** | [AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1709.05522.pdf) | ASR | Audio-Text |\n| **AISHELL-2** | [AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1808.10583.pdf) | ASR | Audio-Text |\n| **VSDial-CN** | [X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04160.pdf) | ASR | Image-Audio-Text |\n\n\n## Datasets of Multimodal Instruction Tuning\n| Name | Paper | Link | Notes |\n|:-----|:-----:|:----:|:-----:|\n| **Inst-IT Dataset** | [Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03565) | [Link](https:\u002F\u002Fgithub.com\u002Finst-it\u002Finst-it) | An instruction-tuning dataset which contains fine-grained multi-level annotations for 21k videos and 51k images |\n| **E.T. Instruct 164K** | [E.T. 
Bench: Towards Open-Ended Event-Level Video-Language Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.18111) | [Link](https:\u002F\u002Fgithub.com\u002FPolyU-ChenLab\u002FETBench) | An instruction-tuning dataset for time-sensitive video understanding |\n| **MSQA** | [Multi-modal Situated Reasoning in 3D Scenes](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.02389) | [Link](https:\u002F\u002Fmsr3d.github.io\u002F) | A large scale dataset for multi-modal situated reasoning in 3D scenes |\n| **MM-Evol** | [MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.05840) | [Link](https:\u002F\u002Fmmevol.github.io\u002Fhome_page.html) | An instruction dataset with rich diversity | \n| **UNK-VQA** | [UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.10942) | [Link](https:\u002F\u002Fgithub.com\u002Fguoyang9\u002FUNK-VQA) | A dataset designed to teach models to refrain from answering unanswerable questions |\n| **VEGA** | [VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.10228) | [Link](https:\u002F\u002Fgithub.com\u002Fzhourax\u002FVEGA) | A dataset for enhancing model capabilities in comprehension of interleaved information | \n| **ALLaVA-4V** | [ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11684.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FFreedomIntelligence\u002FALLaVA-4V) | Vision and language caption and instruction dataset generated by GPT4V | \n| **IDK** | [Visually Dehallucinative Instruction Generation: Know What You Don't Know](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.09717.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fncsoft\u002Fidk) | Dehallucinative visual instruction for \"I Know\" hallucination |\n| **CAP2QA** | [Visually Dehallucinative Instruction Generation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.08348.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fncsoft\u002Fcap2qa) | Image-aligned visual instruction dataset |\n| **M3DBench** | [M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10763.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FOpenM3D\u002FM3DBench) | A large-scale 3D instruction tuning dataset |\n| **ViP-LLaVA-Instruct** | [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00784.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmucai\u002FViP-LLaVA-Instruct) |  A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data |\n| **LVIS-Instruct4V** | [To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.07574.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FX2FD\u002FLVIS-Instruct4V) | A visual instruction dataset via self-instruction from GPT-4V |\n| **ComVint** | [What Makes for Good Visual Instructions? 
Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.01487.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FComVint#comvint-data) | A synthetic instruction dataset for complex visual reasoning |\n| **SparklesDialogue** | [✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16463.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FHYPJUDY\u002FSparkles#sparklesdialogue) | A machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns. |\n| **StableLLaVA** | [StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.10253v1.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Ficoz69\u002FStableLLAVA) | A cheap and effective approach to collect visual instruction tuning data |\n| **M-HalDetect** | [Detecting and Preventing Hallucinations in Large Vision Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.06394.pdf) | [Coming soon]() | A dataset used to train and benchmark models for hallucination detection and prevention | \n| **MGVLID** | [ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.09474.pdf) | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |\n| **BuboGPT** | [BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.08581.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmagicr\u002FBuboGPT) | A high-quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data |\n| **SVIT** | [SVIT: Scaling up Visual Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.04087.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBAAI\u002FSVIT) | A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs |\n| **mPLUG-DocOwl** | [mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02499.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-DocOwl\u002Ftree\u002Fmain\u002FDocLLM) | An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding | \n| **PF-1M** | [Visual Instruction Tuning with Polite Flamingo](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.01003.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fchendelong\u002FPF-1M\u002Ftree\u002Fmain) | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo. 
| \n| **ChartLlama** | [ChartLlama: A Multimodal LLM for Chart Understanding and Generation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16483.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flisten2you002\u002FChartLlama-Dataset) | A multi-modal instruction-tuning dataset for chart understanding and generation |\n| **LLaVAR** | [LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.17107.pdf) | [Link](https:\u002F\u002Fllavar.github.io\u002F#data) | A visual instruction-tuning dataset for Text-rich Image Understanding | \n| **MotionGPT** | [MotionGPT: Human Motion as a Foreign Language](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14795.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT) | A instruction-tuning dataset including multiple human motion-related tasks |\n| **LRV-Instruction** | [Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14565.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FFuxiaoLiu\u002FLRV-Instruction#visual-instruction-data-lrv-instruction) | Visual instruction tuning dataset for addressing hallucination issue | \n| **Macaw-LLM** | [Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.09093.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Flyuchenyang\u002FMacaw-LLM\u002Ftree\u002Fmain\u002Fdata) | A large-scale multi-modal instruction dataset in terms of multi-turn dialogue | \n| **LAMM-Dataset** | [LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.06687.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FOpenLAMM\u002FLAMM#lamm-dataset) | A comprehensive multi-modal instruction tuning dataset |\n| **Video-ChatGPT** | [Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05424.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT#video-instruction-dataset-open_file_folder) | 100K high-quality video instruction dataset | \n| **MIMIC-IT** | [MIMIC-IT: Multi-Modal In-Context Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05425.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter\u002Fblob\u002Fmain\u002Fmimic-it\u002FREADME.md) | Multimodal in-context instruction tuning |\n| **M\u003Csup>3\u003C\u002Fsup>IT** | [M\u003Csup>3\u003C\u002Fsup>IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.04387.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMMInstruction\u002FM3IT) | Large-scale, broad-coverage multimodal instruction tuning dataset | \n| **LLaVA-Med** | [LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.00890.pdf) | [Coming soon](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLaVA-Med#llava-med-dataset) | A large-scale, broad-coverage biomedical instruction-following dataset |\n| **GPT4Tools** | [GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.18752.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FStevenGrove\u002FGPT4Tools#dataset) | Tool-related instruction datasets |\n| **MULTIS** | [ChatBridge: Bridging Modalities 
with Large Language Model as a Language Catalyst](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.16103.pdf) | [Coming soon](https:\u002F\u002Fiva-chatbridge.github.io\u002F) | Multimodal instruction tuning dataset covering 16 multimodal tasks |\n| **DetGPT** | [DetGPT: Detect What You Need via Reasoning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14167.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FOptimalScale\u002FDetGPT\u002Ftree\u002Fmain\u002Fdataset) |  Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs|\n| **PMC-VQA** | [PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.10415.pdf) | [Coming soon](https:\u002F\u002Fxiaoman-zhang.github.io\u002FPMC-VQA\u002F) | Large-scale medical visual question-answering dataset |\n| **VideoChat** | [VideoChat: Chat-Centric Video Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.06355.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVideo\u002Ftree\u002Fmain\u002FData\u002Finstruction_data) | Video-centric multimodal instruction dataset |\n| **X-LLM** | [X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04160.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fphellonchen\u002FX-LLM) | Chinese multimodal instruction dataset |\n| **LMEye** | [LMEye: An Interactive Perception Network for Large Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.03701.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYunxinLi\u002FMultimodal_Insturction_Data_V2) | A multi-modal instruction-tuning dataset |\n| **cc-sbu-align** | [MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.10592.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FVision-CAIR\u002Fcc_sbu_align) | Multimodal aligned dataset for improving model's usability and generation's fluency |\n| **LLaVA-Instruct-150K** | [Visual Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.08485.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fliuhaotian\u002FLLaVA-Instruct-150K) | Multimodal instruction-following data generated by GPT|\n| **MultiInstruct** | [MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10773.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FVT-NLP\u002FMultiInstruct) | The first multimodal instruction tuning benchmark dataset |\n\n## Datasets of In-Context Learning\n| Name | Paper | Link | Notes |\n|:-----|:-----:|:----:|:-----:|\n| **MIC** | [MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.07915.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBleachNick\u002FMIC_full) | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs. 
|\n| **MIMIC-IT** | [MIMIC-IT: Multi-Modal In-Context Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05425.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter\u002Fblob\u002Fmain\u002Fmimic-it\u002FREADME.md) | Multimodal in-context instruction dataset|\n\n## Datasets of Multimodal Chain-of-Thought\n| Name | Paper | Link | Notes |\n|:-----|:-----:|:----:|:-----:|\n| **EMER** | [Explainable Multimodal Emotion Reasoning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15401.pdf) | [Coming soon](https:\u002F\u002Fgithub.com\u002FzeroQiaoba\u002FExplainable-Multimodal-Emotion-Reasoning) | A benchmark dataset for explainable emotion reasoning task |\n| **EgoCOT** | [EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.15021.pdf) | [Coming soon](https:\u002F\u002Fgithub.com\u002FEmbodiedGPT\u002FEmbodiedGPT_Pytorch) | Large-scale embodied planning dataset |\n| **VIP** | [Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13903.pdf) | [Coming soon]() | An inference-time dataset that can be used to evaluate VideoCOT |\n| **ScienceQA** | [Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2022\u002Ffile\u002F11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Flupantech\u002FScienceQA#ghost-download-the-dataset) | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains | \n\n## Datasets of Multimodal RLHF\n| Name | Paper | Link | Notes |\n|:-----|:-----:|:----:|:-----:|\n| **VLFeedback** | [Silkie: Preference Distillation for Large Visual Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10665.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMMInstruction\u002FVLFeedback) | A vision-language feedback dataset annotated by AI |\n\n## Benchmarks for Evaluation\n| Name | Paper | Link | Notes |\n|:-----|:-----:|:----:|:-----:|\n| **Inst-IT Bench** | [Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03565) | [Link](https:\u002F\u002Fgithub.com\u002Finst-it\u002Finst-it) | A benchmark to evaluate fine-grained instance-level understanding in images and videos |\n| **M\u003Csup>3\u003C\u002Fsup>CoT** | [M\u003Csup>3\u003C\u002Fsup>CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.16473) | [Link](https:\u002F\u002Fgithub.com\u002FLightChen233\u002FM3CoT) | A multi-domain, multi-step benchmark for multimodal CoT |\n| **MMGenBench** | [MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.14062) | [Link](https:\u002F\u002Fgithub.com\u002Flerogo\u002FMMGenBench) | A benchmark that gauges the performance of constructing image-generation prompt given an image |\n| **MiCEval** | [MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.14668) | [Link](https:\u002F\u002Fgithub.com\u002Falenai97\u002FMiCEval) | A multimodal CoT benchmark to evaluate MLLMs' reasoning capabilities |\n| **LiveXiv** | [LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers 
Content](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.10783) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FLiveXiv\u002FLiveXiv) | A live benchmark based on arXiv papers |\n| **TemporalBench** | [TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.10818) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmicrosoft\u002FTemporalBench) | A benchmark for evaluation of fine-grained temporal understanding |\n| **OmniBench** | [OmniBench: Towards The Future of Universal Omni-Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.15272) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fm-a-p\u002FOmniBench) | A benchmark that evaluates models' capabilities of processing visual, acoustic, and textual inputs simultaneously |\n| **MME-RealWorld** | [MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.13257) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyifanzhang114\u002FMME-RealWorld) | A challenging benchmark that involves real-life scenarios |\n| **VELOCITI** | [VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.10889) | [Link](https:\u002F\u002Fgithub.com\u002Fkatha-ai\u002FVELOCITI) | A video benchmark that evaluates perception and binding capabilities |\n| **MMR** | [Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.10638.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FMultimodal-Robustness-Benchmark) | A benchmark for measuring MLLMs' understanding capability and robustness to leading questions |\n| **CharXiv** | [CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.18521) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fprinceton-nlp\u002FCharXiv) | Chart understanding benchmark curated by human experts |\n| **Video-MME** | [Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.21075) | [Link](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME) | A comprehensive evaluation benchmark of Multi-modal LLMs in video analysis |\n| **VL-ICL Bench** | [VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.13164.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fys-zong\u002FVL-ICL) | A benchmark for M-ICL evaluation, covering a wide spectrum of tasks |\n| **TempCompass** | [TempCompass: Do Video LLMs Really Understand Videos?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.00476.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fllyx97\u002FTempCompass) | A benchmark to evaluate the temporal perception ability of Video LLMs |\n| **GVLQA** | [GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.02130) | [Link](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FYanbin99\u002Fgvlqa-datasets-65c705c9488606617e246bd3) | A benchmark for evaluation of graph reasoning capabilities |\n| **CoBSAT** | [Can MLLMs Perform Text-to-Image In-Context Learning?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.01293.pdf) | 
[Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyzeng58\u002FCoBSAT) | A benchmark for text-to-image ICL |\n| **VQAv2-IDK** | [Visually Dehallucinative Instruction Generation: Know What You Don't Know](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.09717.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fncsoft\u002Fidk) | A benchmark for assessing \"I Know\" visual hallucination |\n| **Math-Vision** | [Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.14804.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fmathvision-cuhk\u002FMathVision) | A diverse mathematical reasoning benchmark |\n| **SciMMIR** | [SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.13478) | [Link](https:\u002F\u002Fgithub.com\u002FWusiwei0410\u002FSciMMIR) | A benchmark for multi-modal information retrieval in the science domain |\n| **CMMMU** | [CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.11944.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FCMMMU-Benchmark\u002FCMMMU) | A Chinese benchmark involving reasoning and knowledge across multiple disciplines |\n| **MMCBench** | [Benchmarking Large Multimodal Models against Common Corruptions](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.11943.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002FMMCBench) | A benchmark for examining self-consistency under common corruptions |\n| **MMVP** | [Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.06209.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Ftsb0601\u002FMMVP) | A benchmark for assessing visual capabilities |\n| **TimeIT** | [TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02051.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FShuhuaiRen\u002FTimeIT) | A video instruction-tuning dataset with timestamp annotations, covering diverse time-sensitive video-understanding tasks. 
|\n| **ViP-Bench** | [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00784.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmucai\u002FViP-Bench) | A benchmark for visual prompts |\n| **M3DBench** | [M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10763.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FOpenM3D\u002FM3DBench) | A 3D-centric benchmark |\n| **Video-Bench** | [Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16103.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-Bench) | A benchmark for video-MLLM evaluation |\n| **Charting-New-Territories** | [Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.14656.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fjonathan-roberts1\u002Fcharting-new-territories) | A benchmark for evaluating geographic and geospatial capabilities |\n| **MLLM-Bench** | [MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.13951.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FMLLM-Bench) | GPT-4V evaluation with per-sample criteria |\n| **BenchLMM** | [BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02896.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAIFEG\u002FBenchLMM) | A benchmark for assessment of the robustness against different image styles |\n| **MMC-Benchmark** | [MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.10774.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FFuxiaoLiu\u002FMMC) | A comprehensive human-annotated benchmark with distinct tasks evaluating reasoning capabilities over charts |\n| **MVBench** | [MVBench: A Comprehensive Multi-modal Video Understanding Benchmark](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17005.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything\u002Fblob\u002Fmain\u002Fvideo_chat2\u002FMVBENCH.md) | A comprehensive multimodal benchmark for video understanding |\n| **Bingo** | [Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03287.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fgzcch\u002FBingo) | A benchmark for hallucination evaluation that focuses on two common types |\n| **MagnifierBench** | [OtterHD: A High-Resolution Multi-modality Model](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04219.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOtter-AI\u002FMagnifierBench) | A benchmark designed to probe models' ability of fine-grained perception |\n| **HallusionBench** | [HallusionBench: You See What You Think? Or You Think What You See? 
An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.14566.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Ftianyi-lab\u002FHallusionBench) |An image-context reasoning benchmark for evaluation of hallucination |\n| **PCA-EVAL** | [Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.02071.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fpkunlp-icler\u002FPCA-EVAL) | A benchmark for evaluating multi-domain embodied decision-making. |\n| **MMHal-Bench** | [Aligning Large Multimodal Models with Factually Augmented RLHF](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.14525.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FShengcao1006\u002FMMHal-Bench) | A benchmark for hallucination evaluation |\n| **MathVista** | [MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.02255.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAI4Math\u002FMathVista) | A benchmark that challenges both visual and math reasoning capabilities |\n| **SparklesEval** | [✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16463.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FHYPJUDY\u002FSparkles#sparkleseval) | A GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns based on three distinct criteria. |\n| **ISEKAI** | [Link-Context Learning for Multimodal LLMs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.07891.pdf) | [Link](https:\u002F\u002Fhuggingface.co\u002FISEKAI-Portal) | A benchmark comprising exclusively of unseen generated image-label pairs designed for link-context learning |\n| **M-HalDetect** | [Detecting and Preventing Hallucinations in Large Vision Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.06394.pdf) | [Coming soon]() | A dataset used to train and benchmark models for hallucination detection and prevention | \n| **I4** | [Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.04152.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FDCDmllm\u002FCheetah) | A benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions | \n| **SciGraphQA** | [SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.03349.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Ffindalexli\u002FSciGraphQA#data) | A large-scale chart-visual question-answering dataset |\n| **MM-Vet**| [MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.02490.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMM-Vet) | An evaluation benchmark that examines large multimodal models on complicated multimodal tasks |\n| **SEED-Bench** | [SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.16125.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FSEED-Bench) | A benchmark for evaluation of generative comprehension in MLLMs | \n| **MMBench** | 
[MMBench: Is Your Multi-modal Model an All-around Player?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.06281.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FMMBench) | A systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models |\n| **Lynx** | [What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02469.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Flynx-llm#prepare-data) |  A comprehensive evaluation benchmark including both image and video tasks |\n| **GAVIE** | [Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14565.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FFuxiaoLiu\u002FLRV-Instruction#evaluationgavie) | A benchmark to evaluate the hallucination and instruction following ability | \n| **MME** | [MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13394.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FEvaluation) | A comprehensive MLLM Evaluation benchmark |\n| **LVLM-eHub** | [LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.09265.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FMulti-Modality-Arena) | An evaluation platform for MLLMs |\n| **LAMM-Benchmark** | [LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.06687.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FOpenLAMM\u002FLAMM#lamm-benchmark) | A benchmark for evaluating the quantitative performance of MLLMs on various 2D\u002F3D vision tasks |\n| **M3Exam** | [M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05179.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FM3Exam) |  A multilingual, multimodal, multilevel benchmark for evaluating MLLMs |\n| **OwlEval** | [mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.14178.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-Owl\u002Ftree\u002Fmain\u002FOwlEval) | Dataset for evaluation on multiple capabilities |\n\n## Others\n| Name | Paper | Link | Notes |\n|:-----|:-----:|:----:|:-----:|\n| **IMAD** | [IMAD: IMage-Augmented multi-modal Dialogue](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.10512.pdf) | [Link](https:\u002F\u002Fgithub.com\u002FVityaVitalich\u002FIMAD) | Multimodal dialogue dataset |\n| **Video-ChatGPT** | [Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05424.pdf) | [Link](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT#quantitative-evaluation-bar_chart) | A quantitative evaluation framework for video-based dialogue models |\n| **CLEVR-ATVC** | [Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05983.pdf) | [Link](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1TqBzkyqxOSg1hgCXF8JjpYIAuRV-uVft) | A synthetic multimodal fine-tuning dataset for learning to reject instructions |\n| **Fruit-ATVC** | 
[Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05983.pdf) | [Link](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1Saaia2rRRb1nz5sKdmpzYdS4jHiMDaP0) | A manually pictured multimodal fine-tuning dataset for learning to reject instructions |\n| **InfoSeek** | [Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11713.pdf) | [Link](https:\u002F\u002Fopen-vision-language.github.io\u002Finfoseek\u002F) | A VQA dataset that focuses on asking information-seeking questions |\n| **OVEN** | [Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11154.pdf) | [Link](https:\u002F\u002Fopen-vision-language.github.io\u002Foven\u002F) | A dataset that focuses on recognizing the Visual Entity on the Wikipedia, from images in the wild |\n","# 令人惊叹的多模态大语言模型\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBradyFU_Awesome-Multimodal-Large-Language-Models_readme_161dcbf89c15.jpg\" width=\"100%\" height=\"100%\">\n\u003C\u002Fp>\n\n## ✨ NJU-MiG 的亮点\n\n> 🔥🔥 **MLLM 综述**  |  **[💬 微信（MLLM微信交流群）](.\u002Fimages\u002Fwechat-group.png)**\n\n- 🌟 **MME-Survey：多模态 LLM 评估的全面综述**  \narXiv 2025，[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.15296.pdf)，[项目](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FBenchmarks) \n\n- 🌟 **统一多模态理解与生成的综述：进展与挑战**  \narXiv 2025，[论文](https:\u002F\u002Fwww.techrxiv.org\u002Fdoi\u002Fpdf\u002F10.36227\u002Ftechrxiv.176289261.16802577)，[项目](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FUnified) \n\n- **多模态大语言模型综述**  \nNSR 2024，[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13549.pdf)，[项目](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models)\n\n\n---\n\n\n> 🔥🔥 **VITA 系列全能 MLLM** | **[💬 微信（VITA微信交流群）](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA\u002Fblob\u002Fmain\u002Fasset\u002Fwechat-group.jpg)**\n\n- **VITA-1.5：迈向 GPT-4o 级别的实时视觉与语音交互**  \nNeurIPS 2025 亮点，[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.01957.pdf)，[项目](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA)\n\n- **VITA-E：自然具身交互——同时看见、听见、说话与行动**  \narXiv 2025，[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.21817.pdf)，[项目](https:\u002F\u002Flxysl.github.io\u002FVITA-E\u002F)\n\n- **VITA：迈向开源互动型全能多模态 LLM**  \narXiv 2024，[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.05211.pdf)，[项目](https:\u002F\u002Fvita-home.github.io\u002F)\n\n- **Long-VITA：在保持领先短上下文准确率的同时，将大型多模态模型扩展至 100 万 token**  \narXiv 2025，[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.05177.pdf)，[项目](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FLong-VITA)\n\n- **VITA-Audio：高效大型语音-语言模型的快速交错跨模态 token 生成**  \nNeurIPS 2025，[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.03739.pdf)，[项目](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA-Audio)\n\n\n---\n\n\n> 🔥🔥 **MME 系列 MLLM 基准测试**\n\n- 🔥 **Video-MME-v2：迈向视频理解评估的新阶段**\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBradyFU_Awesome-Multimodal-Large-Language-Models_readme_88797976d3c8.png\" width=\"100%\" height=\"100%\">\n\u003C\u002Fp>\n\n\u003Cfont size=7>\u003Cdiv align='center' > [[🍎 项目页面](https:\u002F\u002Fvideo-mme-v2.netlify.app\u002F)] [[📖 
论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2604.05015)] [[🤗 数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMME-Benchmarks\u002FVideo-MME-v2)] [[🏆 排行榜](https:\u002F\u002Fvideo-mme-v2.netlify.app\u002F#leaderboard)]  \u003C\u002Fdiv>\u003C\u002Ffont>\n\n- 🌟 **MME-Survey：多模态 LLM 评估的全面综述**  \narXiv 2025，[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.15296.pdf)，[项目](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FBenchmarks)\n\n- **MME：多模态大语言模型的综合评估基准**  \nNeurIPS 2025 DB 亮点，[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13394.pdf)，[数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flmms-lab\u002FMME)，[评估工具](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Fblob\u002FEvaluation\u002Ftools\u002Feval_tool.zip)，[✒️ 引用](.\u002Fimages\u002Fbib_mme.txt)\n\n- **Video-MME：首个针对多模态 LLM 在视频分析中的综合评估基准**  \nCVPR 2025，[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.21075.pdf)，[项目](https:\u002F\u002Fvideo-mme.github.io\u002F)，[数据集](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME?tab=readme-ov-file#-dataset)\n\n- **MME-RealWorld：你的多模态 LLM 能否应对连人类都难以处理的高分辨率真实场景？**  \nICLR 2025，[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.13257.pdf)，[项目](https:\u002F\u002Fmme-realworld.github.io\u002F)，[数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyifanzhang114\u002FMME-RealWorld)\n\n\n\n\n---\n\n\u003Cfont size=5>\u003Ccenter>\u003Cb> 目录 \u003C\u002Fb> \u003C\u002Fcenter>\u003C\u002Ffont>\n- [精彩论文](#awesome-papers)\n  - [多模态指令微调（及最新成果）](#multimodal-instruction-tuning--latest-works)\n  - [多模态幻觉](#multimodal-hallucination)\n  - [多模态上下文学习](#multimodal-in-context-learning)\n  - [多模态思维链](#multimodal-chain-of-thought)\n  - [LLM 辅助视觉推理](#llm-aided-visual-reasoning)\n  - [基础模型](#foundation-models)\n  - [评估](#evaluation)\n  - [多模态 RLHF](#multimodal-rlhf)\n  - [其他](#others)\n- [精彩数据集](#awesome-datasets)\n  - [对齐预训练数据集](#datasets-of-pre-training-for-alignment)\n  - [多模态指令微调数据集](#datasets-of-multimodal-instruction-tuning)\n  - [上下文学习数据集](#datasets-of-in-context-learning)\n  - [多模态思维链数据集](#datasets-of-multimodal-chain-of-thought)\n  - [多模态 RLHF 数据集](#datasets-of-multimodal-rlhf)\n  - [评估基准](#benchmarks-for-evaluation)\n  - [其他](#others-1)\n---\n\n# 精彩论文\n\n## Multimodal Instruction Tuning (& Latest Works)\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMME-Benchmarks\u002FVideo-MME-v2.svg?style=social&label=Star) \u003Cbr> [**Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2604.05015) \u003Cbr> | arXiv | 2026-04-06 | [Github](https:\u002F\u002Fgithub.com\u002FMME-Benchmarks\u002FVideo-MME-v2) | [Demo](https:\u002F\u002Fvideo-mme-v2.netlify.app\u002F) |\n| [**Introducing Muse Spark: Scaling Towards Personal Superintelligence**](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fintroducing-muse-spark-msl\u002F) | Blog | 2026-04-08 | - | [Demo](https:\u002F\u002Fmeta.ai\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FVITA-QinYu.svg?style=social&label=Star) \u003Cbr> [**VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing**](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA-QinYu) \u003Cbr> | arXiv | 2026-04-03 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA-QinYu) | Local Demo |\n| 
[**Gemma 4: Byte for byte, the most capable open models**](https:\u002F\u002Fdeepmind.google\u002Fmodels\u002Fgemma\u002Fgemma-4\u002F) | Blog | 2026-04-02 | - | [Demo](https:\u002F\u002Faistudio.google.com\u002Fprompts\u002Fnew_chat?model=gemma-4-31b-it&utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=) |\n| [**Qwen3.5-Omni: Scaling Up, Toward Native Omni-Modal AGI**](https:\u002F\u002Fqwen.ai\u002Fblog?id=qwen3.5-omni) | Blog | 2026-03-30 | - | [Demo](https:\u002F\u002Fchat.qwen.ai\u002F?spm=a2ty_o06.30285417.0.0.6d26c921GDrWrb) |\n| [**Xiaomi MiMo-V2-Omni**](https:\u002F\u002Fmimo.xiaomi.com\u002Fmimo-v2-omni) | Blog | 2026-03-18 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL-U.svg?style=social&label=Star) \u003Cbr> [**InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2603.09877) \u003Cbr> | arXiv | 2026-03-10 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL-U) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FOmni-Diffusion.svg?style=social&label=Star) \u003Cbr> [**Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2603.06577) \u003Cbr> | arXiv | 2026-03-06 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FOmni-Diffusion) | - |\n| [**Beyond Language Modeling: An Exploration of Multimodal Pretraining**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2603.03276) | arXiv | 2026-03-03 | - | - |\n| [**Gemini 3.1 Pro: A smarter model for your most complex tasks**](https:\u002F\u002Fblog.google\u002Finnovation-and-ai\u002Fmodels-and-research\u002Fgemini-models\u002Fgemini-3-1-pro\u002F) | Blog | 2026-02-19 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3.5.svg?style=social&label=Star) \u003Cbr> [**Qwen3.5: Towards Native Multimodal Agents**](https:\u002F\u002Fqwen.ai\u002Fblog?id=qwen3.5) \u003Cbr> | Blog | 2026-02-16 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3.5) | [Demo](https:\u002F\u002Fchat.qwen.ai\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenBMB\u002FMiniCPM-o.svg?style=social&label=Star) \u003Cbr> [**MiniCPM-o 4.5**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-o-4_5) \u003Cbr> | Blog | 2026-02-06 | [Github](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-o) | [Demo](https:\u002F\u002Fminicpm-omni.openbmb.cn\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-OCR-2.svg?style=social&label=Star) \u003Cbr> [**DeepSeek-OCR 2: Visual Causal Flow**](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-OCR-2\u002Fblob\u002Fmain\u002FDeepSeek_OCR2_paper.pdf) \u003Cbr> | DeepSeek | 2026-01-27 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-OCR-2) | - |\n| [**Seed1.8 Model Card: Towards Generalized Real-World Agency**](https:\u002F\u002Flf3-static.bytednsdoc.com\u002Fobj\u002Feden-cn\u002Flapzild-tss\u002FljhwZthlaukjlkulzlp\u002Fresearch\u002FSeed-1.8-Modelcard.pdf) | Bytedance Seed | 2025-12-18 | - | - |\n| [**Introducing GPT-5.2**](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-gpt-5-2\u002F) | OpenAI | 2025-12-11 | - | - |\n| [**Introducing Mistral 3**](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fmistral-3) | Blog | 
2025-12-02 | [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fmistralai\u002Fmistral-large-3) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3-VL.svg?style=social&label=Star) \u003Cbr> [**Qwen3-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.21631) \u003Cbr> | arXiv | 2025-11-26 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen3-VL-Demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu3.5.svg?style=social&label=Star) \u003Cbr> [**Emu3.5: Native Multimodal Models are World Learners**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.26583) \u003Cbr> | arXiv | 2025-10-30 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu3.5) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencent\u002FVITA.svg?style=social&label=Star) \u003Cbr> [**VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.21817.pdf) \u003Cbr> | arXiv | 2025-10-21 | [Github](https:\u002F\u002Fgithub.com\u002FTencent\u002FVITA\u002Ftree\u002FVITA-E) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-OCR.svg?style=social&label=Star) \u003Cbr> [**DeepSeek-OCR: Contexts Optical Compression**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.18234) \u003Cbr> | arXiv | 2025-10-21 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-OCR) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FOmniVinci.svg?style=social&label=Star) \u003Cbr> [**OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.15870) \u003Cbr> | arXiv | 2025-10-17 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FOmniVinci) | - |\n| [**NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.13721) | arXiv | 2025-10-16 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSenseTime-FVG\u002FInteractiveOmni.svg?style=social&label=Star) \u003Cbr> [**InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.13747) | arXiv | 2025-10-15 | [Github](https:\u002F\u002Fgithub.com\u002FSenseTime-FVG\u002FInteractiveOmni) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencent\u002FVITA.svg?style=social&label=Star) \u003Cbr> [**VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.09607) \u003Cbr> | arXiv | 2025-10-10 | [Github](https:\u002F\u002Fgithub.com\u002FTencent\u002FVITA\u002Ftree\u002FVITA-VLA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEvolvingLMMs-Lab\u002FLLaVA-OneVision-1.5.svg?style=social&label=Star) \u003Cbr> [**LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.23661) \u003Cbr> | arXiv | 2025-10-09 | [Github](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002FLLaVA-OneVision-1.5) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flmms-lab\u002FLLaVA-OneVision-1.5) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3-Omni.svg?style=social&label=Star) \u003Cbr> [**Qwen3-Omni Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.17765) \u003Cbr> | arXiv | 2025-09-22 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-Omni) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen3-Omni-Demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.18265) \u003Cbr> | arXiv | 2025-08-27 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Fchat.intern-ai.org.cn\u002F) |\n| **MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and Video Understanding on Your Phone** | - | 2025-08-26 | [Github](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-o) | [Demo](http:\u002F\u002F101.126.42.235:30910\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfzhang114\u002FThyme.svg?style=social&label=Star) \u003Cbr> [**Thyme: Think Beyond Images**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.11630) \u003Cbr> | arXiv | 2025-08-18 | [Github](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme) | [Demo](https:\u002F\u002Fthyme-vl.github.io\u002F) |\n| [**Introducing GPT-5**](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-gpt-5\u002F) | OpenAI | 2025-08-07 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frednote-hilab\u002Fdots.vlm1.svg?style=social&label=Star) \u003Cbr> [**dots.vlm1**](https:\u002F\u002Fgithub.com\u002Frednote-hilab\u002Fdots.vlm1) \u003Cbr> | rednote-hilab | 2025-08-06 | [Github](https:\u002F\u002Fgithub.com\u002Frednote-hilab\u002Fdots.vlm1) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Frednote-hilab\u002Fdots-vlm1-demo) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FGLM-4.1V-Thinking.svg?style=social&label=Star) \u003Cbr> [**Step3: Cost-Effective Multimodal Intelligence**](https:\u002F\u002Fstepfun.ai\u002Fresearch\u002Fstep3) \u003Cbr> | StepFun | 2025-07-31 | [Github](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStep3) | [Demo](https:\u002F\u002Fstepfun.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FGLM-4.1V-Thinking.svg?style=social&label=Star) \u003Cbr> [**GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.01006) \u003Cbr> | arXiv | 2025-07-02 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FGLM-4.1V-Thinking) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTHUDM\u002FGLM-4.1V-9B-Thinking-API-Demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FlxtGH\u002FDenseWorld-1M.svg?style=social&label=Star) \u003Cbr> [**DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.24102) \u003Cbr> | arXiv | 2025-06-30 | [Github](https:\u002F\u002Fgithub.com\u002FlxtGH\u002FDenseWorld-1M) | - |\n| [**Qwen VLo: From \"Understanding\" the World to \"Depicting\" It**](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen-vlo\u002F) | Qwen | 2025-06-26 | - | [Demo](https:\u002F\u002Fchat.qwen.ai\u002F) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEvolvingLMMs-Lab\u002Fmultimodal-search-r1.svg?style=social&label=Star) \u003Cbr> [**MMSearch-R1: Incentivizing LMMs to Search**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.20670) \u003Cbr> | arXiv | 2025-06-25 | [Github](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Fmultimodal-search-r1) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002FShow-o.svg?style=social&label=Star) \u003Cbr> [**Show-o2: Improved Native Unified Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.15564) \u003Cbr> | arXiv | 2025-06-18 | [Github](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FShow-o) | - |\n| [**Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities**](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_v2_5_report.pdf) | Google | 2025-06-17 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fegolife-ai\u002FEgo-R1.svg?style=social&label=Star) \u003Cbr> [**Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.13654) \u003Cbr> | arXiv | 2025-06-16 | [Github](https:\u002F\u002Fgithub.com\u002Fegolife-ai\u002FEgo-R1) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FXiaomiMiMo\u002FMiMo-VL.svg?style=social&label=Star) \u003Cbr> [**MiMo-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.03569) \u003Cbr> | arXiv | 2025-06-04 | [Github](https:\u002F\u002Fgithub.com\u002FXiaomiMiMo\u002FMiMo-VL) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwusize\u002FOpenUni.svg?style=social&label=Star) \u003Cbr> [**OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.23661) \u003Cbr> | arXiv | 2025-05-29 | [Github](https:\u002F\u002Fgithub.com\u002Fwusize\u002FOpenUni) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance-seed\u002FBAGEL.svg?style=social&label=Star) \u003Cbr> [**Emerging Properties in Unified Multimodal Pretraining**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.14683) \u003Cbr> | arXiv | 2025-05-23 | [Github](https:\u002F\u002Fgithub.com\u002Fbytedance-seed\u002FBAGEL) | [Demo](https:\u002F\u002Fdemo.bagel-ai.org\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGen-Verse\u002FMMaDA.svg?style=social&label=Star) \u003Cbr> [**MMaDA: Multimodal Large Diffusion Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.15809) \u003Cbr> | arXiv | 2025-05-21 | [Github](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FMMaDA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FGen-Verse\u002FMMaDA) |\n| [**UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.14682) | arXiv | 2025-05-20 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJiuhaiChen\u002FBLIP3o.svg?style=social&label=Star) \u003Cbr> [**BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.09568) \u003Cbr> | arXiv | 2025-05-14 | [Github](https:\u002F\u002Fgithub.com\u002FJiuhaiChen\u002FBLIP3o) | Local Demo |\n| [**Seed1.5-VL Technical 
Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.07062) | arXiv | 2025-05-11 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHITsz-TMG\u002FAwesome-Large-Multimodal-Reasoning-Models.svg?style=social&label=Star) \u003Cbr> [**Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.04921) \u003Cbr> | arXiv | 2025-05-08 | [Github](https:\u002F\u002Fgithub.com\u002FHITsz-TMG\u002FAwesome-Large-Multimodal-Reasoning-Models) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FVITA-Audio.svg?style=social&label=Star) \u003Cbr> [**VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.03739) \u003Cbr> | arXiv | 2025-05-06 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA-Audio) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-R1V.svg?style=social&label=Star) \u003Cbr> [**Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.16656) \u003Cbr> | arXiv | 2025-04-23 | [Github](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-R1V) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FEAGLE.svg?style=social&label=Star) \u003Cbr> [**Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.15271) \u003Cbr> | arXiv | 2025-04-21 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FEAGLE) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fquicksviewer\u002Fquicksviewer.svg?style=social&label=Star) \u003Cbr> [**An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.15270) \u003Cbr> | arXiv | 2025-04-21 | [Github](https:\u002F\u002Fgithub.com\u002Fquicksviewer\u002Fquicksviewer) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10479) \u003Cbr> | arXiv | 2025-04-14 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Finternvl.opengvlab.com\u002F) |\n| [**Introducing GPT-4.1 in the API**](https:\u002F\u002Fopenai.com\u002Findex\u002Fgpt-4-1\u002F) | OpenAI | 2025-04-14 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMoonshotAI\u002FKimi-VL.svg?style=social&label=Star) \u003Cbr> [**Kimi-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.07491) \u003Cbr> | arXiv | 2025-04-10 | [Github](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmoonshotai\u002FKimi-VL-A3B-Thinking) |\n| [**The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation**](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fllama-4-multimodal-intelligence\u002F) | Meta | 2025-04-05 | [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fmeta-llama\u002Fllama-4-67f0c30d9fe03840bc9d0164) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2.5-Omni.svg?style=social&label=Star) \u003Cbr> [**Qwen2.5-Omni Technical Report**](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Omni\u002Fblob\u002Fmain\u002Fassets\u002FQwen2.5_Omni.pdf) \u003Cbr> | Qwen | 2025-03-26 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Omni) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen2.5-Omni-7B-Demo) |\n| [**Addendum to GPT-4o System Card: Native image generation**](https:\u002F\u002Fcdn.openai.com\u002F11998be9-5319-4302-bfbf-1167e093f1fb\u002FNative_Image_Generation_System_Card.pdf) | OpenAI | 2025-03-25 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FSparrow.svg?style=social&label=Star) \u003Cbr> [**Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.19951) \u003Cbr> | arXiv | 2025-03-17 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FSparrow) | - |\n| [**Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.01879) | arXiv | 2025-03-07 | - | - |\n| [**Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.01743) | arXiv | 2025-03-03 | [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FPhi-4-multimodal-instruct) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002Fphi-4-multimodal) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FLong-VITA.svg?style=social&label=Star) \u003Cbr> [**Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.05177) \u003Cbr> | arXiv | 2025-02-19 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FLong-VITA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2.5-VL.svg?style=social&label=Star) \u003Cbr> [**Qwen2.5-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.13923) \u003Cbr> | arXiv | 2025-02-19 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen2.5-VL) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaichuan-inc\u002FBaichuan-Omni-1.5.svg?style=social&label=Star) \u003Cbr> [**Baichuan-Omni-1.5 Technical Report**](https:\u002F\u002Fgithub.com\u002Fbaichuan-inc\u002FBaichuan-Omni-1.5\u002Fblob\u002Fmain\u002Fbaichuan_omni_1_5.pdf) \u003Cbr> | Tech Report | 2025-01-26 | [Github](https:\u002F\u002Fgithub.com\u002Fbaichuan-inc\u002FBaichuan-Omni-1.5) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FLlamaV-o1.svg?style=social&label=Star) \u003Cbr> [**LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.06186) \u003Cbr> | arXiv | 2025-01-10 | [Github](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FLlamaV-o1) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FVITA.svg?style=social&label=Star) \u003Cbr> [**VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.01957) \u003Cbr> | arXiv | 2025-01-03 | 
[Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2-VL.svg?style=social&label=Star) \u003Cbr> [**QVQ: To See the World with Wisdom**](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqvq-72b-preview\u002F) \u003Cbr> | Qwen | 2024-12-25 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2-VL) | [Demo](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqvq-72b-preview\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-VL2.svg?style=social&label=Star) \u003Cbr> [**DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.10302) \u003Cbr> | arXiv | 2024-12-13 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-VL2) | - |\n| [**Apollo: An Exploration of Video Understanding in Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.10360) | arXiv | 2024-12-13 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.09596) \u003Cbr> | arXiv | 2024-12-12 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer\u002Ftree\u002Fmain\u002FInternLM-XComposer-2.5-OmniLive) | Local Demo |\n| [**StreamChat: Chatting with Streaming Video**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.08646) | arXiv | 2024-12-11 | Coming soon | - |\n| [**CompCap: Improving Multimodal Large Language Models with Composite Captions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.05243) | arXiv | 2024-12-06 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgls0425\u002FLinVT.svg?style=social&label=Star) \u003Cbr> [**LinVT: Empower Your Image-level Large Language Model to Understand Videos**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.05185) \u003Cbr> | arXiv | 2024-12-06 | [Github](https:\u002F\u002Fgithub.com\u002Fgls0425\u002FLinVT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.05271) \u003Cbr> | arXiv | 2024-12-06 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Finternvl.opengvlab.com) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FVILA.svg?style=social&label=Star) \u003Cbr> [**NVILA: Efficient Frontier Visual Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.04468) \u003Cbr> | arXiv | 2024-12-05 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVILA) | [Demo](https:\u002F\u002Fvila.mit.edu) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Finst-it\u002Finst-it.svg?style=social&label=Star) \u003Cbr> [**Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03565) \u003Cbr> | arXiv | 2024-12-04 | [Github](https:\u002F\u002Fgithub.com\u002Finst-it\u002Finst-it) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTimeMarker-LLM\u002FTimeMarker.svg?style=social&label=Star) \u003Cbr> [**TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.18211) \u003Cbr> | arXiv | 2024-11-27 | [Github](https:\u002F\u002Fgithub.com\u002FTimeMarker-LLM\u002FTimeMarker\u002F) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIDEA-Research\u002FChatRex.svg?style=social&label=Star) \u003Cbr> [**ChatRex: Taming Multimodal LLM for Joint Perception and Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.18363) \u003Cbr> | arXiv | 2024-11-27 | [Github](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FChatRex) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FLongVU.svg?style=social&label=Star) \u003Cbr> [**LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.17434) \u003Cbr> | arXiv | 2024-10-22 | [Github](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FLongVU) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVision-CAIR\u002FLongVU) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshikiw\u002FModality-Integration-Rate.svg?style=social&label=Star) \u003Cbr> [**Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.07167) \u003Cbr> | arXiv | 2024-10-09 | [Github](https:\u002F\u002Fgithub.com\u002Fshikiw\u002FModality-Integration-Rate) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frese1f\u002Faurora.svg?style=social&label=Star) \u003Cbr> [**AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.03051) \u003Cbr> | arXiv | 2024-10-04 | [Github](https:\u002F\u002Fgithub.com\u002Frese1f\u002Faurora) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Femova-ollm\u002FEMOVA.svg?style=social&label=Star) \u003Cbr> [**EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.18042) \u003Cbr> | CVPR | 2024-09-26 | [Github](https:\u002F\u002Fgithub.com\u002Femova-ollm\u002FEMOVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FEmova-ollm\u002FEMOVA-demo) | \n| [**Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.17146) | arXiv | 2024-09-25 | [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fallenai\u002FMolmoE-1B-0924) | [Demo](https:\u002F\u002Fmolmo.allenai.org) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2-VL.svg?style=social&label=Star) \u003Cbr> [**Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.12191) \u003Cbr> | arXiv | 2024-09-18 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen2-VL) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIDEA-FinAI\u002FChartMoE.svg?style=social&label=Star) \u003Cbr> [**ChartMoE: Mixture of Expert Connector for Advanced Chart 
Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.03277) \u003Cbr> | ICLR | 2024-09-05 | [Github](https:\u002F\u002Fgithub.com\u002FIDEA-FinAI\u002FChartMoE) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFreedomIntelligence\u002FLongLLaVA.svg?style=social&label=Star) \u003Cbr> [**LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.02889) \u003Cbr> | arXiv | 2024-09-04 | [Github](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FLongLLaVA) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FEagle.svg?style=social&label=Star) \u003Cbr> [**EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15998) \u003Cbr> | arXiv | 2024-08-28 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FEagle) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FNVEagle\u002FEagle-X5-13B-Chat) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshufangxun\u002FLLaVA-MoD.svg?style=social&label=Star) \u003Cbr> [**LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15881) \u003Cbr> | arXiv | 2024-08-28 | [Github](https:\u002F\u002Fgithub.com\u002Fshufangxun\u002FLLaVA-MoD) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-Owl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models**](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2408.04840) \u003Cbr> | arXiv | 2024-08-09 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-Owl) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FVITA.svg?style=social&label=Star) \u003Cbr> [**VITA: Towards Open-Source Interactive Omni Multimodal LLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.05211) \u003Cbr> | arXiv | 2024-08-09 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLaVA-VL\u002FLLaVA-NeXT.svg?style=social&label=Star) \u003Cbr> [**LLaVA-OneVision: Easy Visual Task Transfer**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.03326) \u003Cbr> | arXiv | 2024-08-06 | [Github](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-NeXT) | [Demo](https:\u002F\u002Fllava-onevision.lmms-lab.com) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenBMB\u002FMiniCPM-V.svg?style=social&label=Star) \u003Cbr> [**MiniCPM-V: A GPT-4V Level MLLM on Your Phone**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.01800) \u003Cbr> | arXiv | 2024-08-03 | [Github](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-V) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopenbmb\u002FMiniCPM-Llama3-V-2_5) |\n| [**VILA^2: VILA Augmented VILA**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.17453) | arXiv | 2024-07-24 | - | - |\n| [**SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.15841) | arXiv | 2024-07-22 | - | - |\n| [**EVLM: An Efficient Vision-Language Model for Visual Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.14177) | arXiv | 2024-07-19 | - | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjiyt17\u002FIDA-VLM.svg?style=social&label=Star) \u003Cbr> [**IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.07577) \u003Cbr> | arXiv | 2024-07-10 | [Github](https:\u002F\u002Fgithub.com\u002Fjiyt17\u002FIDA-VLM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.03320) \u003Cbr> | arXiv | 2024-07-03 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer) | [Demo](https:\u002F\u002Fopenxlab.org.cn\u002Fapps\u002Fdetail\u002FWillowBreeze\u002FInternLM-XComposer) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FlxtGH\u002FOMG-Seg.svg?style=social&label=Star) \u003Cbr> [**OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.19389) \u003Cbr> | arXiv | 2024-06-27 | [Github](https:\u002F\u002Fgithub.com\u002FlxtGH\u002FOMG-Seg) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZZZHANG-jx\u002FDocKylin.svg?style=social&label=Star) \u003Cbr> [**DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.19101) \u003Cbr> | AAAI | 2024-06-27 | [Github](https:\u002F\u002Fgithub.com\u002FZZZHANG-jx\u002FDocKylin) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcambrian-mllm\u002Fcambrian.svg?style=social&label=Star) \u003Cbr> [**Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.16860) \u003Cbr> | arXiv | 2024-06-24 | [Github](https:\u002F\u002Fgithub.com\u002Fcambrian-mllm\u002Fcambrian) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEvolvingLMMs-Lab\u002FLongVA.svg?style=social&label=Star) \u003Cbr> [**Long Context Transfer from Language to Vision**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.16852) \u003Cbr> | arXiv | 2024-06-24 | [Github](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002FLongVA) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FSALMONN.svg?style=social&label=Star) \u003Cbr> [**video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.15704) \u003Cbr> | ICML | 2024-06-22 | [Github](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FSALMONN) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByungKwanLee\u002FTroL.svg?style=social&label=Star) \u003Cbr> [**TroL: Traversal of Layers for Large Language and Vision Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.12246) \u003Cbr> | EMNLP | 2024-06-18 | [Github](https:\u002F\u002Fgithub.com\u002FByungKwanLee\u002FTroL) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEVE.svg?style=social&label=Star) \u003Cbr> [**Unveiling Encoder-Free Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.11832) \u003Cbr> | arXiv | 2024-06-17 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEVE) | Local Demo |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002FVideoLLM-online.svg?style=social&label=Star) \u003Cbr> [**VideoLLM-online: Online Video Large Language Model for Streaming Video**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.11816) \u003Cbr> | CVPR | 2024-06-17 | [Github](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FVideoLLM-online) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwentaoyuan\u002FRoboPoint.svg?style=social&label=Star) \u003Cbr> [**RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.10721) \u003Cbr> | CoRL | 2024-06-15 | [Github](https:\u002F\u002Fgithub.com\u002Fwentaoyuan\u002FRoboPoint) | [Demo](https:\u002F\u002F007e03d34429a2517b.gradio.live\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwlin-at\u002FCaD-VI) \u003Cbr> [**Comparison Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09240) \u003Cbr> | arXiv | 2024-06-13 | [Github](https:\u002F\u002Fwlin-at.github.io\u002Fcad_vi) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfzhang114\u002FSliME.svg?style=social&label=Star) \u003Cbr> [**Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.08487) \u003Cbr> | arXiv | 2024-06-12 | [Github](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FSliME) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideoLLaMA2.svg?style=social&label=Star) \u003Cbr> [**VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.07476) \u003Cbr> | arXiv | 2024-06-11 | [Github](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIDC-AI\u002FParrot.svg?style=social&label=Star) \u003Cbr> [**Parrot: Multilingual Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.02539) \u003Cbr> | arXiv | 2024-06-04 | [Github](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FParrot) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIDC-AI\u002FOvis.svg?style=social&label=Star) \u003Cbr> [**Ovis: Structural Embedding Alignment for Multimodal Large Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.20797) \u003Cbr> | arXiv | 2024-05-31 | [Github](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FOvis\u002F) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgordonhu608\u002FMQT-LLaVA.svg?style=social&label=Star) \u003Cbr> [**Matryoshka Query Transformer for Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.19315) \u003Cbr> | arXiv | 2024-05-29 | [Github](https:\u002F\u002Fgithub.com\u002Fgordonhu608\u002FMQT-LLaVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fgordonhu\u002FMQT-LLaVA) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falibaba\u002Fconv-llava.svg?style=social&label=Star) \u003Cbr> [**ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.15738) \u003Cbr> | arXiv | 2024-05-24 | [Github](https:\u002F\u002Fgithub.com\u002Falibaba\u002Fconv-llava) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByungKwanLee\u002FMeteor.svg?style=social&label=Star) \u003Cbr> [**Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.15574) \u003Cbr> | arXiv | 2024-05-24 | [Github](https:\u002F\u002Fgithub.com\u002FByungKwanLee\u002FMeteor) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FBK-Lee\u002FMeteor) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYifanXu74\u002FLibra.svg?style=social&label=Star) \u003Cbr> [**Libra: Building Decoupled Vision System on Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.10140) \u003Cbr> | ICML | 2024-05-16 | [Github](https:\u002F\u002Fgithub.com\u002FYifanXu74\u002FLibra) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSHI-Labs\u002FCuMo.svg?style=social&label=Star) \u003Cbr> [**CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.05949) \u003Cbr> | arXiv | 2024-05-09 | [Github](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FCuMo) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.16821) \u003Cbr> | arXiv | 2024-04-25 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Finternvl.opengvlab.com) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgraphic-design-ai\u002Fgraphist.svg?style=social&label=Star) \u003Cbr> [**Graphic Design with Large Multimodal Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.14368) \u003Cbr> | arXiv | 2024-04-22 | [Github](https:\u002F\u002Fgithub.com\u002Fgraphic-design-ai\u002Fgraphist) | - |\n| [**BRAVE: Broadening the visual encoding of vision-language models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.07204) | ECCV | 2024-04-10 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.06512.pdf) \u003Cbr> | arXiv | 2024-04-09 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FWillow123\u002FInternLM-XComposer) |\n| [**Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.05719.pdf) | arXiv | 2024-04-08 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fboheumd\u002FMA-LMM.svg?style=social&label=Star) \u003Cbr> [**MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.05726.pdf) \u003Cbr> | CVPR | 2024-04-08 | [Github](https:\u002F\u002Fgithub.com\u002Fboheumd\u002FMA-LMM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FVitron.svg?style=social&label=Star) \u003Cbr> [**VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing**](https:\u002F\u002Fhaofei.vip\u002Fdownloads\u002Fpapers\u002FSkywork_Vitron_2024.pdf) \u003Cbr> | 
NeurIPS | 2024-04-04 | [Github](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FVitron) | Local Demo |\n| [**TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3654674) | ACM TKDD | 2024-03-28 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FLITA.svg?style=social&label=Star) \u003Cbr> [**LITA: Language Instructed Temporal-Localization Assistant**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.19046) | arXiv | 2024-03-27 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FLITA) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FMiniGemini.svg?style=social&label=Star) \u003Cbr> [**Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.18814.pdf) \u003Cbr> | arXiv | 2024-03-27 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FMiniGemini) | [Demo](http:\u002F\u002F103.170.5.190:7860) |\n| [**MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.09611.pdf) | arXiv | 2024-03-14 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByungKwanLee\u002FMoAI.svg?style=social&label=Star) \u003Cbr> [**MoAI: Mixture of All Intelligence for Large Language and Vision Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.07508.pdf) \u003Cbr> | arXiv | 2024-03-12 | [Github](https:\u002F\u002Fgithub.com\u002FByungKwanLee\u002FMoAI) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-VL.svg?style=social&label=Star) \u003Cbr> [**DeepSeek-VL: Towards Real-World Vision-Language Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.05525) \u003Cbr> | arXiv | 2024-03-08 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fdeepseek-ai\u002FDeepSeek-VL-7B) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuliang-Liu\u002FMonkey.svg?style=social&label=Star) \u003Cbr> [**TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.04473.pdf) \u003Cbr> | arXiv | 2024-03-07 | [Github](https:\u002F\u002Fgithub.com\u002FYuliang-Liu\u002FMonkey) | [Demo](http:\u002F\u002Fvlrlab-monkey.xyz:7684) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002Fall-seeing.svg?style=social&label=Star) \u003Cbr> [**The All-Seeing Project V2: Towards General Relation Comprehension of the Open World**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.19474.pdf) | arXiv | 2024-02-29 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002Fall-seeing) | - |\n| [**GROUNDHOG: Grounding Large Language Models to Holistic Segmentation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.16846.pdf) | CVPR | 2024-02-26 | Coming soon | Coming soon |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenMOSS\u002FAnyGPT.svg?style=social&label=Star) \u003Cbr> [**AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.12226.pdf) \u003Cbr> | arXiv | 2024-02-19 | [Github](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FAnyGPT) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDCDmllm\u002FMomentor.svg?style=social&label=Star) \u003Cbr> [**Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11435.pdf) \u003Cbr> | arXiv | 2024-02-18 | [Github](https:\u002F\u002Fgithub.com\u002FDCDmllm\u002FMomentor) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFreedomIntelligence\u002FALLaVA.svg?style=social&label=Star) \u003Cbr> [**ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11684.pdf) \u003Cbr> | arXiv | 2024-02-18 | [Github](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FALLaVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002FFreedomIntelligence\u002FALLaVA-3B) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByungKwanLee\u002FCoLLaVO-Crayon-Large-Language-and-Vision-mOdel.svg?style=social&label=Star) \u003Cbr> [**CoLLaVO: Crayon Large Language and Vision mOdel**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11248.pdf) \u003Cbr> | arXiv | 2024-02-17 | [Github](https:\u002F\u002Fgithub.com\u002FByungKwanLee\u002FCoLLaVO-Crayon-Large-Language-and-Vision-mOdel) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTRI-ML\u002Fprismatic-vlms.svg?style=social&label=Star) \u003Cbr> [**Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.07865) \u003Cbr> | ICML | 2024-02-12 | [Github](https:\u002F\u002Fgithub.com\u002FTRI-ML\u002Fprismatic-vlms) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FCogCoM.svg?style=social&label=Star) \u003Cbr> [**CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.04236.pdf) \u003Cbr> | arXiv | 2024-02-06 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogCoM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMeituan-AutoML\u002FMobileVLM.svg?style=social&label=Star) \u003Cbr> [**MobileVLM V2: Faster and Stronger Baseline for Vision Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.03766.pdf) \u003Cbr> | arXiv | 2024-02-06 | [Github](https:\u002F\u002Fgithub.com\u002FMeituan-AutoML\u002FMobileVLM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWEIYanbin1999\u002FGITA.svg?style=social&label=Star) \u003Cbr> [**GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.02130) \u003Cbr> | NeurIPS | 2024-02-03 | [Github](https:\u002F\u002Fgithub.com\u002FWEIYanbin1999\u002FGITA\u002F) | - |\n| [**Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.17981.pdf) | arXiv | 2024-01-31 | [Coming soon]() | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaotian-liu\u002FLLaVA.svg?style=social&label=Star) \u003Cbr> [**LLaVA-NeXT: Improved reasoning, OCR, and world knowledge**](https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-01-30-llava-next\u002F) | Blog | 2024-01-30 | [Github](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) | [Demo](https:\u002F\u002Fllava.hliu.cc) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FMoE-LLaVA.svg?style=social&label=Star) \u003Cbr> [**MoE-LLaVA: Mixture of Experts for Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.15947.pdf) \u003Cbr> | arXiv | 2024-01-29 | [Github](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FMoE-LLaVA) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.16420.pdf) \u003Cbr> | arXiv | 2024-01-29 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer) | [Demo](https:\u002F\u002Fopenxlab.org.cn\u002Fapps\u002Fdetail\u002FWillowBreeze\u002FInternLM-XComposer) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F01-ai\u002FYi.svg?style=social&label=Star) \u003Cbr> [**Yi-VL**](https:\u002F\u002Fgithub.com\u002F01-ai\u002FYi\u002Ftree\u002Fmain\u002FVL) \u003Cbr> | - | 2024-01-23 | [Github](https:\u002F\u002Fgithub.com\u002F01-ai\u002FYi\u002Ftree\u002Fmain\u002FVL) | Local Demo |\n| [**SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.12168.pdf) | arXiv | 2024-01-22 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FChartAst.svg?style=social&label=Star) \u003Cbr> [**ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.02384) \u003Cbr> | ACL | 2024-01-04 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FChartAst) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMeituan-AutoML\u002FMobileVLM.svg?style=social&label=Star) \u003Cbr> [**MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.16886.pdf) \u003Cbr> | arXiv | 2023-12-28 | [Github](https:\u002F\u002Fgithub.com\u002FMeituan-AutoML\u002FMobileVLM) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14238.pdf) \u003Cbr> | CVPR | 2023-12-21 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Finternvl.opengvlab.com) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCircleRadon\u002FOsprey.svg?style=social&label=Star) \u003Cbr> [**Osprey: Pixel Understanding with Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10032.pdf) \u003Cbr> | CVPR | 2023-12-15 | [Github](https:\u002F\u002Fgithub.com\u002FCircleRadon\u002FOsprey) | [Demo](http:\u002F\u002F111.0.123.204:8000\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FCogVLM.svg?style=social&label=Star) \u003Cbr> [**CogAgent: A Visual Language Model for GUI Agents**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08914.pdf) \u003Cbr> | arXiv | 2023-12-14 | 
[Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVLM) | [Coming soon]() |\n| [**Pixel Aligned Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.09237.pdf) | arXiv | 2023-12-14 | [Coming soon]() | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FVILA.svg?style=social&label=Star) \u003Cbr> [**VILA: On Pre-training for Visual Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07533) \u003Cbr> | CVPR | 2023-12-13 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVILA) | Local Demo |\n| [**See, Say, and Segment: Teaching LMMs to Overcome False Premises**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08366.pdf) | arXiv | 2023-12-13 | [Coming soon]() | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUcas-HaoranWei\u002FVary.svg?style=social&label=Star) \u003Cbr> [**Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.06109.pdf) \u003Cbr> | ECCV | 2023-12-11 | [Github](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FVary) | [Demo](http:\u002F\u002Fregion-31.seetacloud.com:22701\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkakaobrain\u002Fhoneybee.svg?style=social&label=Star) \u003Cbr> [**Honeybee: Locality-enhanced Projector for Multimodal LLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.06742.pdf) \u003Cbr> | CVPR | 2023-12-11 | [Github](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fhoneybee) | - |\n| [**Gemini: A Family of Highly Capable Multimodal Models**](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_1_report.pdf) | Google | 2023-12-06 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcsuhan\u002FOneLLM.svg?style=social&label=Star) \u003Cbr> [**OneLLM: One Framework to Align All Modalities with Language**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03700.pdf) \u003Cbr> | arXiv | 2023-12-06 | [Github](https:\u002F\u002Fgithub.com\u002Fcsuhan\u002FOneLLM) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fcsuhan\u002FOneLLM) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMeituan-AutoML\u002FLenna.svg?style=social&label=Star) \u003Cbr> [**Lenna: Language Enhanced Reasoning Detection Assistant**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02433.pdf) \u003Cbr> | arXiv | 2023-12-05 | [Github](https:\u002F\u002Fgithub.com\u002FMeituan-AutoML\u002FLenna) | - | \n| [**VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02310.pdf) | arXiv | 2023-12-04 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRenShuhuai-Andy\u002FTimeChat.svg?style=social&label=Star) \u003Cbr> [**TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02051.pdf) \u003Cbr> | arXiv | 2023-12-04 | [Github](https:\u002F\u002Fgithub.com\u002FRenShuhuai-Andy\u002FTimeChat) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmu-cai\u002Fvip-llava.svg?style=social&label=Star) \u003Cbr> [**Making Large Multimodal Models Understand Arbitrary Visual Prompts**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00784.pdf) \u003Cbr> | CVPR | 2023-12-01 | [Github](https:\u002F\u002Fgithub.com\u002Fmu-cai\u002Fvip-llava) | 
[Demo](https:\u002F\u002Fpages.cs.wisc.edu\u002F~mucai\u002Fvip-llava.html) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvlm-driver\u002FDolphins.svg?style=social&label=Star) \u003Cbr> [**Dolphins: Multimodal Language Model for Driving**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00438.pdf) \u003Cbr> | arXiv | 2023-12-01 | [Github](https:\u002F\u002Fgithub.com\u002Fvlm-driver\u002FDolphins) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpen3DA\u002FLL3DA.svg?style=social&label=Star) \u003Cbr> [**LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18651.pdf) \u003Cbr> | arXiv | 2023-11-30 | [Github](https:\u002F\u002Fgithub.com\u002FOpen3DA\u002FLL3DA) | [Coming soon]() |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuangb23\u002FVTimeLLM.svg?style=social&label=Star) \u003Cbr> [**VTimeLLM: Empower LLM to Grasp Video Moments**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18445.pdf) \u003Cbr> | arXiv | 2023-11-30 | [Github](https:\u002F\u002Fgithub.com\u002Fhuangb23\u002FVTimeLLM\u002F) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-DocOwl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18248.pdf) \u003Cbr> | arXiv | 2023-11-30 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-DocOwl\u002Ftree\u002Fmain\u002FPaperOwl) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLaMA-VID.svg?style=social&label=Star) \u003Cbr> [**LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17043.pdf) \u003Cbr> | arXiv | 2023-11-28 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID) | [Coming soon]() |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLMGA.svg?style=social&label=Star) \u003Cbr> [**LLMGA: Multimodal Large Language Model based Generation Assistant**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16500.pdf) \u003Cbr> | arXiv | 2023-11-27 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLMGA) | [Demo](https:\u002F\u002Fbaa55ef8590b623f18.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftingxueronghua\u002FChartLlama-code.svg?style=social&label=Star) \u003Cbr> [**ChartLlama: A Multimodal LLM for Chart Understanding and Generation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16483.pdf) \u003Cbr> | arXiv | 2023-11-27 | [Github](https:\u002F\u002Fgithub.com\u002Ftingxueronghua\u002FChartLlama-code) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**ShareGPT4V: Improving Large Multi-Modal Models with Better Captions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.12793.pdf) \u003Cbr> | arXiv | 2023-11-21 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer\u002Ftree\u002Fmain\u002Fprojects\u002FShareGPT4V) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLin-Chen\u002FShareGPT4V-7B) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frshaojimmy\u002FJiuTian.svg?style=social&label=Star) \u003Cbr> [**LION : Empowering Multimodal 
Large Language Model with Dual-Level Visual Knowledge**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.11860.pdf) \u003Cbr> | arXiv | 2023-11-20 | [Github](https:\u002F\u002Fgithub.com\u002Frshaojimmy\u002FJiuTian) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fembodied-generalist\u002Fembodied-generalist.svg?style=social&label=Star) \u003Cbr> [**An Embodied Generalist Agent in 3D World**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.12871.pdf) \u003Cbr> | arXiv | 2023-11-18 | [Github](https:\u002F\u002Fgithub.com\u002Fembodied-generalist\u002Fembodied-generalist) | [Demo](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=mlnjz4eSjB4) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FVideo-LLaVA.svg?style=social&label=Star) \u003Cbr> [**Video-LLaVA: Learning United Visual Representation by Alignment Before Projection**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.10122.pdf) \u003Cbr> | arXiv | 2023-11-16 | [Github](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FVideo-LLaVA) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FChat-UniVi.svg?style=social&label=Star) \u003Cbr> [**Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.08046) \u003Cbr> | CVPR | 2023-11-14 | [Github](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FChat-UniVi) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX2FD\u002FLVIS-INSTRUCT4V.svg?style=social&label=Star) \u003Cbr> [**To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.07574.pdf) \u003Cbr> | arXiv | 2023-11-13 | [Github](https:\u002F\u002Fgithub.com\u002FX2FD\u002FLVIS-INSTRUCT4V) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlpha-VLLM\u002FLLaMA2-Accessory.svg?style=social&label=Star) \u003Cbr> [**SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.07575.pdf) \u003Cbr> | arXiv | 2023-11-13 | [Github](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLLaMA2-Accessory) | [Demo](http:\u002F\u002Fimagebind-llm.opengvlab.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuliang-Liu\u002FMonkey.svg?style=social&label=Star) \u003Cbr> [**Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.06607.pdf) \u003Cbr> | CVPR | 2023-11-11 | [Github](https:\u002F\u002Fgithub.com\u002FYuliang-Liu\u002FMonkey) | [Demo](http:\u002F\u002F27.17.184.224:7681\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLaVA-VL\u002FLLaVA-Plus-Codebase.svg?style=social&label=Star) \u003Cbr> [**LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.05437.pdf) \u003Cbr> | arXiv | 2023-11-09 | [Github](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Plus-Codebase) | [Demo](https:\u002F\u002Fllavaplus.ngrok.io\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNExT-ChatV\u002FNExT-Chat.svg?style=social&label=Star) \u003Cbr> [**NExT-Chat: An LMM for Chat, Detection and 
Segmentation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04498.pdf) \u003Cbr> | arXiv | 2023-11-08 | [Github](https:\u002F\u002Fgithub.com\u002FNExT-ChatV\u002FNExT-Chat) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-Owl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04257.pdf) \u003Cbr> | arXiv | 2023-11-07 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-Owl\u002Ftree\u002Fmain\u002FmPLUG-Owl2) | [Demo](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fdamo\u002FmPLUG-Owl2\u002Fsummary) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLuodian\u002FOtter.svg?style=social&label=Star) \u003Cbr> [**OtterHD: A High-Resolution Multi-modality Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04219.pdf) \u003Cbr> | arXiv | 2023-11-07 | [Github](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter) | - |\n| [**CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03354.pdf) | arXiv | 2023-11-06 | [Coming soon]() | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FgroundingLMM.svg?style=social&label=Star) \u003Cbr> [**GLaMM: Pixel Grounding Large Multimodal Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03356.pdf) \u003Cbr> | CVPR | 2023-11-06 | [Github](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FgroundingLMM) | [Demo](https:\u002F\u002Fglamm.mbzuai-oryx.ngrok.app\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FComVint.svg?style=social&label=Star) \u003Cbr> [**What Makes for Good Visual Instructions? 
Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.01487.pdf) \u003Cbr> | arXiv | 2023-11-02| [Github](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FComVint) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FMiniGPT-4.svg?style=social&label=Star) \u003Cbr> [**MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.09478.pdf) \u003Cbr> | arXiv | 2023-10-14 | [Github](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FSALMONN.svg?style=social&label=Star) \u003Cbr> [**SALMONN: Towards Generic Hearing Abilities for Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.13289) \u003Cbr> | ICLR | 2023-10-20 | [Github](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FSALMONN) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fapple\u002Fml-ferret.svg?style=social&label=Star) \u003Cbr> [**Ferret: Refer and Ground Anything Anywhere at Any Granularity**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.07704.pdf) \u003Cbr> | arXiv | 2023-10-11 | [Github](https:\u002F\u002Fgithub.com\u002Fapple\u002Fml-ferret) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FCogVLM.svg?style=social&label=Star) \u003Cbr> [**CogVLM: Visual Expert For Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03079.pdf) \u003Cbr> | arXiv | 2023-10-09 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVLM) | [Demo](http:\u002F\u002F36.103.203.44:7861\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaotian-liu\u002FLLaVA.svg?style=social&label=Star) \u003Cbr> [**Improved Baselines with Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.03744.pdf) \u003Cbr> | arXiv | 2023-10-05 | [Github](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) | [Demo](https:\u002F\u002Fllava.hliu.cc\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FLanguageBind.svg?style=social&label=Star) \u003Cbr> [**LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01852.pdf) \u003Cbr> | ICLR | 2023-10-03 | [Github](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FLanguageBind) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FLanguageBind) | \n![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSY-Xuan\u002FPink.svg?style=social&label=Star) \u003Cbr> [**Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00582.pdf) | arXiv | 2023-10-01 | [Github](https:\u002F\u002Fgithub.com\u002FSY-Xuan\u002FPink) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthunlp\u002FMuffin.svg?style=social&label=Star) \u003Cbr> [**Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00653.pdf) \u003Cbr> | arXiv | 2023-10-01 | [Github](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FMuffin) | Local Demo | \n| [**AnyMAL: An Efficient and Scalable Any-Modality Augmented Language 
Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.16058.pdf) | arXiv | 2023-09-27 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.15112.pdf) \u003Cbr> | arXiv | 2023-09-26 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRunpeiDong\u002FDreamLLM.svg?style=social&label=Star) \u003Cbr> [**DreamLLM: Synergistic Multimodal Comprehension and Creation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.11499.pdf) \u003Cbr> | ICLR | 2023-09-20 | [Github](https:\u002F\u002Fgithub.com\u002FRunpeiDong\u002FDreamLLM) | [Coming soon]() |\n| [**An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.09958.pdf) | arXiv | 2023-09-18 | [Coming soon]() | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSihengLi99\u002FTextBind.svg?style=social&label=Star) \u003Cbr> [**TextBind: Multi-turn Interleaved Multimodal Instruction-following**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.08637.pdf) \u003Cbr> | arXiv | 2023-09-14 | [Github](https:\u002F\u002Fgithub.com\u002FSihengLi99\u002FTextBind) | [Demo](https:\u002F\u002Failabnlp.tencent.com\u002Fresearch_demos\u002Ftextbind\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNExT-GPT\u002FNExT-GPT.svg?style=social&label=Star) \u003Cbr> [**NExT-GPT: Any-to-Any Multimodal LLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.05519.pdf) \u003Cbr> | arXiv | 2023-09-11 | [Github](https:\u002F\u002Fgithub.com\u002FNExT-GPT\u002FNExT-GPT) | [Demo](https:\u002F\u002Ffc7a82a1c76b336b6f.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUCSC-VLAA\u002FSight-Beyond-Text.svg?style=social&label=Star) \u003Cbr> [**Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.07120.pdf) \u003Cbr> | arXiv | 2023-09-13 | [Github](https:\u002F\u002Fgithub.com\u002FUCSC-VLAA\u002FSight-Beyond-Text) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FLLaMA-Adapter.svg?style=social&label=Star) \u003Cbr> [**ImageBind-LLM: Multi-modality Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.03905.pdf) \u003Cbr> | arXiv | 2023-09-07 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FLLaMA-Adapter) | [Demo](http:\u002F\u002Fimagebind-llm.opengvlab.com\u002F) |\n| [**Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.02591.pdf) | arXiv | 2023-09-05 | - | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenRobotLab\u002FPointLLM.svg?style=social&label=Star) \u003Cbr> [**PointLLM: Empowering Large Language Models to Understand Point Clouds**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16911.pdf) \u003Cbr> | arXiv | 2023-08-31 | [Github](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FPointLLM) | [Demo](http:\u002F\u002F101.230.144.196\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHYPJUDY\u002FSparkles.svg?style=social&label=Star) \u003Cbr> [**✨Sparkles: Unlocking Chats 
Across Multiple Images for Multimodal Instruction-Following Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16463.pdf) \u003Cbr> | arXiv | 2023-08-31 | [Github](https:\u002F\u002Fgithub.com\u002FHYPJUDY\u002FSparkles) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopendatalab\u002FMLLM-DataEngine.svg?style=social&label=Star) \u003Cbr> [**MLLM-DataEngine: An Iterative Refinement Approach for MLLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.13566.pdf) \u003Cbr> | arXiv | 2023-08-25 | [Github](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMLLM-DataEngine) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPVIT-official\u002FPVIT.svg?style=social&label=Star) \u003Cbr> [**Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.13437.pdf) \u003Cbr> | arXiv | 2023-08-25 | [Github](https:\u002F\u002Fgithub.com\u002FPVIT-official\u002FPVIT) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FPVIT\u002Fpvit) |  \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen-VL.svg?style=social&label=Star) \u003Cbr> [**Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.12966.pdf) \u003Cbr> | arXiv | 2023-08-24 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-VL) | [Demo](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fqwen\u002FQwen-VL-Chat-Demo\u002Fsummary) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenBMB\u002FVisCPM.svg?style=social&label=Star) \u003Cbr> [**Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.12038.pdf) \u003Cbr> | ICLR | 2023-08-23 | [Github](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisCPM) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopenbmb\u002Fviscpm-chat) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ficoz69\u002FStableLLAVA.svg?style=social&label=Star) \u003Cbr> [**StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.10253.pdf) \u003Cbr> | arXiv | 2023-08-20 | [Github](https:\u002F\u002Fgithub.com\u002Ficoz69\u002FStableLLAVA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmlpc-ucsd\u002FBLIVA.svg?style=social&label=Star) \u003Cbr> [**BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.09936.pdf) \u003Cbr> | arXiv | 2023-08-19 | [Github](https:\u002F\u002Fgithub.com\u002Fmlpc-ucsd\u002FBLIVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmlpc-lab\u002FBLIVA) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDCDmllm\u002FCheetah.svg?style=social&label=Star) \u003Cbr> [**Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.04152.pdf) \u003Cbr> | arXiv | 2023-08-08 | [Github](https:\u002F\u002Fgithub.com\u002FDCDmllm\u002FCheetah) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FAll-Seeing.svg?style=social&label=Star) \u003Cbr> [**The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.01907.pdf) \u003Cbr> | ICLR | 
2023-08-03 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAll-Seeing) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenGVLab\u002Fall-seeing) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLISA.svg?style=social&label=Star) \u003Cbr> [**LISA: Reasoning Segmentation via Large Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.00692.pdf) \u003Cbr> | arXiv | 2023-08-01 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLISA) | [Demo](http:\u002F\u002F103.170.5.190:7860) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frese1f\u002FMovieChat.svg?style=social&label=Star) \u003Cbr> [**MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.16449.pdf) \u003Cbr> | arXiv | 2023-07-31 | [Github](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUMass-Foundation-Model\u002F3D-LLM.svg?style=social&label=Star) \u003Cbr> [**3D-LLM: Injecting the 3D World into Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.12981.pdf) \u003Cbr> | arXiv | 2023-07-24 | [Github](https:\u002F\u002Fgithub.com\u002FUMass-Foundation-Model\u002F3D-LLM) | - | \n| [**ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.09474.pdf) \u003Cbr> | arXiv | 2023-07-18 | - | [Demo](https:\u002F\u002Fchatspot.streamlit.app\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmagic-research\u002Fbubogpt.svg?style=social&label=Star) \u003Cbr> [**BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.08581.pdf) \u003Cbr> | arXiv | 2023-07-17 | [Github](https:\u002F\u002Fgithub.com\u002Fmagic-research\u002Fbubogpt) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmagicr\u002FBuboGPT) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBAAI-DCAI\u002FVisual-Instruction-Tuning.svg?style=social&label=Star) \u003Cbr> [**SVIT: Scaling up Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.04087.pdf) \u003Cbr> | arXiv | 2023-07-09 | [Github](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FVisual-Instruction-Tuning) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjshilong\u002FGPT4RoI.svg?style=social&label=Star) \u003Cbr> [**GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.03601.pdf) \u003Cbr> | arXiv | 2023-07-07 | [Github](https:\u002F\u002Fgithub.com\u002Fjshilong\u002FGPT4RoI) | [Demo](http:\u002F\u002F139.196.83.164:7000\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002Flynx-llm.svg?style=social&label=Star) \u003Cbr> [**What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02469.pdf) \u003Cbr> | arXiv | 2023-07-05 | [Github](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Flynx-llm)  | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-DocOwl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02499.pdf) \u003Cbr> | arXiv | 2023-07-04 | 
[Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-DocOwl) | [Demo](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fdamo\u002FmPLUG-DocOwl\u002Fsummary) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChenDelong1999\u002Fpolite_flamingo.svg?style=social&label=Star) \u003Cbr> [**Visual Instruction Tuning with Polite Flamingo**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.01003.pdf) \u003Cbr >| arXiv | 2023-07-03 | [Github](https:\u002F\u002Fgithub.com\u002FChenDelong1999\u002Fpolite_flamingo) | [Demo](http:\u002F\u002Fclever_flamingo.xiaoice.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSALT-NLP\u002FLLaVAR.svg?style=social&label=Star) \u003Cbr> [**LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.17107.pdf) \u003Cbr> | arXiv | 2023-06-29 | [Github](https:\u002F\u002Fgithub.com\u002FSALT-NLP\u002FLLaVAR) | [Demo](https:\u002F\u002Feba470c07c805702b8.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshikras\u002Fshikra.svg?style=social&label=Star) \u003Cbr> [**Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15195.pdf) \u003Cbr> | arXiv | 2023-06-27 | [Github](https:\u002F\u002Fgithub.com\u002Fshikras\u002Fshikra) | [Demo](http:\u002F\u002Fdemo.zhaozhang.net:7860\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenMotionLab\u002FMotionGPT.svg?style=social&label=Star) \u003Cbr> [**MotionGPT: Human Motion as a Foreign Language**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14795.pdf) \u003Cbr> | arXiv | 2023-06-26 | [Github](https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flyuchenyang\u002FMacaw-LLM.svg?style=social&label=Star) \u003Cbr> [**Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.09093.pdf) \u003Cbr> | arXiv | 2023-06-15 | [Github](https:\u002F\u002Fgithub.com\u002Flyuchenyang\u002FMacaw-LLM) | [Coming soon]() |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenLAMM\u002FLAMM.svg?style=social&label=Star) \u003Cbr> [**LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.06687.pdf) \u003Cbr> | arXiv | 2023-06-11 | [Github](https:\u002F\u002Fgithub.com\u002FOpenLAMM\u002FLAMM) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopenlamm\u002FLAMM) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FVideo-ChatGPT.svg?style=social&label=Star) \u003Cbr> [**Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05424.pdf) \u003Cbr> | arXiv | 2023-06-08 | [Github](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT) | [Demo](https:\u002F\u002Fwww.ival-mbzuai.com\u002Fvideo-chatgpt) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLuodian\u002FOtter.svg?style=social&label=Star) \u003Cbr> [**MIMIC-IT: Multi-Modal In-Context Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05425.pdf) \u003Cbr> | arXiv | 2023-06-08 | [Github](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter) | 
[Demo](https:\u002F\u002Fotter.cliangyu.com\u002F) |\n| [**M\u003Csup>3\u003C\u002Fsup>IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.04387.pdf) | arXiv | 2023-06-07 | - | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideo-LLaMA.svg?style=social&label=Star) \u003Cbr> [**Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.02858.pdf) \u003Cbr> | arXiv | 2023-06-05 | [Github](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideo-LLaMA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FDAMO-NLP-SG\u002FVideo-LLaMA) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLLaVA-Med.svg?style=social&label=Star) \u003Cbr> [**LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.00890.pdf) \u003Cbr> | arXiv | 2023-06-01 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLaVA-Med) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FStevenGrove\u002FGPT4Tools.svg?style=social&label=Star) \u003Cbr> [**GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.18752.pdf) \u003Cbr> | arXiv | 2023-05-30 | [Github](https:\u002F\u002Fgithub.com\u002FStevenGrove\u002FGPT4Tools) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fstevengrove\u002FGPT4Tools) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyxuansu\u002FPandaGPT.svg?style=social&label=Star) \u003Cbr> [**PandaGPT: One Model To Instruction-Follow Them All**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.16355.pdf) \u003Cbr> | arXiv | 2023-05-25 | [Github](https:\u002F\u002Fgithub.com\u002Fyxuansu\u002FPandaGPT) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FGMFTBY\u002FPandaGPT) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjoez17\u002FChatBridge.svg?style=social&label=Star) \u003Cbr> [**ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.16103.pdf) \u003Cbr> | arXiv | 2023-05-25 | [Github](https:\u002F\u002Fgithub.com\u002Fjoez17\u002FChatBridge) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fluogen1996\u002FLaVIN.svg?style=social&label=Star) \u003Cbr> [**Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.15023.pdf) \u003Cbr> | arXiv | 2023-05-24 | [Github](https:\u002F\u002Fgithub.com\u002Fluogen1996\u002FLaVIN) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOptimalScale\u002FDetGPT.svg?style=social&label=Star) \u003Cbr> [**DetGPT: Detect What You Need via Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14167.pdf) \u003Cbr> | arXiv | 2023-05-23 | [Github](https:\u002F\u002Fgithub.com\u002FOptimalScale\u002FDetGPT) | [Demo](https:\u002F\u002Fd3c431c0c77b1d9010.gradio.live\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FPengi.svg?style=social&label=Star) \u003Cbr> [**Pengi: An Audio Language Model for Audio Tasks**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.11834.pdf) \u003Cbr> | NeurIPS | 2023-05-19 | 
[Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPengi) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FVisionLLM.svg?style=social&label=Star) \u003Cbr> [**VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.11175.pdf) \u003Cbr> | arXiv | 2023-05-18 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FVisionLLM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuanGongND\u002Fltu.svg?style=social&label=Star) \u003Cbr> [**Listen, Think, and Understand**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.10790.pdf) \u003Cbr> | arXiv | 2023-05-18 | [Github](https:\u002F\u002Fgithub.com\u002FYuanGongND\u002Fltu) | [Demo](https:\u002F\u002Fgithub.com\u002FYuanGongND\u002Fltu) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FVisualGLM-6B.svg?style=social&label=Star) \u003Cbr> **VisualGLM-6B** \u003Cbr> | - | 2023-05-17 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FVisualGLM-6B) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxiaoman-zhang\u002FPMC-VQA.svg?style=social&label=Star) \u003Cbr> [**PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.10415.pdf) \u003Cbr> | arXiv | 2023-05-17 | [Github](https:\u002F\u002Fgithub.com\u002Fxiaoman-zhang\u002FPMC-VQA) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsalesforce\u002FLAVIS.svg?style=social&label=Star) \u003Cbr> [**InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.06500.pdf) \u003Cbr> | arXiv | 2023-05-11 | [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Finstructblip) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FAsk-Anything.svg?style=social&label=Star) \u003Cbr> [**VideoChat: Chat-Centric Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.06355.pdf) \u003Cbr> | arXiv | 2023-05-10 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything) | [Demo](https:\u002F\u002Fask.opengvlab.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopen-mmlab\u002FMultimodal-GPT.svg?style=social&label=Star) \u003Cbr> [**MultiModal-GPT: A Vision and Language Model for Dialogue with Humans**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04790.pdf) \u003Cbr> | arXiv | 2023-05-08 | [Github](https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002FMultimodal-GPT) | [Demo](https:\u002F\u002Fmmgpt.openmmlab.org.cn\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fphellonchen\u002FX-LLM.svg?style=social&label=Star) \u003Cbr> [**X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04160.pdf) \u003Cbr> | arXiv | 2023-05-07 | [Github](https:\u002F\u002Fgithub.com\u002Fphellonchen\u002FX-LLM) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYunxinLi\u002FLingCloud.svg?style=social&label=Star) \u003Cbr> [**LMEye: An Interactive Perception Network for Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.03701.pdf) \u003Cbr> | arXiv | 2023-05-05 | 
[Github](https:\u002F\u002Fgithub.com\u002FYunxinLi\u002FLingCloud) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FLLaMA-Adapter.svg?style=social&label=Star) \u003Cbr> [**LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.15010.pdf) \u003Cbr> | arXiv | 2023-04-28 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FLLaMA-Adapter) | [Demo](http:\u002F\u002Fllama-adapter.opengvlab.com\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-Owl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.14178.pdf) \u003Cbr> | arXiv | 2023-04-27 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-Owl) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FMAGAer13\u002FmPLUG-Owl) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FMiniGPT-4.svg?style=social&label=Star) \u003Cbr> [**MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.10592.pdf) \u003Cbr> | arXiv | 2023-04-20 | [Github](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaotian-liu\u002FLLaVA.svg?style=social&label=Star) \u003Cbr> [**Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.08485.pdf) \u003Cbr> | NeurIPS | 2023-04-17 | [GitHub](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) | [Demo](https:\u002F\u002Fllava.hliu.cc\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FLLaMA-Adapter.svg?style=social&label=Star) \u003Cbr> [**LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.16199.pdf) \u003Cbr> | ICLR | 2023-03-28 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FLLaMA-Adapter) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fcsuhan\u002FLLaMA-Adapter) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVT-NLP\u002FMultiInstruct.svg?style=social&label=Star) \u003Cbr> [**MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10773.pdf) \u003Cbr> | ACL | 2022-12-21 | [Github](https:\u002F\u002Fgithub.com\u002FVT-NLP\u002FMultiInstruct) | - |\n\n## 多模态幻觉\n|  标题  |   场所  |   日期   |   代码   |   演示   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F1zhou-Wang\u002FMemVR.svg?style=social&label=Star) \u003Cbr> [**回答前再看两眼：用于缓解多模态大语言模型中幻觉的记忆空间视觉回溯**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.03577) \u003Cbr> | arXiv | 2024-10-04 | [Github](https:\u002F\u002Fgithub.com\u002F1zhou-Wang\u002FMemVR) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fnickjiang2378\u002Fvl-interp.svg?style=social&label=Star) \u003Cbr> [**解释与编辑视觉-语言表示以缓解幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.02762) \u003Cbr> | arXiv | 2024-10-03 | [Github](https:\u002F\u002Fgithub.com\u002Fnickjiang2378\u002Fvl-interp\u002F) | - |\n| [**FIHA：基于戴维森场景图的视觉-语言模型自主幻觉评估**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.13612) | arXiv | 2024-09-20 | 
[链接](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002FFIHA-45BB) | - |\n| [**通过主动检索增强缓解大型视觉-语言模型中的幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.00555) | arXiv | 2024-08-01 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLALBJ\u002FPAI.svg?style=social&label=Star) \u003Cbr> [**更加关注图像：一种无需训练即可缓解LVLMs中幻觉的方法**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.21771) \u003Cbr> | ECCV | 2024-07-31 | [Github](https:\u002F\u002Fgithub.com\u002FLALBJ\u002FPAI) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmrwu-mac\u002FR-Bench.svg?style=social&label=Star) \u003Cbr> [**评估和分析LVLMs中的关系幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.16449) \u003Cbr> | ICML | 2024-06-24 | [Github](https:\u002F\u002Fgithub.com\u002Fmrwu-mac\u002FR-Bench) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLackel\u002FAGLA.svg?style=social&label=Star) \u003Cbr> [**AGLA：利用全局与局部注意力的组合来缓解大型视觉-语言模型中的对象幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.12718) \u003Cbr> | arXiv | 2024-06-18 | [Github](https:\u002F\u002Fgithub.com\u002FLackel\u002FAGLA) | - |\n| [**CODE：对比自动生成的描述以对抗大型多模态模型中的幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.01920) | arXiv | 2024-06-04 | [即将推出]() | - |\n| [**通过数据增强的对比微调缓解对象幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.18654) | arXiv | 2024-05-28 | [即将推出]() | - |\n| [**VDGD：通过弥合视觉感知差距来缓解认知提示中的LVLM幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.15683) | arXiv | 2024-05-24 | [即将推出]() | - |\n| [**通过细粒度的AI反馈检测并缓解大型视觉语言模型中的幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.14233.pdf) | arXiv | 2024-04-22 | - | - |\n| [**使用指令对比解码缓解大型视觉-语言模型中的幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.18715.pdf) | arXiv | 2024-03-27 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIVY-LVLM\u002FCounterfactual-Inception.svg?style=social&label=Star) \u003Cbr> [**如果……呢？：反事实启发式方法以缓解大型多模态模型中的幻觉效应**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.13513.pdf) \u003Cbr> | arXiv | 2024-03-20 | [Github](https:\u002F\u002Fgithub.com\u002FIVY-LVLM\u002FCounterfactual-Inception) | - |\n| [**通过自举偏好优化强化多模态大语言模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.08730.pdf) | arXiv | 2024-03-13 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfzhang114\u002FLLaVA-Align.svg?style=social&label=Star) \u003Cbr> [**去偏见多模态大语言模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.05262) \u003Cbr> | arXiv | 2024-03-08 | [Github](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FLLaVA-Align) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBillChan226\u002FHALC.svg?style=social&label=Star) \u003Cbr> [**HALC：通过适应性焦点-对比解码减少对象幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.00425.pdf) \u003Cbr> | arXiv | 2024-03-01 | [Github](https:\u002F\u002Fgithub.com\u002FBillChan226\u002FHALC) | - |\n| [**IBD：通过图像偏向解码缓解大型视觉-语言模型中的幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.18476.pdf) | arXiv | 2024-02-28 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyuezih\u002Fless-is-more.svg?style=social&label=Star) \u003Cbr> [**少即是多：从EOS决策角度缓解多模态幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.14545.pdf) \u003Cbr> | arXiv | 2024-02-22 | [Github](https:\u002F\u002Fgithub.com\u002Fyuezih\u002Fless-is-more) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHyperwjf\u002FLogicCheckGPT.svg?style=social&label=Star) \u003Cbr> 
[**逻辑闭环：揭示大型视觉-语言模型中的对象幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11622.pdf) \u003Cbr> | arXiv | 2024-02-18 | [Github](https:\u002F\u002Fgithub.com\u002FHyperwjf\u002FLogicCheckGPT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMasaiahHan\u002FCorrelationQA.svg?style=social&label=Star) \u003Cbr> [**本能偏差：虚假图像导致MLLMs中的幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.03757.pdf) \u003Cbr> | arXiv | 2024-02-06 | [Github](https:\u002F\u002Fgithub.com\u002FMasaiahHan\u002FCorrelationQA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenKG-ORG\u002FEasyDetect.svg?style=social&label=Star) \u003Cbr> [**多模态大语言模型统一幻觉检测**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.03190.pdf) \u003Cbr> | arXiv | 2024-02-05 | [Github](https:\u002F\u002Fgithub.com\u002FOpenKG-ORG\u002FEasyDetect) | - |\n| [**大型视觉-语言模型中幻觉的综述**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.00253.pdf) | arXiv | 2024-02-01 | - | - |\n| [**时间洞察力提升：缓解多模态大语言模型中的时间幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.09861.pdf) | arXiv | 2024-01-18 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-HalOwl.svg?style=social&label=Star) \u003Cbr> [**面向多模态大语言模型的幻觉增强对比学习**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.06968.pdf) \u003Cbr> | arXiv | 2023-12-12 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-HalOwl\u002Ftree\u002Fmain\u002Fhacl) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fassafbk\u002Fmocha_code.svg?style=social&label=Star) \u003Cbr> [**MOCHa：多目标强化学习缓解字幕幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03631.pdf) \u003Cbr> | arXiv | 2023-12-06 | [Github](https:\u002F\u002Fgithub.com\u002Fassafbk\u002Fmocha_code) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAnonymousanoy\u002FFOHE.svg?style=social&label=Star) \u003Cbr> [**通过字幕重写微调大型视觉-语言模型以缓解细粒度幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.01701.pdf) \u003Cbr> | arXiv | 2023-12-04 | [Github](https:\u002F\u002Fgithub.com\u002FAnonymousanoy\u002FFOHE) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRLHF-V\u002FRLHF-V.svg?style=social&label=Star) \u003Cbr> [**RLHF-V：通过来自细粒度纠正性人类反馈的行为对齐，迈向可信的MLLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00849.pdf) \u003Cbr> | arXiv | 2023-12-01 | [Github](https:\u002F\u002Fgithub.com\u002FRLHF-V\u002FRLHF-V) | [演示](http:\u002F\u002F120.92.209.146:8081\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshikiw\u002FOPERA.svg?style=social&label=Star) \u003Cbr> [**OPERA：通过过度信任惩罚和回顾分配缓解多模态大语言模型中的幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17911.pdf) \u003Cbr> | CVPR | 2023-11-29 | [Github](https:\u002F\u002Fgithub.com\u002Fshikiw\u002FOPERA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVCD.svg?style=social&label=Star) \u003Cbr> [**通过视觉对比解码缓解大型视觉-语言模型中的对象幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16922.pdf) \u003Cbr> | CVPR | 2023-11-28 | [Github](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVCD) | - |\n| [**超越幻觉：通过幻觉感知直接偏好优化提升LVLMs性能**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16839.pdf) | arXiv | 2023-11-28 | [Github](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FHA-DPO) | [即将推出]() |\n| [**借助视觉监督缓解视觉语言模型中的幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16479.pdf) | arXiv | 2023-11-27 | - | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuqifan1117\u002FHalluciDoctor.svg?style=social&label=Star) \u003Cbr> [**HalluciDoctor：缓解视觉指令数据中的幻觉毒性**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.13614.pdf) \u003Cbr> | arXiv | 2023-11-22 | [Github](https:\u002F\u002Fgithub.com\u002FYuqifan1117\u002FHalluciDoctor) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjunyangwang0410\u002FAMBER.svg?style=social&label=Star) \u003Cbr> [**无LLM的多维度基准测试，用于MLLMs幻觉评估**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.07397.pdf) \u003Cbr> | arXiv | 2023-11-13 | [Github](https:\u002F\u002Fgithub.com\u002Fjunyangwang0410\u002FAMBER) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbcdnlp\u002FFAITHSCORE.svg?style=social&label=Star) \u003Cbr> [**FAITHSCORE：评估大型视觉-语言模型中的幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.01477.pdf) \u003Cbr> | arXiv | 2023-11-02 | [Github](https:\u002F\u002Fgithub.com\u002Fbcdnlp\u002FFAITHSCORE) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FWoodpecker.svg?style=social&label=Star) \u003Cbr> [**啄木鸟：多模态大语言模型的幻觉修正**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.16045.pdf) \u003Cbr> | arXiv | 2023-10-24 | [Github](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FWoodpecker) | [演示](https:\u002F\u002Fdeb6a97bae6fab67ae.gradio.live\u002F) |\n| [**负对象存在评估（NOPE）用于测量视觉-语言模型中的对象幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.05338.pdf) | arXiv | 2023-10-09 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbronyayang\u002FHallE_Switch.svg?style=social&label=Star) \u003Cbr> [**HallE-Switch：重新思考并控制大型视觉语言模型中为详细字幕而产生的对象存在幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01779.pdf) \u003Cbr> | arXiv | 2023-10-03 | [Github](https:\u002F\u002Fgithub.com\u002Fbronyayang\u002FHallE_Switch) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYiyangZhou\u002FLURE.svg?style=social&label=Star) \u003Cbr> [**分析并缓解大型视觉-语言模型中的对象幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00754.pdf) \u003Cbr> | ICLR | 2023-10-01 | [Github](https:\u002F\u002Fgithub.com\u002FYiyangZhou\u002FLURE) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fllava-rlhf\u002FLLaVA-RLHF.svg?style=social&label=Star) \u003Cbr> [**用事实增强的RLHF对齐大型多模态模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.14525.pdf) \u003Cbr> | arXiv | 2023-09-25 | [Github](https:\u002F\u002Fgithub.com\u002Fllava-rlhf\u002FLLaVA-RLHF) | [演示](http:\u002F\u002Fpitt.lti.cs.cmu.edu:7890\u002F) |\n| [**多模态大语言模型中失认症的评估与缓解**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.04041.pdf) | arXiv | 2023-09-07 | - | - |\n| [**CIEM：更好的指令微调的对比指令评估方法**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.02301.pdf) | arXiv | 2023-09-05 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjunyangwang0410\u002FHaELM.svg?style=social&label=Star) \u003Cbr> [**大型视觉-语言模型中幻觉的评估与分析**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.15126.pdf) \u003Cbr> | arXiv | 2023-08-29 | [Github](https:\u002F\u002Fgithub.com\u002Fjunyangwang0410\u002FHaELM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopendatalab\u002FVIGC.svg?style=social&label=Star) \u003Cbr> [**VIGC：视觉指令生成与修正**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.12714.pdf) \u003Cbr> | arXiv | 2023-08-24 | [Github](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FVIGC) | [演示](https:\u002F\u002Fopendatalab.github.io\u002FVIGC) |\n| 
[**检测并预防大型视觉语言模型中的幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.06394.pdf) | arXiv | 2023-08-11 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFuxiaoLiu\u002FLRV-Instruction.svg?style=social&label=Star) \u003Cbr> [**通过稳健的指令微调缓解大型多模态模型中的幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14565.pdf) \u003Cbr> | ICLR | 2023-06-26 | [Github](https:\u002F\u002Fgithub.com\u002FFuxiaoLiu\u002FLRV-Instruction) | [演示](https:\u002F\u002F7b6590ed039a06475d.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FPOPE.svg?style=social&label=Star) \u003Cbr> [**评估大型视觉-语言模型中的对象幻觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.10355.pdf) \u003Cbr> | EMNLP | 2023-05-17 | [Github](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FPOPE) | - |\n\n## 多模态上下文学习\n|  标题  |   会议\u002F平台   |   日期   |   代码   |   演示   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| [**大型视觉-语言模型的视觉上下文学习**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11574.pdf) | arXiv | 2024-02-18 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuanJianhao508\u002FRAG-Driver.svg?style=social&label=Star) \u003Cbr> [**RAG-Driver：基于检索增强型多模态大语言模型上下文学习的可泛化驾驶解释**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.10828) \u003Cbr> | RSS | 2024-02-16 | [Github](https:\u002F\u002Fgithub.com\u002FYuanJianhao508\u002FRAG-Driver) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUW-Madison-Lee-Lab\u002FCoBSAT.svg?style=social&label=Star) \u003Cbr> [**多模态大语言模型能否进行文本到图像的上下文学习？**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.01293.pdf) \u003Cbr> | arXiv | 2024-02-02 | [Github](https:\u002F\u002Fgithub.com\u002FUW-Madison-Lee-Lab\u002FCoBSAT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu.svg?style=social&label=Star) \u003Cbr> [**生成式多模态模型是上下文学习者**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.13286) \u003Cbr> | CVPR | 2023-12-20 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu\u002Ftree\u002Fmain\u002FEmu2) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FBAAI\u002FEmu2) |\n| [**劫持大型多模态模型中的上下文**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07553.pdf) | arXiv | 2023-12-07 | - | - |\n| [**迈向更加统一的视觉上下文理解**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02520.pdf) | arXiv | 2023-12-05 | - | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHaozheZhao\u002FMIC.svg?style=social&label=Star) \u003Cbr> [**MMICL：通过多模态上下文学习赋能视觉-语言模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.07915.pdf) \u003Cbr> | arXiv | 2023-09-14 | [Github](https:\u002F\u002Fgithub.com\u002FHaozheZhao\u002FMIC) | [Demo](https:\u002F\u002F8904cdd23621858859.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fisekai-portal\u002FLink-Context-Learning.svg?style=social&label=Star) \u003Cbr> [**面向多模态LLM的链接上下文学习**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.07891.pdf) \u003Cbr> | arXiv | 2023-08-15 | [Github](https:\u002F\u002Fgithub.com\u002Fisekai-portal\u002FLink-Context-Learning) | [Demo](http:\u002F\u002F117.144.81.99:20488\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmlfoundations\u002Fopen_flamingo.svg?style=social&label=Star) \u003Cbr> [**OpenFlamingo：用于训练大型自回归视觉-语言模型的开源框架**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.01390.pdf) \u003Cbr> | arXiv | 2023-08-02 | [Github](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_flamingo) | 
[Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopenflamingo\u002FOpenFlamingo) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnap-stanford\u002Fmed-flamingo.svg?style=social&label=Star) \u003Cbr> [**Med-Flamingo：一种多模态医学少样本学习器**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.15189.pdf) \u003Cbr> | arXiv | 2023-07-27 | [Github](https:\u002F\u002Fgithub.com\u002Fsnap-stanford\u002Fmed-flamingo) | 本地演示 | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu.svg?style=social&label=Star) \u003Cbr> [**多模态下的生成式预训练**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.05222.pdf) \u003Cbr> | ICLR | 2023-07-11 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu\u002Ftree\u002Fmain\u002FEmu1) | [Demo](http:\u002F\u002F218.91.113.230:9002\u002F) |\n| [**AVIS：利用大型语言模型实现自主视觉信息搜索**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.08129.pdf) | arXiv | 2023-06-13 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLuodian\u002FOtter.svg?style=social&label=Star) \u003Cbr> [**MIMIC-IT：多模态上下文指令调优**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05425.pdf) \u003Cbr> | arXiv | 2023-06-08 | [Github](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter) | [Demo](https:\u002F\u002Fotter.cliangyu.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyongliang-wu\u002FExploreCfg.svg?style=social&label=Star) \u003Cbr> [**探索用于图像字幕生成的多样化上下文配置**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14800.pdf) \u003Cbr> | NeurIPS | 2023-05-24 | [Github](https:\u002F\u002Fgithub.com\u002Fyongliang-wu\u002FExploreCfg) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flupantech\u002Fchameleon-llm.svg?style=social&label=Star) \u003Cbr> [**Chameleon：利用大型语言模型实现即插即用的组合推理**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.09842.pdf) \u003Cbr> | arXiv | 2023-04-19 | [Github](https:\u002F\u002Fgithub.com\u002Flupantech\u002Fchameleon-llm) | [Demo](https:\u002F\u002Fchameleon-llm.github.io\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FJARVIS.svg?style=social&label=Star) \u003Cbr> [**HuggingGPT：借助ChatGPT及其在HuggingFace中的伙伴解决AI任务**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.17580.pdf) \u003Cbr> | arXiv | 2023-03-30 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FJARVIS) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FHuggingGPT) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMM-REACT.svg?style=social&label=Star) \u003Cbr> [**MM-REACT：提示ChatGPT进行多模态推理与行动**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.11381.pdf) \u003Cbr> | arXiv | 2023-03-20 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMM-REACT) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft-cognitive-service\u002Fmm-react) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMAEHCM\u002FICL-D3IE.svg?style=social&label=Star) \u003Cbr> [**ICL-D3IE：使用多样化的演示进行文档信息抽取的上下文学习**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05063.pdf) \u003Cbr> | ICCV | 2023-03-09 | [Github](https:\u002F\u002Fgithub.com\u002FMAEHCM\u002FICL-D3IE) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMILVLG\u002Fprophet.svg?style=social&label=Star) \u003Cbr> [**利用答案启发式提示大型语言模型进行基于知识的视觉问答**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.01903.pdf) \u003Cbr> | CVPR | 2023-03-03 | 
[Github](https:\u002F\u002Fgithub.com\u002FMILVLG\u002Fprophet) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Fvisprog.svg?style=social&label=Star) \u003Cbr> [**视觉编程：无需训练的组合式视觉推理**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FGupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf) \u003Cbr> | CVPR | 2022-11-18 | [Github](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fvisprog) | 本地演示 | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FPICa.svg?style=social&label=Star) \u003Cbr> [**关于GPT-3在少样本知识型VQA中的实证研究**](https:\u002F\u002Fojs.aaai.org\u002Findex.php\u002FAAAI\u002Farticle\u002Fdownload\u002F20215\u002F19974) \u003Cbr> | AAAI | 2022-06-28 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPICa) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmlfoundations\u002Fopen_flamingo.svg?style=social&label=Star) \u003Cbr> [**Flamingo：一种用于少样本学习的视觉语言模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.14198.pdf) \u003Cbr> | NeurIPS | 2022-04-29 | [Github](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_flamingo) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fdhansmair\u002Fflamingo-mini-cap) | \n| [**冻结语言模型下的多模态少样本学习**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.13884.pdf) | NeurIPS | 2021-06-25 | - | - |\n\n## 多模态思维链\n|  标题   |    会议\u002F平台    |    日期    |    代码    |    演示    |\n|:--------|:--------------:|:----------:|:----------:|:----------:|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdongyh20\u002FInsight-V.svg?style=social&label=Star) \u003Cbr> [**Insight-V：利用多模态大语言模型探索长链式视觉推理**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.14432) \u003Cbr> | arXiv | 2024-11-21 | [Github](https:\u002F\u002Fgithub.com\u002Fdongyh20\u002FInsight-V) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fggg0919\u002Fcantor.svg?style=social&label=Star) \u003Cbr> [**Cantor：激发MLLM的多模态思维链**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.16033.pdf) \u003Cbr> | arXiv | 2024-04-24 | [Github](https:\u002F\u002Fgithub.com\u002Fggg0919\u002Fcantor) | 本地演示 |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepcs233\u002FVisual-CoT.svg?style=social&label=Star) \u003Cbr> [**Visual CoT：释放多模态语言模型中的思维链推理能力**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.16999.pdf) \u003Cbr> | arXiv | 2024-03-25 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepcs233\u002FVisual-CoT) | 本地演示 |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fchancharikmitra\u002FCCoT.svg?style=social&label=Star) \u003Cbr> [**面向大型多模态模型的组合式思维链提示**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17076) \u003Cbr> | CVPR | 2023-11-27 | [Github](https:\u002F\u002Fgithub.com\u002Fchancharikmitra\u002FCCoT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSooLab\u002FDDCOT.svg?style=social&label=Star) \u003Cbr> [**DDCoT：用于语言模型多模态推理的职责分明思维链提示**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.16436.pdf) \u003Cbr> | NeurIPS | 2023-10-25 | [Github](https:\u002F\u002Fgithub.com\u002FSooLab\u002FDDCOT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshikras\u002Fshikra.svg?style=social&label=Star) \u003Cbr> [**Shikra：释放多模态LLM的指代对话魔力**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15195.pdf) \u003Cbr> | arXiv | 2023-06-27 | [Github](https:\u002F\u002Fgithub.com\u002Fshikras\u002Fshikra) | 
[演示](http:\u002F\u002Fdemo.zhaozhang.net:7860\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FzeroQiaoba\u002FExplainable-Multimodal-Emotion-Reasoning.svg?style=social&label=Star) \u003Cbr> [**可解释的多模态情感推理**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15401.pdf) \u003Cbr> | arXiv | 2023-06-27 | [Github](https:\u002F\u002Fgithub.com\u002FzeroQiaoba\u002FExplainable-Multimodal-Emotion-Reasoning) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEmbodiedGPT\u002FEmbodiedGPT_Pytorch.svg?style=social&label=Star) \u003Cbr> [**EmbodiedGPT：通过具身思维链进行视觉-语言预训练**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.15021.pdf) \u003Cbr> | arXiv | 2023-05-24 | [Github](https:\u002F\u002Fgithub.com\u002FEmbodiedGPT\u002FEmbodiedGPT_Pytorch) | - |\n| [**逐帧思考：用视频补全与预测评估视频思维链**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13903.pdf) | arXiv | 2023-05-23 | - | - |\n| [**T-SciQ：通过大语言模型信号教授多模态思维链推理以解答科学问题**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.03453.pdf) | arXiv | 2023-05-05 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fttengwang\u002FCaption-Anything.svg?style=social&label=Star) \u003Cbr> [**Caption Anything：借助多样化的多模态控件实现交互式图像描述**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.02677.pdf) \u003Cbr> | arXiv | 2023-05-04 | [Github](https:\u002F\u002Fgithub.com\u002Fttengwang\u002FCaption-Anything) | [演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTencentARC\u002FCaption-Anything) |\n| [**视觉思维链：用多模态补全弥合逻辑断层**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.02317.pdf) | arXiv | 2023-05-03 | [即将发布](https:\u002F\u002Fgithub.com\u002Fdannyrose30\u002FVCOT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flupantech\u002Fchameleon-llm.svg?style=social&label=Star) \u003Cbr> [**Chameleon：使用大语言模型实现即插即用的组合式推理**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.09842.pdf) \u003Cbr> | arXiv | 2023-04-19 | [Github](https:\u002F\u002Fgithub.com\u002Flupantech\u002Fchameleon-llm) | [演示](https:\u002F\u002Fchameleon-llm.github.io\u002F) |\n| [**视觉语言模型中的思维链提示调优**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.07919.pdf) | arXiv | 2023-04-16 | [即将发布]() | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMM-REACT.svg?style=social&label=Star) \u003Cbr> [**MM-REACT：提示ChatGPT实现多模态推理与行动**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.11381.pdf) \u003Cbr> | arXiv | 2023-03-20 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMM-REACT) | [演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft-cognitive-service\u002Fmm-react) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FTaskMatrix.svg?style=social&label=Star) \u003Cbr> [**视觉ChatGPT：与视觉基础模型对话、绘图和编辑**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.04671.pdf) \u003Cbr> | arXiv | 2023-03-08 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FTaskMatrix) | [演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002Fvisual_chatgpt) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Famazon-science\u002Fmm-cot.svg?style=social&label=Star) \u003Cbr> [**语言模型中的多模态思维链推理**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00923.pdf) \u003Cbr> | arXiv | 2023-02-02 | [Github](https:\u002F\u002Fgithub.com\u002Famazon-science\u002Fmm-cot) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Fvisprog.svg?style=social&label=Star) \u003Cbr> 
[**视觉编程：无需训练的组合式视觉推理**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FGupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf) \u003Cbr> | CVPR | 2022-11-18 | [Github](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fvisprog) | 本地演示 |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flupantech\u002FScienceQA.svg?style=social&label=Star) \u003Cbr> [**学会解释：通过思维链进行多模态推理以解答科学问题**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2022\u002Ffile\u002F11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf) \u003Cbr> | NeurIPS | 2022-09-20 | [Github](https:\u002F\u002Fgithub.com\u002Flupantech\u002FScienceQA) | - |\n\n## 大语言模型辅助的视觉推理\n|  标题   |    会议\u002F平台    |    日期    |    代码    |    演示    |\n|:--------|:--------------:|:----------:|:----------:|:----------:|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyhy-2000\u002FVideoDeepResearch.svg?style=social&label=Star) \u003Cbr> [**VideoDeepResearch: 基于智能体工具的长视频理解**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.10821) \u003Cbr> | arXiv | 2025-06-12 | [Github](https:\u002F\u002Fgithub.com\u002Fyhy-2000\u002FVideoDeepResearch) | 本地演示 |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLaVi-Lab\u002FVisual-Table.svg?style=social&label=Star) \u003Cbr> [**超越嵌入：视觉表格在多模态模型中的潜力**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.18252.pdf) \u003Cbr> | arXiv | 2024-03-27 | [Github](https:\u002F\u002Fgithub.com\u002FLaVi-Lab\u002FVisual-Table) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpenghao-wu\u002Fvstar.svg?style=social&label=Star) \u003Cbr> [**V∗：引导式视觉搜索作为多模态大语言模型的核心机制**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14135.pdf) \u003Cbr> | arXiv | 2023-12-21 | [Github](https:\u002F\u002Fgithub.com\u002Fpenghao-wu\u002Fvstar) | 本地演示 |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLaVA-VL\u002FLLaVA-Interactive-Demo.svg?style=social&label=Star) \u003Cbr> [**LLaVA-Interactive：图像聊天、分割、生成与编辑的一体化演示**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.00571.pdf) \u003Cbr> | arXiv | 2023-11-01 | [Github](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Interactive-Demo) | [演示](https:\u002F\u002F6dd3-20-163-117-69.ngrok-free.app\u002F) |\n| [**MM-VID：利用GPT-4V（视觉）推进视频理解**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.19773.pdf) | arXiv | 2023-10-30 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FControlLLM.svg?style=social&label=Star) \u003Cbr> [**ControlLLM：通过图搜索为语言模型增强工具能力**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.17796.pdf) \u003Cbr> | arXiv | 2023-10-26 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FControlLLM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FWoodpecker.svg?style=social&label=Star) \u003Cbr> [**Woodpecker：多模态大型语言模型的幻觉纠正**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.16045.pdf) \u003Cbr> | arXiv | 2023-10-24 | [Github](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FWoodpecker) | [演示](https:\u002F\u002Fdeb6a97bae6fab67ae.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmindagent\u002Fmindagent.svg?style=social&label=Star) \u003Cbr> [**MindAgent：涌现的游戏交互**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.09971.pdf) \u003Cbr> | arXiv | 2023-09-18 | [Github](https:\u002F\u002Fgithub.com\u002Fmindagent\u002Fmindagent) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FContextualAI\u002Flens.svg?style=social&label=Star) \u003Cbr> [**迈向能“看见”的语言模型：通过自然语言之“镜”看计算机视觉**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.16410.pdf) \u003Cbr> | arXiv | 2023-06-28 | [Github](https:\u002F\u002Fgithub.com\u002FContextualAI\u002Flens) | [演示](https:\u002F\u002Flens.contextual.ai\u002F) |\n| [**检索问答：基于冻结大型语言模型的零样本视频问答**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.11732.pdf) | arXiv | 2023-06-15 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002Fassistgpt.svg?style=social&label=Star) \u003Cbr> [**AssistGPT：能够规划、执行、检查和学习的通用多模态助手**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.08640.pdf) \u003Cbr> | arXiv | 2023-06-14 | [Github](https:\u002F\u002Fgithub.com\u002Fshowlab\u002Fassistgpt) | - |\n| [**AVIS：基于大型语言模型的自主视觉信息搜索**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.08129.pdf) | arXiv | 2023-06-13 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FStevenGrove\u002FGPT4Tools.svg?style=social&label=Star) \u003Cbr> [**GPT4Tools：通过自我指导训练大型语言模型使用工具**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.18752.pdf) \u003Cbr> | arXiv | 2023-05-30 | [Github](https:\u002F\u002Fgithub.com\u002FStevenGrove\u002FGPT4Tools) | [演示](https:\u002F\u002Fc60eb7e9400930f31b.gradio.live\u002F) |\n| [**基于自然语言的心智社会中的思维风暴**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.17066.pdf) | arXiv | 2023-05-26 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fweixi-feng\u002FLayoutGPT.svg?style=social&label=Star) \u003Cbr> [**LayoutGPT：利用大型语言模型进行组合式的视觉规划与生成**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.15393.pdf) \u003Cbr> | arXiv | 2023-05-24 | [Github](https:\u002F\u002Fgithub.com\u002Fweixi-feng\u002FLayoutGPT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHxyou\u002FIdealGPT.svg?style=social&label=Star) \u003Cbr> [**IdealGPT：通过大型语言模型迭代分解视觉与语言推理**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14985.pdf) \u003Cbr> | arXiv | 2023-05-24 | [Github](https:\u002F\u002Fgithub.com\u002FHxyou\u002FIdealGPT) | 本地演示 |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmatrix-alpha\u002FAccountable-Textual-Visual-Chat.svg?style=social&label=Star) \u003Cbr> [**可问责的文本-视觉聊天学会了在图像重建中拒绝人类指令**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05983.pdf) \u003Cbr> | arXiv | 2023-05-10 | [Github](https:\u002F\u002Fgithub.com\u002Fmatrix-alpha\u002FAccountable-Textual-Visual-Chat) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fttengwang\u002FCaption-Anything.svg?style=social&label=Star) \u003Cbr> [**Caption Anything：具有多样化多模态控件的交互式图像描述**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.02677.pdf) \u003Cbr> | arXiv | 2023-05-04 | [Github](https:\u002F\u002Fgithub.com\u002Fttengwang\u002FCaption-Anything) | [演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTencentARC\u002FCaption-Anything) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flupantech\u002Fchameleon-llm.svg?style=social&label=Star) \u003Cbr> [**Chameleon：利用大型语言模型实现即插即用的组合式推理**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.09842.pdf) \u003Cbr> | arXiv | 2023-04-19 | [Github](https:\u002F\u002Fgithub.com\u002Flupantech\u002Fchameleon-llm) | [演示](https:\u002F\u002Fchameleon-llm.github.io\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FJARVIS.svg?style=social&label=Star) \u003Cbr> 
[**HuggingGPT：借助ChatGPT及其在HuggingFace中的伙伴解决AI任务**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.17580.pdf) \u003Cbr> | arXiv | 2023-03-30 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FJARVIS) | [演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FHuggingGPT) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FMM-REACT.svg?style=social&label=Star) \u003Cbr> [**MM-REACT：提示ChatGPT进行多模态推理与行动**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.11381.pdf) \u003Cbr> | arXiv | 2023-03-20 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMM-REACT) | [演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft-cognitive-service\u002Fmm-react) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcvlab-columbia\u002Fviper.svg?style=social&label=Star) \u003Cbr> [**ViperGPT：通过Python执行进行视觉推理**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.08128.pdf) \u003Cbr> | arXiv | 2023-03-14 | [Github](https:\u002F\u002Fgithub.com\u002Fcvlab-columbia\u002Fviper) | 本地演示 |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FChatCaptioner.svg?style=social&label=Star) \u003Cbr> [**ChatGPT提问，BLIP-2回答：自动提问以丰富视觉描述**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.06594.pdf) \u003Cbr> | arXiv | 2023-03-12 | [Github](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FChatCaptioner) | 本地演示 |\n| [**ICL-D3IE：利用多样化的示范更新进行文档信息抽取的上下文学习**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05063.pdf) | ICCV | 2023-03-09 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FTaskMatrix.svg?style=social&label=Star) \u003Cbr> [**视觉ChatGPT：与视觉基础模型对话、绘图和编辑**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.04671.pdf) \u003Cbr> | arXiv | 2023-03-08 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FTaskMatrix) | [演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002Fvisual_chatgpt) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZrrSkywalker\u002FCaFo.svg?style=social&label=Star) \u003Cbr> [**提示、生成并缓存：基础模型的级联使少样本学习者更强大**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.02151.pdf) \u003Cbr> | CVPR | 2023-03-03 | [Github](https:\u002F\u002Fgithub.com\u002FZrrSkywalker\u002FCaFo) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsalesforce\u002FLAVIS.svg?style=social&label=Star) \u003Cbr> [**从图像到文本提示：利用冻结大型语言模型进行零样本VQA**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10846.pdf) \u003Cbr> | CVPR | 2022-12-21 | [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fimg2llm-vqa) | [演示](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsalesforce\u002FLAVIS\u002Fblob\u002Fmain\u002Fprojects\u002Fimg2llm-vqa\u002Fimg2llm_vqa.ipynb) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvishaal27\u002FSuS-X.svg?style=social&label=Star) \u003Cbr> [**SuS-X：无需训练的语言-视觉模型仅凭名称迁移**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.16198.pdf) \u003Cbr> | arXiv | 2022-11-28 | [Github](https:\u002F\u002Fgithub.com\u002Fvishaal27\u002FSuS-X) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangyangyang127\u002FPointCLIP_V2.svg?style=social&label=Star) \u003Cbr> [**PointCLIP V2：适配CLIP以实现强大的3D开放世界学习**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.11682.pdf) \u003Cbr> | CVPR | 2022-11-21 | [Github](https:\u002F\u002Fgithub.com\u002Fyangyangyang127\u002FPointCLIP_V2) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Fvisprog.svg?style=social&label=Star) \u003Cbr> [**视觉编程：无需训练的组合式视觉推理**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FGupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf) \u003Cbr> | CVPR | 2022-11-18 | [Github](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fvisprog) | 本地演示 |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle-research\u002Fgoogle-research.svg?style=social&label=Star) \u003Cbr> [**苏格拉底模型：利用语言构建零样本多模态推理**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.00598.pdf) \u003Cbr> | arXiv | 2022-04-01 | [Github](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002Ftree\u002Fmaster\u002Fsocraticmodels) | - |\n\n\n\n## 基础模型\n|  标题   |    场所    |    日期    |    代码    |    演示    |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| [**介绍GPT-5**](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-gpt-5\u002F) | OpenAI | 2025-08-07 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideoLLaMA3.svg?style=social&label=Star) \u003Cbr> [**VideoLLaMA 3：用于图像和视频理解的前沿多模态基础模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.13106) \u003Cbr> | arXiv | 2025-01-22 | [Github](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flixin4ever\u002FVideoLLaMA3) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu3.svg?style=social&label=Star) \u003Cbr> [**Emu3：只需预测下一个token即可**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.18869) \u003Cbr> | arXiv | 2024-09-27 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu3) | 本地演示 |\n| [**Llama 3.2：通过开放、可定制的模型革新边缘AI与视觉技术**](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fllama-3-2-connect-2024-vision-edge-mobile-devices\u002F) | Meta | 2024-09-25 | - | [Demo](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-3.2-11B-Vision-Instruct) | \n| [**Pixtral-12B**](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fpixtral-12b\u002F) | Mistral | 2024-09-17 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsalesforce\u002FLAVIS.svg?style=social&label=Star) \u003Cbr> [**xGen-MM（BLIP-3）：一系列开源大型多模态模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.08872) \u003Cbr> | arXiv | 2024-08-16 | [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fxgen-mm) | - |\n| [**Llama 3模型家族**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.21783) | arXiv | 2024-07-31 | - | - |\n| [**Chameleon：混合模态早期融合基础模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.09818) | arXiv | 2024-05-16 | - | - |\n| [**你好，GPT-4o**](https:\u002F\u002Fopenai.com\u002Findex\u002Fhello-gpt-4o\u002F) | OpenAI | 2024-05-13 | - | - | \n| [**Claude 3模型家族：Opus、Sonnet、Haiku**](https:\u002F\u002Fwww-cdn.anthropic.com\u002Fde8ba9b01c9ab7cbabf5c33b80b7bbc618857627\u002FModel_Card_Claude_3.pdf) | Anthropic | 2024-03-04 | - | - |\n| [**Gemini 1.5：解锁跨越数百万个token上下文的多模态理解能力**](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_v1_5_report.pdf) | Google | 2024-02-15 | - | - |\n| [**Gemini：一个功能强大的多模态模型家族**](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_1_report.pdf) | Google | 2023-12-06 | - | - |\n| [**Fuyu-8B：面向AI代理的多模态架构**](https:\u002F\u002Fwww.adept.ai\u002Fblog\u002Ffuyu-8b) | 博客 | 2023-10-17 | 
[Huggingface](https:\u002F\u002Fhuggingface.co\u002Fadept\u002Ffuyu-8b) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fadept\u002Ffuyu-8b) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmshukor\u002FUnIVAL.svg?style=social&label=Star) \u003Cbr> [**用于图像、视频、音频和语言任务的统一模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.16184.pdf) \u003Cbr> | arXiv | 2023-07-30 | [Github](https:\u002F\u002Fgithub.com\u002Fmshukor\u002FUnIVAL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmshukor\u002FUnIVAL) |\n| [**PaLI-3视觉语言模型：更小、更快、更强**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.09199.pdf) | arXiv | 2023-10-13 | - | - |\n| [**GPT-4V（vision）系统卡片**](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002FGPTV_System_Card.pdf) | OpenAI | 2023-09-25 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjy0205\u002FLaVIT.svg?style=social&label=Star) \u003Cbr> [**在LLM中进行动态离散视觉标记化的统一语言-视觉预训练**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.04669.pdf) \u003Cbr> | arXiv | 2023-09-09 | [Github](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FLaVIT) | - |\n| [**多模态基础模型：从专家到通用助手**](https:\u002F\u002Fbrowse.arxiv.org\u002Fpdf\u002F2309.10020.pdf) | arXiv | 2023-09-18 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyiren-jian\u002FBLIText.svg?style=social&label=Star) \u003Cbr> [**通过解耦的语言预训练来启动视觉-语言学习**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.07063.pdf) \u003Cbr> | NeurIPS | 2023-07-13 | [Github](https:\u002F\u002Fgithub.com\u002Fyiren-jian\u002FBLIText) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu.svg?style=social&label=Star) \u003Cbr> [**多模态中的生成式预训练**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.05222.pdf) \u003Cbr> | arXiv | 2023-07-11 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu) | [Demo](http:\u002F\u002F218.91.113.230:9002\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social&label=Star) \u003Cbr> [**Kosmos-2：将多模态大型语言模型与世界连接起来**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14824.pdf) \u003Cbr> | arXiv | 2023-06-26 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Fkosmos-2) | [Demo](https:\u002F\u002Faka.ms\u002Fkosmos-2-demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVPGTrans\u002FVPGTrans.svg?style=social&label=Star) \u003Cbr> [**跨LLM传递视觉提示生成器**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.01278.pdf) \u003Cbr> | arXiv | 2023-05-02 | [Github](https:\u002F\u002Fgithub.com\u002FVPGTrans\u002FVPGTrans) | [Demo](https:\u002F\u002F3fc7715dbc44234a7f.gradio.live\u002F) | \n| [**GPT-4技术报告**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.08774.pdf) | arXiv | 2023-03-15 | - | - |\n| [**PaLM-E：一种具身化多模态语言模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.03378.pdf) | arXiv | 2023-03-06 | - | [Demo](https:\u002F\u002Fpalm-e.github.io\u002F#demo) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002Fprismer.svg?style=social&label=Star) \u003Cbr> [**Prismer：一种具有专家集成的视觉-语言模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.02506.pdf) \u003Cbr> | arXiv | 2023-03-04 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fprismer) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Florenmt\u002Fprismer) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social&label=Star) \u003Cbr> 
[**语言并非一切：将感知与语言模型对齐**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.14045.pdf) \u003Cbr> | arXiv | 2023-02-27 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsalesforce\u002FLAVIS.svg?style=social&label=Star) \u003Cbr> [**BLIP-2：利用冻结的图像编码器和大型语言模型启动语言-图像预训练**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12597.pdf) \u003Cbr> | arXiv | 2023-01-30 | [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fblip2) | [Demo](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsalesforce\u002FLAVIS\u002Fblob\u002Fmain\u002Fexamples\u002Fblip2_instructed_generation.ipynb) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvimalabs\u002FVIMA.svg?style=social&label=Star) \u003Cbr> [**VIMA：利用多模态提示进行通用机器人操作**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.03094.pdf) \u003Cbr> | ICML | 2022-10-06 | [Github](https:\u002F\u002Fgithub.com\u002Fvimalabs\u002FVIMA) | 本地演示 | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMineDojo\u002FMineDojo.svg?style=social&label=Star) \u003Cbr> [**MineDojo：构建具有互联网规模知识的开放式具身智能体**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08853.pdf) \u003Cbr> | NeurIPS | 2022-06-17 | [Github](https:\u002F\u002Fgithub.com\u002FMineDojo\u002FMineDojo) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshizhediao\u002FDaVinci.svg?style=social&label=Star) \u003Cbr> [**写作与绘画：生成式视觉-语言模型是统一的模态学习者**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07699.pdf) \u003Cbr> | ICLR | 2022-06-15 | [Github](https:\u002F\u002Fgithub.com\u002Fshizhediao\u002FDaVinci) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social&label=Star) \u003Cbr> [**语言模型是通用接口**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06336.pdf) \u003Cbr> | arXiv | 2022-06-13 | [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm) | - |\n\n## 评估\n|  标题   |    场所    |    日期    |    页面    |\n|:--------|:----------:|:----------:|:----------:|\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flerogo\u002FMMGenBench?style=social&label=Star) \u003Cbr> [**空间中的思考：多模态大语言模型如何感知、记忆和回忆空间**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.14171) \u003Cbr> | arXiv | 2024-12-18 | [Github](https:\u002F\u002Fgithub.com\u002Fvision-x-nyu\u002Fthinking-in-space) |\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flerogo\u002FMMGenBench?style=social&label=Star) \u003Cbr> [**MMGenBench：从文本到图像生成的角度评估多模态大模型的极限**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.14062) \u003Cbr> | arXiv | 2024-11-21 | [Github](https:\u002F\u002Fgithub.com\u002Flerogo\u002FMMGenBench) | \n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmultimodal-art-projection\u002FOmniBench?style=social&label=Star) \u003Cbr> [**OmniBench：迈向通用全能语言模型的未来**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.15272) \u003Cbr> | arXiv | 2024-09-23 | [Github](https:\u002F\u002Fgithub.com\u002Fmultimodal-art-projection\u002FOmniBench) | \n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfzhang114\u002FMME-RealWorld?style=social&label=Star) \u003Cbr> [**MME-RealWorld：你的多模态大模型能否应对连人类都难以处理的高分辨率真实场景？**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.13257) \u003Cbr> | arXiv | 2024-08-23 | [Github](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FMME-RealWorld) | \n| 
![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fguoyang9\u002FUNK-VQA?style=social&label=Star) \u003Cbr> [**UNK-VQA：一个多模态大模型的弃权能力数据集及探针**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.10942) \u003Cbr> | TPAMI | 2023-10-17 | [Github](https:\u002F\u002Fgithub.com\u002Fguoyang9\u002FUNK-VQA) |\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fchenllliang\u002FMMEvalPro?style=social&label=Star) \u003Cbr> [**MMEvalPro：校准多模态基准测试，实现可信高效的评估**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.00468) \u003Cbr> | arXiv | 2024-06-29 | [Github](https:\u002F\u002Fgithub.com\u002Fchenllliang\u002FMMEvalPro) |\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMBZUAI-LLM\u002Fweb2code?style=social&label=Star) \u003Cbr> [**Web2Code：面向多模态大模型的大规模网页转代码数据集与评估框架**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.20098) \u003Cbr> | arXiv | 2024-06-28 | [Github](https:\u002F\u002Fgithub.com\u002FMBZUAI-LLM\u002Fweb2code) |\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fprinceton-nlp\u002FCharXiv?style=social&label=Star) \u003Cbr> [**CharXiv：揭示多模态大模型在现实图表理解上的差距**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.18521) \u003Cbr> | arXiv | 2024-06-26 | [Github](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FCharXiv) |\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChartMimic\u002FChartMimic?style=social&label=Star) \u003Cbr> [**ChartMimic：通过图表到代码生成评估多模态大模型的跨模态推理能力**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.09961) \u003Cbr> | arXiv | 2024-04-15 | [Github](https:\u002F\u002Fgithub.com\u002FChartMimic\u002FChartMimic) | \n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FVideo-MME?style=social&label=Star) \u003Cbr> [**Video-MME：首个全面的多模态大模型视频分析评估基准**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.21075) \u003Cbr> | arXiv | 2024-05-31 | [Github](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME) | \n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsail-sg\u002FMMCBench?style=social&label=Star) \u003Cbr> [**针对常见干扰对大型多模态模型进行基准测试**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.11943.pdf) \u003Cbr> | NAACL | 2024-01-22 | [Github](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002FMMCBench) |\n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftsb0601\u002FMMVP?style=social&label=Star) \u003Cbr> [**睁眼瞎？探索多模态大模型的视觉缺陷**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.06209.pdf) \u003Cbr> | arXiv | 2024-01-11 | [Github](https:\u002F\u002Fgithub.com\u002Ftsb0601\u002FMMVP) | \n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models?style=social&label=Star) \u003Cbr> [**GPT-4V的挑战者？Gemini在视觉专长方面的早期探索**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.12436.pdf) \u003Cbr> | arXiv | 2023-12-19 | [Github](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models) | \n| ![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIFEG\u002FBenchLMM?style=social&label=Star) \u003Cbr> [**BenchLMM：大型多模态模型跨风格视觉能力的基准测试**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02896.pdf) \u003Cbr> | arXiv | 2023-12-05 | [Github](https:\u002F\u002Fgithub.com\u002FAIFEG\u002FBenchLMM) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUCSC-VLAA\u002Fvllm-safety-benchmark.svg?style=social&label=Star) \u003Cbr> [**这张图里有多少只独角兽？视觉大模型的安全性评估基准**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16101.pdf) \u003Cbr> | 
arXiv | 2023-11-27 | [Github](https:\u002F\u002Fgithub.com\u002FUCSC-VLAA\u002Fvllm-safety-benchmark) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjonathan-roberts1\u002Fcharting-new-territories.svg?style=social&label=Star) \u003Cbr> [**开拓新领域：探索多模态大模型的地缘与地理空间能力**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.14656.pdf) \u003Cbr> | arXiv | 2023-11-24 | [Github](https:\u002F\u002Fgithub.com\u002Fjonathan-roberts1\u002Fcharting-new-territories) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFreedomIntelligence\u002FMLLM-Bench?style=social&label=Star) \u003Cbr> [**MLLM-Bench，使用GPT-4V评估多模态大模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.13951) \u003Cbr> | arXiv | 2023-11-23 | [Github](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FMLLM-Bench) |\n| [**VLM-Eval：关于视频大语言模型的一般性评估**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.11865.pdf) | arXiv | 2023-11-20 | [即将发布]() |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgzcch\u002FBingo.svg?style=social&label=Star) \u003Cbr> [**GPT-4V(ision)中幻觉现象的综合分析：偏见与干扰挑战**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03287.pdf) \u003Cbr> | arXiv | 2023-11-06 | [Github](https:\u002F\u002Fgithub.com\u002Fgzcch\u002FBingo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPJLab-ADG\u002FGPT4V-AD-Exploration.svg?style=social&label=Star) \u003Cbr> [**与GPT-4V(ision)同行：视觉-语言模型在自动驾驶领域的早期探索**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.05332.pdf) \u003Cbr> | arXiv | 2023-11-09 | [Github](https:\u002F\u002Fgithub.com\u002FPJLab-ADG\u002FGPT4V-AD-Exploration) |\n| [**迈向通用异常检测与理解：大规模视觉-语言模型（GPT-4V）引领潮流**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.02782.pdf) | arXiv | 2023-11-05 | - |\n| [**GPT-4V在医学影像中的多模态能力综合研究**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.20381.pdf) | arXiv | 2023-10-31 | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falbertwy\u002FGPT-4V-Evaluation.svg?style=social&label=Star) \u003Cbr> [**GPT-4V(ision)的早期评估**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.16534.pdf) \u003Cbr> | arXiv | 2023-10-25 | [Github](https:\u002F\u002Fgithub.com\u002Falbertwy\u002FGPT-4V-Evaluation) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSCUT-DLVCLab\u002FGPT-4V_OCR.svg?style=social&label=Star) \u003Cbr> [**探索GPT-4V(ision)的OCR能力：一项定量且深入的评估**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.16809.pdf) \u003Cbr> | arXiv | 2023-10-25 | [Github](https:\u002F\u002Fgithub.com\u002FSCUT-DLVCLab\u002FGPT-4V_OCR) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftianyi-lab\u002FHallusionBench.svg?style=social&label=Star) \u003Cbr> [**HallusionBench：你看到的是你想到的，还是你想到的是你看到的？一个对GPT-4V(ision)、LLaVA-1.5及其他多模态模型构成挑战的图像-上下文推理基准**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.14566.pdf) \u003Cbr> | CVPR | 2023-10-23 | [Github](https:\u002F\u002Fgithub.com\u002Ftianyi-lab\u002FHallusionBench) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flupantech\u002FMathVista.svg?style=social&label=Star) \u003Cbr> [**MathVista：利用GPT-4V、Bard等大型多模态模型评估视觉情境下的数学推理能力**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.02255.pdf) \u003Cbr> | ICLR | 2023-10-03 | [Github](https:\u002F\u002Fgithub.com\u002Flupantech\u002FMathVista) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fys-zong\u002FFoolyourVLLMs.svg?style=social&label=Star) \u003Cbr> [**用极其简单的排列组合就能愚弄你的（视觉和）语言模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01651.pdf) \u003Cbr> | 
arXiv | 2023-10-02 | [Github](https:\u002F\u002Fgithub.com\u002Fys-zong\u002FFoolyourVLLMs) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmshukor\u002FEvALign-ICL.svg?style=social&label=Star) \u003Cbr> [**超越任务表现：通过上下文学习评估并减少大型多模态模型的缺陷**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00647.pdf) \u003Cbr> | arXiv | 2023-10-01 | [Github](https:\u002F\u002Fgithub.com\u002Fmshukor\u002FEvALign-ICL) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzjunlp\u002FEasyEdit.svg?style=social&label=Star) \u003Cbr> [**我们能编辑多模态大语言模型吗？**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.08475.pdf) \u003Cbr> | arXiv | 2023-10-12 | [Github](https:\u002F\u002Fgithub.com\u002Fzjunlp\u002FEasyEdit) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliaoning97\u002FREVO-LION.svg?style=social&label=Star) \u003Cbr> [**REVO-LION：评估和优化视觉-语言指令微调数据集**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.06594.pdf) \u003Cbr> | arXiv | 2023-10-10 | [Github](https:\u002F\u002Fgithub.com\u002Fliaoning97\u002FREVO-LION) |\n| [**多模态大模型的黎明：与GPT-4V(vision)的初步探索**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.17421.pdf) | arXiv | 2023-09-29 | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOFA-Sys\u002FTouchStone.svg?style=social&label=Star) \u003Cbr> [**TouchStone：用语言模型评估视觉-语言模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16890.pdf) \u003Cbr> | arXiv | 2023-08-31 | [Github](https:\u002F\u002Fgithub.com\u002FOFA-Sys\u002FTouchStone) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHYPJUDY\u002FSparkles.svg?style=social&label=Star) \u003Cbr> [**✨Sparkles：为多模态指令遵循模型解锁跨多张图片的对话能力**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16463.pdf) \u003Cbr> | arXiv | 2023-08-31 | [Github](https:\u002F\u002Fgithub.com\u002FHYPJUDY\u002FSparkles#sparkleseval) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffindalexli\u002FSciGraphQA.svg?style=social&label=Star) \u003Cbr> [**SciGraphQA：一个用于科学图谱的大规模合成多轮问答数据集**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.03349.pdf) \u003Cbr> | arXiv | 2023-08-07 | [Github](https:\u002F\u002Fgithub.com\u002Ffindalexli\u002FSciGraphQA) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FMulti-Modality-Arena.svg?style=social&label=Star) \u003Cbr> [**Tiny LVLM-eHub：与Bard的早期多模态实验**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.03729.pdf) \u003Cbr> | arXiv | 2023-08-07 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FMulti-Modality-Arena) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyuweihao\u002FMM-Vet.svg?style=social&label=Star) \u003Cbr> [**MM-Vet：评估大型多模态模型的综合能力**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.02490.pdf) \u003Cbr> | arXiv | 2023-08-04 | [Github](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMM-Vet) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FSEED-Bench.svg?style=social&label=Star) \u003Cbr> [**SEED-Bench：以生成式理解为基准评估多模态大模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.16125.pdf) \u003Cbr> | CVPR | 2023-07-30 | [Github](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FSEED-Bench) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopen-compass\u002FMMBench.svg?style=social&label=Star) \u003Cbr> [**MMBench：你的多模态模型是全能选手吗？**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.06281.pdf) \u003Cbr> | arXiv | 2023-07-12 | [Github](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FMMBench) 
|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models.svg?style=social&label=Star) \u003Cbr> [**MME：多模态大语言模型的综合评估基准**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13394.pdf) \u003Cbr> | arXiv | 2023-06-23 | [Github](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FEvaluation) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FMulti-Modality-Arena.svg?style=social&label=Star) \u003Cbr> [**LVLM-eHub：大型视觉-语言模型的综合评估基准**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.09265.pdf) \u003Cbr> | arXiv | 2023-06-15 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FMulti-Modality-Arena) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenLAMM\u002FLAMM.svg?style=social&label=Star) \u003Cbr> [**LAMM：语言辅助的多模态指令微调数据集、框架和基准**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.06687.pdf) \u003Cbr> | arXiv | 2023-06-11 | [Github](https:\u002F\u002Fgithub.com\u002FOpenLAMM\u002FLAMM#lamm-benchmark) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FM3Exam.svg?style=social&label=Star) \u003Cbr> [**M3Exam：一个多语种、多模态、多层次的基准，用于评估大型语言模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05179.pdf) \u003Cbr> | arXiv | 2023-06-08 | [Github](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FM3Exam) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuliang-Liu\u002FMultimodalOCR.svg?style=social&label=Star) \u003Cbr> [**大型多模态模型中OCR功能的隐秘奥秘**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.07895.pdf) \u003Cbr> | arXiv | 2023-05-13 | [Github](https:\u002F\u002Fgithub.com\u002FYuliang-Liu\u002FMultimodalOCR) |\n\n## 多模态RLHF\n|  标题   |    会议\u002F期刊    |    日期    |    代码    |    演示    |\n|:--------|:--------------:|:----------:|:----------:|:----------:|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfzhang114\u002Fr1_reward.svg?style=social&label=Star) \u003Cbr> [**R1-Reward：通过稳定强化学习训练多模态奖励模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.02835) \u003Cbr> | arXiv | 2025-05-09 | [Github](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002Fr1_reward) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models.svg?style=social&label=Star) \u003Cbr> [**多模态大语言模型与人类偏好对齐：综述**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.14504) \u003Cbr> | arXiv | 2025-03-23 | [Github](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FAlignment) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKwai-YuanQi\u002FMM-RLHF.svg?style=social&label=Star) \u003Cbr> [**MM-RLHF：多模态大语言模型对齐的下一步进展**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.10391) \u003Cbr> | arXiv | 2025-02-14 | [Github](https:\u002F\u002Fgithub.com\u002FKwai-YuanQi\u002FMM-RLHF) | - |\n| [**利用多轮偏好优化提升多模态大语言模型在精细准确视频字幕生成上的能力**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.06682) | arXiv | 2024-10-09 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvlf-silkie\u002FVLFeedback.svg?style=social&label=Star) \u003Cbr> [**Silkie：大型视觉语言模型的偏好蒸馏**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10665.pdf) \u003Cbr> | arXiv | 2023-12-17 | [Github](https:\u002F\u002Fgithub.com\u002Fvlf-silkie\u002FVLFeedback) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRLHF-V\u002FRLHF-V.svg?style=social&label=Star) \u003Cbr> [**RLHF-V：通过细粒度纠正性人类反馈实现行为对齐，迈向可信的多模态大语言模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00849.pdf) \u003Cbr> | arXiv | 2023-12-01 | [Github](https:\u002F\u002Fgithub.com\u002FRLHF-V\u002FRLHF-V) | [演示](http:\u002F\u002F120.92.209.146:8081\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fllava-rlhf\u002FLLaVA-RLHF.svg?style=social&label=Star) \u003Cbr> [**基于事实增强的RLHF对齐大型多模态模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.14525.pdf) \u003Cbr> | arXiv | 2023-09-25 | [Github](https:\u002F\u002Fgithub.com\u002Fllava-rlhf\u002FLLaVA-RLHF) | [演示](http:\u002F\u002Fpitt.lti.cs.cmu.edu:7890\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwangclnlp\u002FVision-LLM-Alignment.svg?style=social&label=Star) \u003Cbr> [**RoVRM：一种通过辅助文本偏好数据优化的鲁棒视觉奖励模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.12109) \u003Cbr> | arXiv | 2024-08-22 | [Github](https:\u002F\u002Fgithub.com\u002Fwangclnlp\u002FVision-LLM-Alignment) | - |\n\n## 其他\n|  标题   |    会议\u002F期刊    |    日期    |    代码    |    演示    |\n|:--------|:--------------:|:----------:|:----------:|:----------:|\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftingyu215\u002FTS-LLaVA.svg?style=social&label=Star) \u003Cbr> [**TS-LLaVA：通过缩略图采样构建视觉 token，用于免训练视频大语言模型**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.11066) \u003Cbr> | arXiv | 2024-11-17 | [Github](https:\u002F\u002Fgithub.com\u002Ftingyu215\u002FTS-LLaVA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fys-zong\u002FVLGuard.svg?style=social&label=Star) \u003Cbr> [**几乎零成本的安全微调：视觉大语言模型的基线方法**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.02207.pdf) \u003Cbr> | arXiv | 2024-02-03 | [Github](https:\u002F\u002Fgithub.com\u002Fys-zong\u002FVLGuard) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSHI-Labs\u002FVCoder.svg?style=social&label=Star) \u003Cbr> [**VCoder：多模态大语言模型的通用视觉编码器**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14233.pdf) \u003Cbr> | arXiv | 2023-12-21 | [Github](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FVCoder) | 本地演示 |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FPrompt-Highlighter.svg?style=social&label=Star) \u003Cbr> [**Prompt Highlighter：多模态大语言模型的交互式控制工具**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.04302.pdf) \u003Cbr> | arXiv | 2023-12-07 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FPrompt-Highlighter) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIlab-CVC\u002FSEED.svg?style=social&label=Star) \u003Cbr> [**在大语言模型中植入视觉“种子”**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.08041.pdf) \u003Cbr> | arXiv | 2023-07-16 | [Github](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FSEED) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuawei-noah\u002FEfficient-Computing.svg?style=social&label=Star) \u003Cbr> [**大型预训练模型能否帮助视觉模型完成感知任务？**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.00693.pdf) \u003Cbr> | arXiv | 2023-06-01 | [Github](https:\u002F\u002Fgithub.com\u002Fhuawei-noah\u002FEfficient-Computing\u002Ftree\u002Fmaster\u002FGPT4Image\u002F) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyuhangzang\u002FContextDET.svg?style=social&label=Star) \u003Cbr> 
[**利用多模态大语言模型进行上下文感知目标检测**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.18279.pdf) \u003Cbr> | arXiv | 2023-05-29 | [Github](https:\u002F\u002Fgithub.com\u002Fyuhangzang\u002FContextDET) | [演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fyuhangzang\u002FContextDet-Demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkohjingyu\u002Fgill.svg?style=social&label=Star) \u003Cbr> [**利用多模态语言模型生成图像**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.17216.pdf) \u003Cbr> | arXiv | 2023-05-26 | [Github](https:\u002F\u002Fgithub.com\u002Fkohjingyu\u002Fgill) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyunqing-me\u002FAttackVLM.svg?style=social&label=Star) \u003Cbr> [**关于评估大型视觉-语言模型的对抗鲁棒性**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.16934.pdf) \u003Cbr> | arXiv | 2023-05-26 | [Github](https:\u002F\u002Fgithub.com\u002Fyunqing-me\u002FAttackVLM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkohjingyu\u002Ffromage.svg?style=social&label=Star) \u003Cbr> [**将语言模型与图像对齐，实现多模态输入输出**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13823.pdf) \u003Cbr> | ICML | 2023-01-31 | [Github](https:\u002F\u002Fgithub.com\u002Fkohjingyu\u002Ffromage) | [演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fjykoh\u002Ffromage) |\n\n---\n\n# 优秀数据集\n\n## 对齐预训练数据集\n| 名称 | 论文 | 类型 | 模态 |\n|:-----|:-----:|:----:|:----------:|\n| **ShareGPT4Video** | [ShareGPT4Video：通过更优质的字幕提升视频理解和生成能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.04325v1) | 字幕 | 视频-文本 |\n| **COYO-700M** | [COYO-700M：图像-文本对数据集](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fcoyo-dataset\u002F) | 字幕 | 图像-文本 |\n| **ShareGPT4V** | [ShareGPT4V：通过更优质的字幕提升多模态大模型性能](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.12793.pdf) | 字幕 | 图像-文本 |\n| **AS-1B** | [全视项目：迈向开放世界的全景视觉识别与理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.01907.pdf) | 混合 | 图像-文本 |\n| **InternVid** | [InternVid：用于多模态理解和生成的大规模视频-文本数据集](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.06942.pdf) | 字幕 | 视频-文本 |\n| **MS-COCO** | [微软COCO：上下文中的常见物体](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1405.0312.pdf) | 字幕 | 图像-文本 |\n| **SBU Captions** | [Im2Text：使用100万张带字幕的照片描述图像](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2011\u002Ffile\u002F5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf) | 字幕 | 图像-文本 |\n| **Conceptual Captions** | [概念性字幕：一个经过清理、采用上位词标注的图像替代文本数据集，用于自动图像字幕生成](https:\u002F\u002Faclanthology.org\u002FP18-1238.pdf) | 字幕 | 图像-文本 |\n| **LAION-400M** | [LAION-400M：CLIP筛选后的4亿对图像-文本公开数据集](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.02114.pdf) | 字幕 | 图像-文本 |\n| **VG Captions** |[视觉图谱：利用众包密集图像标注连接语言与视觉](https:\u002F\u002Flink.springer.com\u002Fcontent\u002Fpdf\u002F10.1007\u002Fs11263-016-0981-7.pdf) | 字幕 | 图像-文本 |\n| **Flickr30k** | [Flickr30k Entities：收集区域与短语对应关系，以构建更丰富的图像到句子模型](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_iccv_2015\u002Fpapers\u002FPlummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf) | 字幕 | 图像-文本 |\n| **AI-Caps** | [AI Challenger：一个用于深入图像理解的大规模数据集](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1711.06475.pdf) | 字幕 | 图像-文本 |\n| **Wukong Captions** | [悟空：一个1亿规模的中文跨模态预训练基准数据集](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2022\u002Ffile\u002Fa90b9a09a6ee43d6631cf42e225d73b4-Paper-Datasets_and_Benchmarks.pdf) | 字幕 | 图像-文本 |\n| **GRIT** | [Kosmos-2：将多模态大型语言模型与现实世界关联起来](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14824.pdf) | 字幕 | 图像-文本-边界框 |\n| **Youku-mPLUG** | 
[优酷-mPLUG：一个1000万规模的中文视频-语言数据集，用于预训练和基准测试](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.04362.pdf) | 字幕 | 视频-文本 |\n| **MSR-VTT** | [MSR-VTT：一个大型视频描述数据集，用于连接视频与语言](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_cvpr_2016\u002Fpapers\u002FXu_MSR-VTT_A_Large_CVPR_2016_paper.pdf) | 字幕 | 视频-文本 |\n| **Webvid10M** | [Frozen in Time：用于端到端检索的联合视频和图像编码器](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.00650.pdf) | 字幕 | 视频-文本 |\n| **WavCaps** | [WavCaps：一个由ChatGPT辅助的弱标签音频字幕数据集，用于音频-语言多模态研究](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.17395.pdf) | 字幕 | 音频-文本 |\n| **AISHELL-1** | [AISHELL-1：一个开源的普通话语音语料库及语音识别基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1709.05522.pdf) | ASR | 音频-文本 |\n| **AISHELL-2** | [AISHELL-2：将普通话语音识别研究推向工业规模](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1808.10583.pdf) | ASR | 音频-文本 |\n| **VSDial-CN** | [X-LLM：将多模态视为外语，从而构建先进的大型语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04160.pdf) | ASR | 图像-音频-文本 |\n\n## 多模态指令微调数据集\n| 名称 | 论文 | 链接 | 备注 |\n|:-----|:-----:|:----:|:-----:|\n| **Inst-IT 数据集** | [Inst-IT：通过显式视觉提示指令微调提升多模态实例理解能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03565) | [链接](https:\u002F\u002Fgithub.com\u002Finst-it\u002Finst-it) | 一个包含21,000个视频和51,000张图像的细粒度多层级标注指令微调数据集 |\n| **E.T. Instruct 164K** | [E.T. Bench：迈向开放式事件级视频-语言理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.18111) | [链接](https:\u002F\u002Fgithub.com\u002FPolyU-ChenLab\u002FETBench) | 一个用于时序敏感视频理解的指令微调数据集 |\n| **MSQA** | [3D场景中的多模态情境推理](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.02389) | [链接](https:\u002F\u002Fmsr3d.github.io\u002F) | 一个大规模的3D场景多模态情境推理数据集 |\n| **MM-Evol** | [MMEvol：借助Evol-Instruct增强多模态大语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.05840) | [链接](https:\u002F\u002Fmmevol.github.io\u002Fhome_page.html) | 一个具有丰富多样性的指令数据集 |\n| **UNK-VQA** | [UNK-VQA：一个多模态大模型回避回答能力的数据集与探究](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.10942) | [链接](https:\u002F\u002Fgithub.com\u002Fguoyang9\u002FUNK-VQA) | 一个旨在训练模型对无法回答的问题保持沉默的数据集 |\n| **VEGA** | [VEGA：在视觉-语言大模型中学习交错图文理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.10228) | [链接](https:\u002F\u002Fgithub.com\u002Fzhourax\u002FVEGA) | 一个用于提升模型交错信息理解能力的数据集 |\n| **ALLaVA-4V** | [ALLaVA：利用GPT4V合成数据构建轻量级视觉-语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11684.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FFreedomIntelligence\u002FALLaVA-4V) | 由GPT4V生成的视觉与语言字幕及指令数据集 |\n| **IDK** | [视觉去幻觉指令生成：知之为知，不知为不知](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.09717.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fncsoft\u002Fidk) | 针对“I Know”幻觉的去幻觉视觉指令 |\n| **CAP2QA** | [视觉去幻觉指令生成](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.08348.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fncsoft\u002Fcap2qa) | 图像对齐的视觉指令数据集 |\n| **M3DBench** | [M3DBench：用多模态3D提示指导大模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10763.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FOpenM3D\u002FM3DBench) | 一个大规模的3D指令微调数据集 |\n| **ViP-LLaVA-Instruct** | [让大型多模态模型理解任意视觉提示](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00784.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmucai\u002FViP-LLaVA-Instruct) | LLaVA-1.5指令数据与区域级视觉提示数据的混合 |\n| **LVIS-Instruct4V** | [眼见为实：通过GPT-4V提示优化视觉指令微调](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.07574.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FX2FD\u002FLVIS-Instruct4V) | 由GPT-4V自我生成的视觉指令数据集 |\n| **ComVint** | [什么样的视觉指令才是好的？为视觉指令微调合成复杂视觉推理指令](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.01487.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FComVint#comvint-data) | 
一个用于复杂视觉推理的合成指令数据集 |\n| **SparklesDialogue** | [✨Sparkles：解锁多图像对话，赋能多模态指令遵循模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16463.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FHYPJUDY\u002FSparkles#sparklesdialogue) | 一个机器生成的对话数据集，专为跨多张图像和多轮对话的指令遵循型大语言模型设计，以增强其对话能力。 |\n| **StableLLaVA** | [StableLLaVA：利用合成图像-对话数据提升视觉指令微调效果](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.10253v1.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Ficoz69\u002FStableLLAVA) | 一种经济高效地收集视觉指令微调数据的方法 |\n| **M-HalDetect** | [检测并预防大型视觉-语言模型中的幻觉](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.06394.pdf) | [即将发布]() | 一个用于训练和评估模型幻觉检测与预防能力的数据集 |\n| **MGVLID** | [ChatSpot：通过精准指代指令微调启动多模态大语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.09474.pdf) | - | 一个高质量的指令微调数据集，包含图像-文本和区域-文本对 |\n| **BuboGPT** | [BuboGPT：在多模态大语言模型中实现视觉定位](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.08581.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmagicr\u002FBuboGPT) | 一个高质量的指令微调数据集，包含音频-文本、音频字幕以及音频-图像-文本定位数据 |\n| **SVIT** | [SVIT：扩大视觉指令微调规模](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.04087.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBAAI\u002FSVIT) | 一个大规模数据集，包含420万条富含信息的视觉指令微调数据，涵盖对话、详细描述、复杂推理和指代问答等任务 |\n| **mPLUG-DocOwl** | [mPLUG-DocOwl：模块化多模态大语言模型用于文档理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02499.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-DocOwl\u002Ftree\u002Fmain\u002FDocLLM) | 一个指令微调数据集，涵盖广泛的视觉-文本理解任务，包括无需OCR的文档理解 |\n| **PF-1M** | [使用Polite Flamingo进行视觉指令微调](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.01003.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fchendelong\u002FPF-1M\u002Ftree\u002Fmain) | 一个包含37个视觉-语言数据集的合集，其回复均由Polite Flamingo改写而成。 |\n| **ChartLlama** | [ChartLlama：用于图表理解和生成的多模态大语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16483.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flisten2you002\u002FChartLlama-Dataset) | 一个用于图表理解和生成的多模态指令微调数据集 |\n| **LLaVAR** | [LLaVAR：针对富含文本的图像理解增强视觉指令微调](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.17107.pdf) | [链接](https:\u002F\u002Fllavar.github.io\u002F#data) | 一个用于富含文本图像理解的视觉指令微调数据集 |\n| **MotionGPT** | [MotionGPT：将人体运动视为一门外语](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14795.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT) | 一个包含多项人体运动相关任务的指令微调数据集 |\n| **LRV-Instruction** | [通过稳健的指令微调缓解大型多模态模型中的幻觉问题](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14565.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FFuxiaoLiu\u002FLRV-Instruction#visual-instruction-data-lrv-instruction) | 一个用于解决幻觉问题的视觉指令微调数据集 |\n| **Macaw-LLM** | [Macaw-LLM：融合图像、音频、视频和文本的多模态语言建模](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.09093.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Flyuchenyang\u002FMacaw-LLM\u002Ftree\u002Fmain\u002Fdata) | 一个大规模的多模态指令数据集，以多轮对话形式呈现 |\n| **LAMM-Dataset** | [LAMM：语言辅助的多模态指令微调数据集、框架与基准测试](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.06687.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FOpenLAMM\u002FLAMM#lamm-dataset) | 一个全面的多模态指令微调数据集 |\n| **Video-ChatGPT** | [Video-ChatGPT：借助大型视觉和语言模型实现精细化视频理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05424.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT#video-instruction-dataset-open_file_folder) | 一个包含10万个高质量视频指令的数据集 |\n| **MIMIC-IT** | [MIMIC-IT：多模态上下文指令微调](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05425.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter\u002Fblob\u002Fmain\u002Fmimic-it\u002FREADME.md) | 多模态上下文指令微调 |\n| **M\u003Csup>3\u003C\u002Fsup>IT** | 
[M\u003Csup>3\u003C\u002Fsup>IT：迈向多模态多语言指令微调的大规模数据集](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.04387.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMMInstruction\u002FM3IT) | 一个大规模、覆盖广泛的多模态指令微调数据集 |\n| **LLaVA-Med** | [LLaVA-Med：一天内训练一个面向生物医学领域的大型语言-视觉助手](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.00890.pdf) | [即将发布](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLaVA-Med#llava-med-dataset) | 一个大规模、覆盖广泛的生物医学指令遵循数据集 |\n| **GPT4Tools** | [GPT4Tools：通过自我指令教学大语言模型使用工具](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.18752.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FStevenGrove\u002FGPT4Tools#dataset) | 工具相关的指令数据集 |\n| **MULTIS** | [ChatBridge：以大语言模型为语言催化剂连接不同模态](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.16103.pdf) | [即将发布](https:\u002F\u002Fiva-chatbridge.github.io\u002F) | 一个涵盖16种多模态任务的指令微调数据集 |\n| **DetGPT** | [DetGPT：通过推理检测你需要的东西](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14167.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FOptimalScale\u002FDetGPT\u002Ftree\u002Fmain\u002Fdataset) | 一个包含5,000张图像和约30,000组问答对的指令微调数据集 |\n| **PMC-VQA** | [PMC-VQA：用于医学视觉问答的视觉指令微调](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.10415.pdf) | [即将发布](https:\u002F\u002Fxiaoman-zhang.github.io\u002FPMC-VQA\u002F) | 一个大规模的医学视觉问答数据集 |\n| **VideoChat** | [VideoChat：以聊天为中心的视频理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.06355.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVideo\u002Ftree\u002Fmain\u002FData\u002Finstruction_data) | 一个以视频为中心的多模态指令数据集 |\n| **X-LLM** | [X-LLM：将多模态视为外语来构建先进大语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04160.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fphellonchen\u002FX-LLM) | 一个中文多模态指令微调数据集 |\n| **LMEye** | [LMEye：为大语言模型打造的交互式感知网络](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.03701.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYunxinLi\u002FMultimodal_Insturction_Data_V2) | 一个多模态指令微调数据集 |\n| **cc-sbu-align** | [MiniGPT-4：利用先进大语言模型提升视觉-语言理解能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.10592.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FVision-CAIR\u002Fcc_sbu_align) | 一个用于提高模型可用性和生成流畅性的多模态对齐数据集 |\n| **LLaVA-Instruct-150K** | [视觉指令微调](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.08485.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fliuhaotian\u002FLLaVA-Instruct-150K) | 由GPT生成的多模态指令遵循数据 |\n| **MultiInstruct** | [MultiInstruct：通过指令微调提升多模态零样本学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10773.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FVT-NLP\u002FMultiInstruct) | 第一个多模态指令微调基准数据集 |\n\n## 上下文学习数据集\n| 名称 | 论文 | 链接 | 备注 |\n|:-----|:-----:|:----:|:-----:|\n| **MIC** | [MMICL：通过多模态上下文学习增强视觉-语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.07915.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBleachNick\u002FMIC_full) | 一个手动构建的指令微调数据集，包含交错的文本-图像输入、相互关联的多张图像输入以及多模态上下文学习输入。 |\n| **MIMIC-IT** | [MIMIC-IT：多模态上下文指令微调](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05425.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter\u002Fblob\u002Fmain\u002Fmimic-it\u002FREADME.md) | 多模态上下文指令数据集 |\n\n## 多模态思维链数据集\n| 名称 | 论文 | 链接 | 备注 |\n|:-----|:-----:|:----:|:-----:|\n| **EMER** | [可解释的多模态情感推理](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15401.pdf) | [即将发布](https:\u002F\u002Fgithub.com\u002FzeroQiaoba\u002FExplainable-Multimodal-Emotion-Reasoning) | 用于可解释情感推理任务的基准数据集 |\n| **EgoCOT** | [EmbodiedGPT：通过具身思维链进行视觉-语言预训练](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.15021.pdf) | 
[即将发布](https:\u002F\u002Fgithub.com\u002FEmbodiedGPT\u002FEmbodiedGPT_Pytorch) | 大规模具身规划数据集 |\n| **VIP** | [逐帧思考：利用视频补全与预测评估视频思维链](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13903.pdf) | [即将发布]() | 可用于评估VideoCOT的推理时数据集 |\n| **ScienceQA** | [学会解释：基于思维链的多模态推理在科学问答中的应用](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2022\u002Ffile\u002F11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Flupantech\u002FScienceQA#ghost-download-the-dataset) | 大规模选择题数据集，包含多模态科学问题和多样化的领域 |\n\n## 多模态RLHF数据集\n| 名称 | 论文 | 链接 | 备注 |\n|:-----|:-----:|:----:|:-----:|\n| **VLFeedback** | [Silkie：大型视觉-语言模型的偏好蒸馏](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10665.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMMInstruction\u002FVLFeedback) | 由AI标注的视觉-语言反馈数据集 |\n\n## 评估基准\n| 名称 | 论文 | 链接 | 备注 |\n|:-----|:-----:|:----:|:-----:|\n| **Inst-IT Bench** | [Inst-IT: 通过显式视觉提示指令微调提升多模态实例理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03565) | [链接](https:\u002F\u002Fgithub.com\u002Finst-it\u002Finst-it) | 用于评估图像和视频中细粒度实例级理解的基准 |\n| **M\u003Csup>3\u003C\u002Fsup>CoT** | [M\u003Csup>3\u003C\u002Fsup>CoT: 一种新型的多领域、多步骤、多模态思维链基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.16473) | [链接](https:\u002F\u002Fgithub.com\u002FLightChen233\u002FM3CoT) | 用于多模态思维链的多领域、多步骤基准 |\n| **MMGenBench** | [MMGenBench: 从文本到图像生成的角度评估大型多模态模型的极限](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.14062) | [链接](https:\u002F\u002Fgithub.com\u002Flerogo\u002FMMGenBench) | 一个衡量给定图像生成图像描述提示性能的基准 |\n| **MiCEval** | [MiCEval: 通过图像描述和推理步骤揭示多模态思维链的质量](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.14668) | [链接](https:\u002F\u002Fgithub.com\u002Falenai97\u002FMiCEval) | 用于评估多模态LLM推理能力的多模态思维链基准 |\n| **LiveXiv** | [LiveXiv -- 基于Arxiv论文内容的多模态实时基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.10783) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FLiveXiv\u002FLiveXiv) | 基于Arxiv论文的实时基准 |\n| **TemporalBench** | [TemporalBench: 为多模态视频模型评估细粒度时间理解能力的基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.10818) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmicrosoft\u002FTemporalBench) | 用于评估细粒度时间理解能力的基准 |\n| **OmniBench** | [OmniBench: 通往通用全语言模型未来之路](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.15272) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fm-a-p\u002FOmniBench) | 一个评估模型同时处理视觉、听觉和文本输入能力的基准 |\n| **MME-RealWorld** | [MME-RealWorld: 您的多模态大模型能否应对对人类来说也极具挑战性的高分辨率真实场景？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.13257) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyifanzhang114\u002FMME-RealWorld) | 一个包含真实生活场景的高难度基准 |\n| **VELOCITI** | [VELOCITI: 视频-语言模型能否在时间维度上绑定语义概念？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.10889) | [链接](https:\u002F\u002Fgithub.com\u002Fkatha-ai\u002FVELOCITI) | 一个评估感知和绑定能力的视频基准 |\n| **MMR** | [看得清楚，答得错误：用于评估多模态大模型在诱导性问题上的理解和鲁棒性基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.10638.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FMultimodal-Robustness-Benchmark) | 一个用于衡量多模态大模型理解能力和对诱导性问题鲁棒性的基准 |\n| **CharXiv** | [CharXiv: 揭示多模态大模型在现实图表理解方面的差距](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.18521) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fprinceton-nlp\u002FCharXiv) | 由人类专家策划的图表理解基准 |\n| **Video-MME** | [Video-MME: 首个全面评估多模态大模型视频分析能力的基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.21075) | [链接](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME) | 一个全面评估多模态大模型视频分析能力的基准 |\n| **VL-ICL Bench** | [VL-ICL Bench: 
多模态上下文学习评估中的细节陷阱](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.13164.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fys-zong\u002FVL-ICL) | 一个涵盖广泛任务的多模态上下文学习评估基准 |\n| **TempCompass** | [TempCompass: 视频大模型真的能理解视频吗？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.00476.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fllyx97\u002FTempCompass) | 一个评估视频大模型时间感知能力的基准 |\n| **GVLQA** | [GITA: 图到视觉与文本的融合，用于视觉-语言图推理](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.02130) | [链接](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FYanbin99\u002Fgvlqa-datasets-65c705c9488606617e246bd3) | 一个评估图推理能力的基准 |\n| **CoBSAT** | [多模态大模型能否进行文本到图像的上下文学习？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.01293.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyzeng58\u002FCoBSAT) | 一个用于文本到图像上下文学习的基准 |\n| **VQAv2-IDK** | [视觉去幻觉指令生成：知道自己不知道什么](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.09717.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fncsoft\u002Fidk) | 一个用于评估“我知道”型视觉幻觉的基准 |\n| **Math-Vision** | [使用MATH-Vision数据集衡量多模态数学推理能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.14804.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fmathvision-cuhk\u002FMathVision) | 一个多样化的数学推理基准 |\n| **SciMMIR** | [SciMMIR: 科学领域多模态信息检索评估基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.13478) | [链接](https:\u002F\u002Fgithub.com\u002FWusiwei0410\u002FSciMMIR) | 一个用于科学领域多模态信息检索的基准 |\n| **CMMMU** | [CMMMU: 中国大规模跨学科多模态理解基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.11944.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FCMMMU-Benchmark\u002FCMMMU) | 一个涉及多学科推理和知识的中文基准 |\n| **MMCBench** | [针对常见扰动对大型多模态模型进行基准测试](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.11943.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002FMMCBench) | 一个用于检验模型在常见扰动下自我一致性的基准 |\n| **MMVP** | [睁眼瞎？探索多模态大模型的视觉缺陷](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.06209.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Ftsb0601\u002FMMVP) | 一个评估视觉能力的基准 |\n| **TimeIT** | [TimeChat: 一款面向长视频理解的时间敏感型多模态大语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02051.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FShuhuaiRen\u002FTimeIT) | 一个带有时间戳标注的视频指令微调数据集，覆盖多种时间敏感的视频理解任务。 |\n| **ViP-Bench** | [让大型多模态模型理解任意视觉提示](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00784.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmucai\u002FViP-Bench) | 一个用于视觉提示的基准 |\n| **M3DBench** | [M3DBench: 让我们用多模态3D提示来指导大型模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10763.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FOpenM3D\u002FM3DBench) | 一个以3D为中心的基准 |\n| **Video-Bench** | [Video-Bench: 一个全面的基准和工具包，用于评估基于视频的大语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16103.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-Bench) | 一个用于视频MLLM评估的基准 |\n| **Charting-New-Territories** | [开拓新领域：探索多模态大模型的地缘和地理空间能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.14656.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fjonathan-roberts1\u002Fcharting-new-territories) | 一个用于评估地缘和地理空间能力的基准 |\n| **MLLM-Bench** | [MLLM-Bench，使用GPT-4V评估多模态大模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.13951.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FMLLM-Bench) | 基于逐样本标准的GPT-4V评估 |\n| **BenchLMM** | [BenchLMM: 基准测试大型多模态模型的跨风格视觉能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02896.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAIFEG\u002FBenchLMM) | 一个评估模型对不同图像风格鲁棒性的基准 |\n| **MMC-Benchmark** | [MMC: 通过大规模指令微调推进多模态图表理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.10774.pdf) | 
[链接](https:\u002F\u002Fgithub.com\u002FFuxiaoLiu\u002FMMC) | 一个全面的人工标注基准，包含多个评估图表推理能力的任务 |\n| **MVBench** | [MVBench: 一个全面的多模态视频理解基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17005.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything\u002Fblob\u002Fmain\u002Fvideo_chat2\u002FMVBENCH.md) | 一个用于视频理解的综合性多模态基准 |\n| **Bingo** | [GPT-4V(ision)中幻觉的整体分析：偏见与干扰挑战](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03287.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fgzcch\u002FBingo) | 一个专注于两种常见类型的幻觉评估基准 |\n| **MagnifierBench** | [OtterHD: 一款高分辨率多模态模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04219.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOtter-AI\u002FMagnifierBench) | 一个旨在探测模型细粒度感知能力的基准 |\n| **HallusionBench** | [HallusionBench: 你看到的是你想到的，还是你想到的是你看到的？一个对GPT-4V(ision)、LLaVA-1.5及其他多模态模型具有挑战性的图像-上下文推理基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.14566.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Ftianyi-lab\u002FHallusionBench) | 一个用于评估幻觉的图像-上下文推理基准 |\n| **PCA-EVAL** | [通过多模态大语言模型实现端到端具身决策：与GPT4-Vision及其他模型的探索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.02071.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fpkunlp-icler\u002FPCA-EVAL) | 一个用于评估多领域具身决策的基准 |\n| **MMHal-Bench** | [通过事实增强的RLHF对齐大型多模态模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.14525.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FShengcao1006\u002FMMHal-Bench) | 一个用于幻觉评估的基准 |\n| **MathVista** | [MathVista: 使用GPT-4V、Bard及其他大型多模态模型评估视觉情境下的数学推理能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.02255.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAI4Math\u002FMathVista) | 一个同时挑战视觉和数学推理能力的基准 |\n| **SparklesEval** | [✨Sparkles: 解锁多张图片间的对话，适用于多模态指令遵循模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16463.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FHYPJUDY\u002FSparkles#sparkleseval) | 一个基于GPT的基准，依据三个不同标准定量评估模型在多张图片和多轮对话中的会话能力。 |\n| **ISEKAI** | [多模态大模型的链接-上下文学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.07891.pdf) | [链接](https:\u002F\u002Fhuggingface.co\u002FISEKAI-Portal) | 一个仅由未见过的生成图像-标签对组成的基准，专为链接-上下文学习设计。 |\n| **M-HalDetect** | [检测并预防大型视觉-语言模型中的幻觉](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.06394.pdf) | [即将推出]() | 一个用于训练和评估模型幻觉检测与预防能力的数据集 |\n| **I4** | [赋能视觉-语言模型执行交错的视觉-语言指令](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.04152.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FDCDmllm\u002FCheetah) | 一个全面评估模型在复杂交错视觉-语言指令下指令跟随能力的基准 |\n| **SciGraphQA** | [SciGraphQA: 一个大规模的合成多轮问答数据集，用于科学图表](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.03349.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Ffindalexli\u002FSciGraphQA#data) | 一个大规模的图表-视觉问答数据集 |\n| **MM-Vet** | [MM-Vet: 评估大型多模态模型的综合能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.02490.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMM-Vet) | 一个考察大型多模态模型在复杂多模态任务中表现的评估基准 |\n| **SEED-Bench** | [SEED-Bench: 以生成式理解为基准评估多模态大模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.16125.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FSEED-Bench) | 一个用于评估多模态大模型生成式理解能力的基准 |\n| **MMBench** | [MMBench: 您的多模态模型是全能选手吗？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.06281.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FMMBench) | 一个系统化设计的客观基准，用于稳健地评估视觉-语言模型的各项能力 |\n| **Lynx** | [使用多模态输入训练GPT4风格语言模型的关键是什么？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02469.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Flynx-llm#prepare-data) | 一个包含图像和视频任务的全面评估基准 |\n| **GAVIE** | [通过稳健的指令微调减轻大型多模态模型的幻觉](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.14565.pdf) | 
[链接](https:\u002F\u002Fgithub.com\u002FFuxiaoLiu\u002FLRV-Instruction#evaluationgavie) | 一个用于评估幻觉和指令跟随能力的基准 |\n| **MME** | [MME: 一个多模态大语言模型的全面评估基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13394.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FEvaluation) | 一个全面的多模态大模型评估基准 |\n| **LVLM-eHub** | [LVLM-eHub: 一个全面的大型视觉-语言模型评估基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.09265.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FMulti-Modality-Arena) | 一个用于MLLM评估的平台 |\n| **LAMM-Benchmark** | [LAMM: 语言辅助的多模态指令微调数据集、框架和基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.06687.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FOpenLAMM\u002FLAMM#lamm-benchmark) | 一个用于评估多模态大模型在各种2D\u002F3D视觉任务中量化表现的基准 |\n| **M3Exam** | [M3Exam: 一个多语言、多模态、多层次的基准，用于评估大型语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05179.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FM3Exam) | 一个用于评估多模态大模型的多语言、多模态、多层次基准 |\n| **OwlEval** | [mPLUG-Owl: 模块化使大型语言模型具备多模态能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.14178.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-Owl\u002Ftree\u002Fmain\u002FOwlEval) | 一个用于评估多种能力的数据集 |\n\n## 其他\n| 名称 | 论文 | 链接 | 备注 |\n|:-----|:-----:|:----:|:-----:|\n| **IMAD** | [IMAD: 基于图像增强的多模态对话](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.10512.pdf) | [链接](https:\u002F\u002Fgithub.com\u002FVityaVitalich\u002FIMAD) | 多模态对话数据集|\n| **Video-ChatGPT** | [Video-ChatGPT: 基于大型视觉与语言模型实现详细视频理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05424.pdf) | [链接](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT#quantitative-evaluation-bar_chart) | 一个用于视频对话模型的定量评估框架 |\n| **CLEVR-ATVC** | [可问责的文本-视觉聊天模型学习拒绝人类指令以进行图像重建](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05983.pdf) | [链接](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1TqBzkyqxOSg1hgCXF8JjpYIAuRV-uVft) | 一个用于学习拒绝指令的合成多模态微调数据集 |\n| **Fruit-ATVC** | [可问责的文本-视觉聊天模型学习拒绝人类指令以进行图像重建](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05983.pdf) | [链接](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1Saaia2rRRb1nz5sKdmpzYdS4jHiMDaP0) | 一个手工拍摄的多模态微调数据集，用于学习拒绝指令 |\n| **InfoSeek** | [预训练的视觉与语言模型能否回答视觉信息检索问题？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11713.pdf) | [链接](https:\u002F\u002Fopen-vision-language.github.io\u002Finfoseek\u002F) | 一个专注于提出信息检索型问题的VQA数据集 |\n| **OVEN** | [开放域视觉实体识别：迈向识别数百万个维基百科实体](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.11154.pdf) | [链接](https:\u002F\u002Fopen-vision-language.github.io\u002Foven\u002F) | 一个专注于从自然场景图像中识别维基百科视觉实体的数据集 |","# Awesome-Multimodal-Large-Language-Models 快速上手指南\n\n本项目并非单一的可执行软件，而是一个**多模态大语言模型（MLLM）的开源资源汇总库**。它主要提供最新的论文列表、数据集链接、评测基准（如 MME 系列）以及相关模型（如 VITA 系列）的代码仓库索引。\n\n本指南将指导你如何利用该仓库获取资源，并以其中核心的 **MME 评测工具** 和 **VITA 模型** 为例，演示如何搭建环境与运行。\n\n## 1. 环境准备\n\n由于本项目涵盖多个子项目，建议根据你具体想要运行的模型或评测任务准备环境。以下以通用的深度学习环境和 MME 评测工具为例。\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+) 或 macOS\n*   **硬件要求**: \n    *   运行评测脚本：CPU 即可，或任意 NVIDIA GPU。\n    *   部署\u002F微调模型（如 VITA, Qwen-VL）：建议 NVIDIA GPU (显存 ≥ 24GB 用于大模型推理)。\n*   **前置依赖**:\n    *   Python >= 3.8\n    *   Git\n    *   CUDA Toolkit (如需运行模型)\n    *   Conda (推荐用于环境管理)\n\n## 2. 
安装步骤\n\n### 第一步：克隆仓库\n首先获取资源列表和相关工具代码。国内用户建议使用 Gitee 镜像（如有）或通过代理加速 GitHub 访问。\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models.git\ncd Awesome-Multimodal-Large-Language-Models\n```\n\n### 第二步：配置评测工具环境 (以 MME Benchmark 为例)\n如果你需要使用项目中提供的 **MME 评测工具** 来评估模型性能，请进入对应目录并安装依赖。\n\n```bash\n# 进入评测工具目录\ncd Evaluation\u002Ftools\n\n# 解压评测工具包 (如果尚未解压)\nunzip eval_tool.zip\n\n# 创建并激活虚拟环境\nconda create -n mme_eval python=3.9 -y\nconda activate mme_eval\n\n# 安装基础依赖 (根据具体模型需求，通常包括 torch, transformers 等)\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\npip install pandas tqdm pillow\n```\n\n> **💡 国内加速提示**：推荐使用清华或阿里镜像源安装 Python 包，以提升下载速度。\n> ```bash\n> pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n### 第三步：获取特定模型代码 (以 VITA 为例)\n如果你想运行项目中推荐的 **VITA** 系列多模态模型，需要单独克隆其官方仓库。\n\n```bash\n# 返回上级目录或直接在新位置克隆\ncd ..\u002F..\ngit clone https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA.git\ncd VITA\n\n# 安装 VITA 模型依赖\nconda create -n vita python=3.10 -y\nconda activate vita\npip install -e .\n```\n\n## 3. 基本使用\n\n### 场景一：使用 MME 工具评估模型\n假设你已经有了一个多模态模型的输出结果（JSON 格式），可以使用 MME 工具进行打分。\n\n1.  **准备数据**：确保你的模型生成结果符合 MME 格式要求。\n2.  **运行评估**：\n    ```bash\n    cd Evaluation\u002Ftools\n    # 示例命令：计算 MME 分数 (需替换实际路径)\n    python eval_tool.py --result_path .\u002Fyour_model_results.json --output_path .\u002Fmme_score.txt\n    ```\n    *注：具体参数请参考 `Evaluation\u002Ftools` 目录下的 README 或脚本帮助信息。*\n\n### 场景二：运行 VITA 模型进行推理\n在配置好 `vita` 环境后，你可以加载预训练权重进行简单的图文对话。\n\n```bash\ncd VITA\n\n# 启动交互式 Demo (需提前下载权重并配置 config)\npython demo\u002Fapp.py --model-path .\u002Fcheckpoints\u002Fvita-7b\n\n# 或在命令行直接运行推理脚本\npython inference.py \\\n    --model-path .\u002Fcheckpoints\u002Fvita-7b \\\n    --image-path .\u002Fimages\u002Fexample.jpg \\\n    --query \"Please describe this image in detail.\"\n```\n\n### 场景三：查阅最新论文与数据集\n作为资源库，最核心的用法是查阅 `README.md` 中的表格。\n\n*   **查找论文**：在根目录打开 `README.md`，搜索 \"Awesome Papers\" 章节，按类别（如 `Multimodal Instruction Tuning`, `Evaluation`）查找最新 arXiv 论文链接。\n*   **查找数据集**：滚动至 \"Awesome Datasets\" 章节，获取预训练、指令微调或评测基准（如 Video-MME）的 HuggingFace 下载链接。\n\n---\n**提示**：由于该仓库更新极快（包含 2025-2026 年的前沿工作），具体模型的运行命令请以各子项目（如 `VITA`, `Qwen`, `InternVL`）官方仓库的最新说明为准。","某自动驾驶研发团队急需评估最新多模态大模型在复杂路况视频理解与实时交互方面的能力，以决定下一代车载系统的技术选型。\n\n### 没有 Awesome-Multimodal-Large-Language-Models 时\n- **调研效率低下**：研究人员需在 arXiv 和 GitHub 上手动搜索分散的论文与代码，难以区分哪些是真正的 SOTA（最先进）模型，哪些只是早期实验。\n- **评测标准缺失**：缺乏统一的基准测试集，团队不得不自行构建简单的视频问答数据集，导致评估结果无法与业界主流水平横向对比。\n- **技术盲区明显**：容易忽略如 VITA 系列这类支持“看听说做”并发交互的前沿开源项目，错失实现类 GPT-4o 实时语音视觉交互的机会。\n- **场景覆盖不足**：现有的内部测试仅关注静态图像，无法验证模型在高分辨率真实世界场景（如恶劣天气、复杂路口）下的鲁棒性。\n\n### 使用 Awesome-Multimodal-Large-Language-Models 后\n- **一站式获取前沿成果**：直接通过该仓库的综述和分类列表，快速定位到 NeurIPS 2025  highlight 的 VITA-1.5 等关键模型，将技术调研时间从数周缩短至几天。\n- **引入权威评测基准**：直接复用 MME、Video-MME-v2 及 MME-RealWorld 等专业基准数据集与评估工具，确保模型性能评估具备行业公信力。\n- **解锁全模态交互能力**：基于仓库指引集成 VITA-E 或 VITA-Audio，迅速验证了车辆在行驶中同时处理视觉信号与语音指令的可行性。\n- **覆盖极端真实场景**：利用 MME-RealWorld 数据集挑战高分辨率难点场景，提前发现模型在人类都难以判断的复杂路况中的潜在缺陷。\n\nAwesome-Multimodal-Large-Language-Models 不仅消除了信息不对称，更为团队提供了从理论调研到落地评测的全链路权威指南，极大加速了多模态技术的工程化进程。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBradyFU_Awesome-Multimodal-Large-Language-Models_eb439a35.png","BradyFU","Chaoyou Fu","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FBradyFU_a0790f20.png","南京大学-研究员|助理教授|博导-中国科协青年人才托举工程|中科院院长特别奖  Lead MME & VITA & Awesome-MLLM","MiG-NJU (Multimodal intelligence 
Group, 南京大学米格小组)",null,"https:\u002F\u002Fbradyfu.github.io\u002F","https:\u002F\u002Fgithub.com\u002FBradyFU",17662,1126,"2026-04-17T12:29:08","","未说明",{"notes":90,"python":88,"dependencies":91},"该仓库（Awesome-Multimodal-Large-Language-Models）是一个多模态大模型（MLLM）的论文、数据集和基准测试的汇总列表（Awesome List），本身不是一个可独立运行的软件工具或模型框架，因此 README 中未包含具体的操作系统、GPU、内存、Python 版本或依赖库等运行环境需求。用户若需运行列表中提到的具体模型（如 VITA, Qwen, InternVL 等），需前往各模型对应的独立项目仓库查看其特定的环境配置要求。",[],[54,15],[94,95,96,97,98,99,100,101,102,103,104,105,106],"instruction-tuning","instruction-following","large-vision-language-model","visual-instruction-tuning","multi-modality","in-context-learning","large-language-models","large-vision-language-models","multimodal-chain-of-thought","multimodal-in-context-learning","multimodal-instruction-tuning","multimodal-large-language-models","chain-of-thought","2026-03-27T02:49:30.150509","2026-04-18T11:10:07.454509",[110,115,120,125,130,135],{"id":111,"question_zh":112,"answer_zh":113,"source_url":114},38926,"在 MME 基准测试中，'poster'（海报）任务具体是指什么？","该任务指的是识别电影海报对应的是哪一部电影。","https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Fissues\u002F30",{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},38927,"下载 MME 地标（landmark）图像时大量失败，是否有替代方案或修复方法？","可以通过修改 `download_landmark.py` 脚本来解决。主要改进是增加了对 URL 解码的处理以及更健壮的错误捕获机制。参考代码如下：\n```python\nimport os\nimport argparse\nimport pandas as pd\nimport urllib.request\nfrom tqdm import tqdm\nimport requests\nimport urllib.parse\n\ndef download_url_images(csv_file, output_folder, extension='.jpg'):\n    df = pd.read_csv(csv_file)\n    ids = df['id'].tolist()\n    urls = df['url'].tolist()\n    for i, url in enumerate(urls):\n        if not os.path.exists(output_folder):\n            os.makedirs(output_folder)\n        save_path = output_folder + str(ids[i]) + extension\n        if not os.path.isfile(save_path):\n            try:\n                urllib.request.urlretrieve(url, save_path)\n            except:\n                try:\n                    # 尝试对 URL 进行解码后再次下载\n                    url = urllib.parse.unquote(url)\n                    urllib.request.urlretrieve(url, save_path)\n                except:\n                    print(f\"Failed to download: {url}\")\n```","https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Fissues\u002F80",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},38928,"为什么在列表中找不到 Octopus 模型的论文或代码？","Octopus 的论文当时仍在准备中，尚未公开发布。如果需要联系作者或获取最新进展，可以留下微信号（WeChat ID），维护者可以帮忙介绍作者。此外，项目曾提供过在线 Demo 供体验。","https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Fissues\u002F59",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},38929,"如何申请访问该项目的数据集？如果发送邮件没有收到回复怎么办？","通常需要通过项目中列出的管理员邮箱（如 guilinli@stu.xmu.edu.cn）发送申请邮件。如果长时间未收到回复，很可能是邮件被误判到了垃圾邮件文件夹（Spam Folder）。建议检查垃圾邮件箱，或者重新发送一封邮件并说明情况，管理员核实后会尽快处理。","https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Fissues\u002F97",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},38930,"如果想让自己的多模态大模型相关工作被收录到这个仓库，应该怎么做？","可以通过提交 Issue 的方式请求添加。在 Issue 中提供项目的标题、论文链接（arXiv 等）、代码仓库链接、主页地址以及简要介绍。维护者审核通过后会将工作添加到相应的分类（如 Evaluation 或 Model 列表）中。同时，维护者可能会建议在引用该综述时使用他们的相关论文。","https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Fissues\u002F173",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},38931,"ViP-LLaVA 和 Video-LLaVA 等新模型是否已被收录？状态如何？","是的，这些工作（包括 ViP-LLaVA 和 Video-LLaVA）已经被添加到仓库中。对于已发表的会议论文（如 
CVPR），维护者也会更新其状态。用户在引用时，也可以考虑引用该仓库相关的综述论文（如 'A Survey on Multimodal Large Language Models'）和基准测试论文（如 'MME'）。","https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Fissues\u002F117",[]]