[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-yunlong10--Awesome-LLMs-for-Video-Understanding":3,"tool-yunlong10--Awesome-LLMs-for-Video-Understanding":64},[4,17,27,36,44,52],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160015,2,"2026-04-18T11:30:52",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 
是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[13,26,14,35],"视频",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":10,"last_commit_at":42,"category_tags":43,"status":16},8553,"spec-kit","github\u002Fspec-kit","Spec Kit 是一款专为提升软件开发效率而设计的开源工具包，旨在帮助团队快速落地“规格驱动开发”（Spec-Driven Development）模式。传统开发中，需求文档往往与代码实现脱节，导致沟通成本高且结果不可控；而 Spec Kit 通过将规格说明书转化为可执行的指令，让 AI 直接依据明确的业务场景生成高质量代码，从而减少从零开始的随意编码，确保产出结果的可预测性。\n\n该工具特别适合希望利用 AI 辅助编程的开发者、技术负责人及初创团队。无论是启动全新项目还是在现有工程中引入规范化流程，用户只需通过简单的命令行操作，即可初始化项目并集成主流的 AI 编程助手。其核心技术亮点在于“规格即代码”的理念，支持社区扩展与预设模板，允许用户根据特定技术栈定制开发流程。此外，Spec Kit 强调官方维护的安全性，提供稳定的版本管理，帮助开发者在享受 AI 红利的同时，依然牢牢掌握架构设计的主动权，真正实现从“凭感觉写代码”到“按规格建系统”的转变。",88749,"2026-04-17T09:48:14",[15,26,14,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 
提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":53,"name":54,"github_repo":55,"description_zh":56,"stars":57,"difficulty_score":10,"last_commit_at":58,"category_tags":59,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85267,"2026-04-18T11:00:28",[26,60,35,61,14,62,15,13,63],"数据工具","插件","其他","音频",{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":79,"stars":83,"forks":84,"last_commit_at":85,"license":79,"difficulty_score":86,"env_os":87,"env_gpu":88,"env_ram":88,"env_deps":89,"category_tags":92,"github_topics":79,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":93,"updated_at":94,"faqs":95,"releases":126},9238,"yunlong10\u002FAwesome-LLMs-for-Video-Understanding","Awesome-LLMs-for-Video-Understanding","🔥🔥🔥 [IEEE TCSVT] Latest Papers, Codes and Datasets on Vid-LLMs.","Awesome-LLMs-for-Video-Understanding 是一个专注于视频理解与大语言模型（Vid-LLMs）前沿技术的开源资源库。它系统性地整理了该领域最新的学术论文、代码实现、数据集及评测基准，旨在解决研究人员和开发者在面对海量且快速迭代的 Vid-LLM 文献时，难以高效获取核心信息和构建完整知识体系的痛点。\n\n该项目不仅提供了一份被 IEEE TCSVT 接收的权威综述论文，还持续更新包含上百个主流模型和十余个新基准的详细列表。其独特亮点在于提出了一套基于视频表示和 LLM 
功能的全新分类法，并从任务粒度与语言参与度等维度对视频理解任务进行了重新梳理，帮助使用者更清晰地把握技术演进脉络。此外，资源库还深入探讨了训练策略及跨领域应用，为后续研究提供了坚实的理论基础与实践参考。\n\n无论是从事多模态算法研究的学者，还是希望将视频分析能力融入产品的工程师，都能从中找到极具价值的指引。通过这一平台，用户可以快速定位所需的技术方案，跟踪最新的研究动态，从而加速在智能视频分析领域的创新与落地。","# Awesome-LLMs-for-Video-Understanding [![Awesome](https:\u002F\u002Fawesome.re\u002Fbadge.svg)](https:\u002F\u002Fawesome.re)\n\n### 🔥🔥🔥 [Video Understanding with Large Language Models: A Survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17432)\n\n> *[Yunlong Tang](https:\u002F\u002Fyunlong10.github.io\u002F)\u003Csup>1\u003C\u002Fsup>, [Jing Bi](https:\u002F\u002Fjing-bi.github.io\u002F)\u003Csup>1\u003C\u002Fsup>, [Siting Xu](https:\u002F\u002Fsai-01.github.io\u002F)\u003Csup>2\u003C\u002Fsup>, [Luchuan Song](https:\u002F\u002Fsongluchuan.github.io\u002F)\u003Csup>1\u003C\u002Fsup>, [Susan Liang](https:\u002F\u002Fliangsusan-git.github.io\u002F)\u003Csup>1\u003C\u002Fsup> , [Teng Wang](http:\u002F\u002Fttengwang.com\u002F)\u003Csup>2,3\u003C\u002Fsup> , [Daoan Zhang](https:\u002F\u002Fdwan.ch\u002F)\u003Csup>1\u003C\u002Fsup> , [Jie An](https:\u002F\u002Fpkuanjie.com\u002F)\u003Csup>1\u003C\u002Fsup> , [Jingyang Lin](https:\u002F\u002Fjylin.me\u002F)\u003Csup>1\u003C\u002Fsup> , [Rongyi Zhu](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Frongyi-zhu-41b15124a\u002F)\u003Csup>1\u003C\u002Fsup> , [Ali Vosoughi](https:\u002F\u002Falivosoughi.com\u002F)\u003Csup>1\u003C\u002Fsup> , [Chao Huang](https:\u002F\u002Fwikichao.github.io\u002F)\u003Csup>1\u003C\u002Fsup> , [Zeliang Zhang](https:\u002F\u002Fzhangaipi.github.io\u002F)\u003Csup>1\u003C\u002Fsup> , [Pinxin Liu](https:\u002F\u002Fandypinxinliu.github.io\u002F)\u003Csup>1\u003C\u002Fsup> , [Mingqian Feng](https:\u002F\u002Ffmmarkmq.github.io\u002F)\u003Csup>1\u003C\u002Fsup> , [Feng Zheng](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=PcmyXHMAAAAJ)\u003Csup>2\u003C\u002Fsup> , [Jianguo Zhang](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=ypSmZtIAAAAJ)\u003Csup>2\u003C\u002Fsup> 
, [Ping Luo](http:\u002F\u002Fluoping.me\u002F)\u003Csup>3\u003C\u002Fsup> , [Jiebo Luo](https:\u002F\u002Fwww.cs.rochester.edu\u002Fu\u002Fjluo\u002F)\u003Csup>1\u003C\u002Fsup>, [Chenliang Xu](https:\u002F\u002Fwww.cs.rochester.edu\u002F~cxu22\u002Findex.html)\u003Csup>1\u003C\u002Fsup>.*\n\n> *\u003Csup>1\u003C\u002Fsup>University of Rochester, \u003Csup>2\u003C\u002Fsup>Southern University of Science and Technology, \u003Csup>3\u003C\u002Fsup>The University of Hong Kong*\n\n\u003Ch5 align=\"center\">  \n\n **[Paper](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10982110)** | **[arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17432)** | **[Project Page](https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding)**\n\n\u003C\u002Fh5>\n\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyunlong10_Awesome-LLMs-for-Video-Understanding_readme_d95a7c20c08f.png)\n\n\n## 📢 News\n[10\u002F06\u002F2025]\n\n🔥 Our follow-up work—[Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models](https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-Video-LMM-Post-Training)—is now available on [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.05034) and [Hugging Face Papers](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2510.05034)!\n\n\n[05\u002F04\u002F2025]\n\n🌟 Our Vid-LLM survey has been accepted to the IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)!\n👉 [IEEE Xplore](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10982110) \\| [GitHub](https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding)\n\n[07\u002F23\u002F2024]\n\n📢 We've recently updated our survey: “Video Understanding with Large Language Models: A Survey”!\n\n✨ This comprehensive survey covers video understanding techniques powered by large language models (Vid-LLMs), training strategies, relevant tasks, datasets, benchmarks, and evaluation methods, and 
discusses the applications of Vid-LLMs across various domains.\n\n🚀 **What's New in This Update**:\n\u003Cbr>✅ Updated to include around 100 additional Vid-LLMs and 15 new benchmarks as of June 2024.\n\u003Cbr>✅ Introduced a novel taxonomy for Vid-LLMs based on video representation and LLM functionality.\n\u003Cbr>✅ Added a Preliminary chapter, reclassifying video understanding tasks from the perspectives of granularity and language involvement, and enhanced the LLM Background section.\n\u003Cbr>✅ Added a new Training Strategies chapter, removing adapters as a factor for model classification.\n\u003Cbr>✅ All figures and tables have been redesigned.\n\nMultiple minor updates will follow this major update, and the GitHub repository will be updated gradually. We welcome your reading and feedback ❤️\n\n\u003Cfont size=5>\u003Ccenter>\u003Cb> Table of Contents \u003C\u002Fb> \u003C\u002Fcenter>\u003C\u002Ffont>\n- [Awesome-LLMs-for-Video-Understanding ](#awesome-llms-for-video-understanding-)\n    - [🔥🔥🔥 Video Understanding with Large Language Models: A Survey](#-video-understanding-with-large-language-models-a-survey)\n  - [Why we need Vid-LLMs?](#why-we-need-vid-llms)\n  - [😎 Vid-LLMs: Models](#-vid-llms-models)\n    - [📑 Citation](#-citation)\n      - [🗒️ Taxonomy 1](#️-taxonomy-1)\n        - [🕹️ Video Analyzer × LLM](#️-video-analyzer--llm)\n          - [LLM as Summarizer](#llm-as-summarizer)\n          - [LLM as Manager](#llm-as-manager)\n        - [👾 Video Embedder × LLM](#-video-embedder--llm)\n          - [LLM as Text Decoder](#llm-as-text-decoder)\n          - [LLM as Regressor](#llm-as-regressor)\n          - [LLM as Hidden Layer](#llm-as-hidden-layer)\n        - [🧭 (Analyzer + Embedder) × LLM](#-analyzer--embedder--llm)\n          - [LLM as Manager](#llm-as-manager-1)\n          - [LLM as Summarizer](#llm-as-summarizer-1)\n          - [LLM as Text Decoder](#llm-as-text-decoder-1)\n          - [LLM as Regressor](#llm-as-regressor-1)\n          - [LLM 
as Hidden Layer](#llm-as-hidden-layer-1)\n      - [🗒️ Taxonomy 2](#️-taxonomy-2)\n        - [🤖 LLM-based Video Agents](#-llm-based-video-agents)\n        - [🎥 Vid-LLM Pretraining](#-vid-llm-pretraining)\n        - [👀 Vid-LLM Instruction Tuning](#-vid-llm-instruction-tuning)\n          - [Fine-tuning with Connective Adapters](#fine-tuning-with-connective-adapters)\n          - [Fine-tuning with Insertive Adapters](#fine-tuning-with-insertive-adapters)\n          - [Fine-tuning with Hybrid Adapters](#fine-tuning-with-hybrid-adapters)\n        - [🦾 Hybrid Methods](#-hybrid-methods)\n        - [Training-free Methods](#-training-free-methods)\n  - [Tasks, Datasets, and Benchmarks](#tasks-datasets-and-benchmarks)\n      - [Recognition and Anticipation](#recognition-and-anticipation)\n      - [Captioning and Description](#captioning-and-description)\n      - [Grounding and Retrieval](#grounding-and-retrieval)\n      - [Question Answering](#question-answering)\n      - [Video Instruction Tuning](#video-instruction-tuning)\n        - [Pretraining Dataset](#pretraining-dataset)\n        - [Fine-tuning Dataset](#fine-tuning-dataset)\n      - [Video-based Large Language Models Benchmark](#video-based-large-language-models-benchmark)\n  - [Contributing](#contributing)\n    - [🌟 Star History](#-star-history)\n    - [♥️ Contributors](#️-contributors)\n\n## Why we need Vid-LLMs?\n\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyunlong10_Awesome-LLMs-for-Video-Understanding_readme_4ce7b1bfa126.png)\n\n\n## 😎 Vid-LLMs: Models \n\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyunlong10_Awesome-LLMs-for-Video-Understanding_readme_b30792fe303a.png)\n\n### 📑 Citation\n\nIf you find our survey useful for your research, please cite the following paper:\n\n```bibtex\n@article{vidllmsurvey,\n  author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and 
Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Liu, Pinxin and Feng, Mingqian and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},\n  journal={IEEE Transactions on Circuits and Systems for Video Technology}, \n  title={Video Understanding with Large Language Models: A Survey}, \n  year={2025},\n  doi={10.1109\u002FTCSVT.2025.3566695}\n}\n```\n\n### 🗒️ Taxonomy 1\n\n#### 🕹️ Video Analyzer × LLM\n\n##### LLM as Summarizer\n| Title                                                        |        Model        |  Date   |                             Code                             | Venue |\n| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |\n| [**Seeing the Unseen: Visual Metaphor Captioning for Videos**](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2406.04886v1) |   GIT-LLaVA   | 06\u002F2024 |      [code]()       | arXiv |\n| [**Zero-shot long-form video understanding through screenplay**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.17309) |   MM-Screenplayer   | 06\u002F2024 |      [project page]()       | CVPR |\n| [**MoReVQA exploring modular reasoning models for video question answering**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.06511) |   MoReVQA   | 04\u002F2024 |      [project page]()       | CVPR |\n| [**An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.18406) |   IG-VLM   | 03\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002Fimagegridworth\u002FIG-VLM)       | arXiv |\n| [**Language repository for long video understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.14622) |   LangRepo   | 03\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002Fkkahatapitiya\u002FLangRepo)       | arXiv |\n| [**Understanding long videos in one multimodal language model 
pass**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.16998) |   MVU   | 03\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002Fkahnchana\u002Fmvu)       | arXiv |\n| [**Video ReCap recursive captioning of hour-long videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.13250) |   Video ReCap   | 02\u002F2024 |      [code](https:\u002F\u002Fsites.google.com\u002Fview\u002Fvidrecap)       | CVPR |\n| [**A Simple LLM Framework for Long-Range Video Question-Answering**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17235) |   LLoVi   | 12\u002F2023 |      [code](https:\u002F\u002Fgithub.com\u002FCeeZh\u002FLLoVi)       | arXiv |\n| [**Grounding-prompter prompting LLM with multimodal information for temporal sentence grounding in long videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17117) |   Grounding-prompter   | 12\u002F2023 |      [code]()       | arXiv |\n| [**Learning object state changes in videos an open-world perspective**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.11782) |   VIDOSC   | 12\u002F2023 |      [code]()       | CVPR |\n| [**AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16368) |   AntGPT   | 07\u002F2023 |      [code](https:\u002F\u002Fbrown-palm.github.io\u002FAntGPT)       | ICLR |\n| [**VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18500v1)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftxh-mercury\u002Fvast?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ftxh-mercury\u002Fvast) |  VAST   | 05\u002F2023 |    [code](https:\u002F\u002Fgithub.com\u002Ftxh-mercury\u002Fvast)     | NeurIPS |\n| [**VLog: Video as a Long 
Document**](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FVLog)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002FVLog.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FVLog) |        VLog         | 04\u002F2023 |    [code](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTencentARC\u002FVLog)     |   -   |\n| [**Learning Video Representations from Large Language Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.04501)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Flavila?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flavila) | LaViLa  | 12\u002F2022 | [code](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flavila) |  CVPR   |\n\n##### LLM as Manager\n| Title                                                        |        Model        |  Date   |                             Code                             | Venue |\n| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |\n| [**DrVideo: Document Retrieval Based Long Video Understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.12846) |   DrVideo   | 06\u002F2024 |      [code]()       | arXiv |\n| [**OmAgent a multi-modal agent framework for complex video understanding with task divide-and-conquer**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16620) |   OmAgent   | 06\u002F2024 |      [code]()       | arXiv |\n| [**Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09396) |   LVNet   | 06\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002Fjongwoopark7978\u002FLVNet)       | arXiv |\n| [**VideoTree adaptive tree-based video representation for LLM reasoning on long videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.19209) |   VideoTree   | 05\u002F2024 |     
 [code](https:\u002F\u002Fvideotree2024.github.io\u002F)       | arXiv |\n| [**Harnessing Large Language Models for Training-free Video Anomaly Detection**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.01014) |   LAVAD   | 04\u002F2024 |      [code](https:\u002F\u002Flucazanella.github.io\u002Flavad\u002F)       | CVPR |\n| [**TraveLER a multi-LMM agent framework for video question-answering**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.01476) |   TraveLER   | 04\u002F2024 |      [code]()       | arXiv |\n| [**GPTSee enhancing moment retrieval and highlight detection via description-based similarity features**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.01437) |   GPTSee   | 03\u002F2024 |      [code]()       | arXiv |\n| [**Reframe anything LLM agent for open world video reframing**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.06070) |   RAVA   | 03\u002F2024 |      [code]()       | arXiv |\n| [**SCHEMA state CHangEs MAtter for procedure planning in instructional videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.01599) |   SCHEMA   | 03\u002F2024 |      [code]()       | ICLR |\n| [**TV-TREES multimodal entailment trees for neuro-symbolic video reasoning**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.19467) |   TV-TREES   | 02\u002F2024 |      [code]()       | arXiv |\n| [**VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.11481) | VideoAgent  | 03\u002F2024 |[project page](https:\u002F\u002Fvideoagent.github.io\u002F)| arXiv |\n| [**VideoAgent long-form video understanding with large language model as agent**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.10517) |   VideoAgent   | 03\u002F2024 |      [code]()       | arXiv |\n| [**VURF a general-purpose reasoning and self-refinement framework for video understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.14743) |   VURF   | 03\u002F2024 |      [code]()       | arXiv |\n| [**Why not use your textbook 
knowledge-enhanced procedure planning of instructional videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.02782) |   KEPP   | 03\u002F2024 |      [code]()       | CVPR |\n| [**DoraemonGPT toward understanding dynamic scenes with large language models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.08392) |   DoraemonGPT   | 01\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FDoraemonGPT)       | arXiv |\n| [**LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.05269) |   LifelongMemory   | 12\u002F2023 |      [code](https:\u002F\u002Fgithub.com\u002FAgentic-Learning-AI-Lab\u002Flifelong-memory)       | arXiv |\n| [**Zero-Shot Video Question Answering with Procedural Programs**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.00937) |   ProViQ   | 12\u002F2023 |      [code](https:\u002F\u002Frccchoudhury.github.io\u002Fproviq2023)       | arXiv |\n| [**AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.08640) |   AssistGPT   | 06\u002F2023 |      [code](https:\u002F\u002Fshowlab.github.io\u002Fassistgpt\u002F)       | arXiv |\n| [**ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.14407) |      ChatVideo      | 04\u002F2023 |    [project page](https:\u002F\u002Fwww.wangjunke.info\u002FChatVideo\u002F)     | arXiv |\n| [**Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.04227)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FChatCaptioner.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FChatCaptioner\u002Ftree\u002Fmain\u002FVideo_ChatCaptioner) | Video ChatCaptioner | 04\u002F2023 | 
[code](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FChatCaptioner\u002Ftree\u002Fmain\u002FVideo_ChatCaptioner) | arXiv |\n| [**ViperGPT: Visual Inference via Python Execution for Reasoning**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.08128) |   ViperGPT   | 03\u002F2023 |      [code](https:\u002F\u002Fviper.cs.columbia.edu\u002F)       | arXiv |\n| [**Hawk: Learning to Understand Open-World Video Anomalies**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16886) |   Hawk   | 05\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002Fjqtangust\u002Fhawk)       | arXiv |\n\n#### 👾 Video Embedder × LLM\n\n##### LLM as Text Decoder\n\n| Title                                                        |        Model        |  Date   |                             Code                             | Venue |\n| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |\n| [**AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03051) |   AuroraCap   | 10\u002F2024 |      [project page](https:\u002F\u002Frese1f.github.io\u002Faurora-web\u002F)       | arXiv |\n| [**Artemis towards referential understanding in complex videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.00258) |   Artemis   | 06\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002Fqiujihao19\u002FArtemis)       | arXiv |\n| [**EmoLLM multimodal emotional understanding meets large language models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16442) |   EmoLLM   | 06\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002Fyan9qu\u002FEmoLLM)       | arXiv |\n| [**Fewer tokens and fewer videos extending video understanding abilities in large vision-language models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.08024) |   FTFV-LLM   | 06\u002F2024 |      -      | arXiv |\n| [**Flash-VStream: Memory-Based 
Real-Time Understanding for Long Video Streams**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.08085) |   Flash-VStream   | 06\u002F2024 |      [code](https:\u002F\u002Finvinciblewyq.github.io\u002Fvstream-page\u002F)       | arXiv |\n| [**LLAVIDAL benchmarking large language vision models for daily activities of living**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09390) |   LLAVIDAL   | 06\u002F2024 |      [code](https:\u002F\u002Fadl-x.github.io\u002F)       | arXiv |\n| [**Long context transfer from language to vision**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16852) |   LongVA   | 06\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002FLongVA)       | arXiv |\n| [**ShareGPT4Video improving video understanding and generation with better captions**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.04325) |   ShareGPT4Video   | 06\u002F2024 |      [code](https:\u002F\u002Fsharegpt4video.github.io\u002F)       | arXiv |\n| [**Towards event-oriented long video understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.14129) |   VIM   | 06\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FEvent-Bench)       | arXiv |\n| [**Video-SALMONN speech-enhanced audio-visual large language models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.15704) |   Video-SALMONN   | 06\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FSALMONN\u002F)       | ICML |\n| [**VideoGPT+ integrating image and video encoders for enhanced video understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09418) |   VideoGPT+   | 06\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideoGPT-plus)       | arXiv |\n| [**VideoLLaMA 2 advancing spatial-temporal modeling and audio understanding in video-LLMs**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07476) |   VideoLLaMA 2   | 06\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2)       | 
arXiv |\n| [**MotionLLM: Understanding Human Behaviors from Human Motions and Videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.20340) |   MotionLLM   | 05\u002F2024 |      [project page](https:\u002F\u002Flhchen.top\u002FMotionLLM)       | arXiv |\n| [**MVBench: A Comprehensive Multi-modal Video Understanding Benchmark**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17005) |   VideoChat2   | 11\u002F2023 |      [code](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything)       | CVPR |\n| [**Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.20648) |   Shotluck Holmes   | 05\u002F2024 |      -      | arXiv |\n| [**Streaming long video understanding with large language models**](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2405.16009) |   VideoStreaming   | 05\u002F2024 |      -       | arXiv |\n| [**Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline**](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2405.14040v1) |   VideoNarrator   | 05\u002F2024 |      -       | arXiv |\n| [**TOPA extend large language models for video understanding via text-only pre-alignment**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.13911) |   TOPA   | 05\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002Fdhg-wei\u002FTOPA)       | NeurIPS |\n| [**MovieChat+: Question-aware Sparse Memory for Long Video Question Answering**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.17176) |   MovieChat+   | 04\u002F2024 |      [code](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat)       | arXiv |\n| [**AutoAD III: The Prequel – Back to the Pixels**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14412) |   AutoAD III   | 04\u002F2024 |      [project page](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fresearch\u002Fautoad\u002F)       | CVPR |\n| [**Direct Preference Optimization of Video Large Multimodal 
Models from Language Model Reward**](https://arxiv.org/abs/2404.01258) |   LLaVA-Hound-DPO   | 04/2024 |      [code](https://github.com/RifleZhang/LLaVA-Hound-DPO)       | arXiv |
| [**From image to video, what do we need in multimodal LLMs**](https://arxiv.org/abs/2404.11865) |   RED-VILLM   | 04/2024 |      -       | arXiv |
| [**Koala key frame-conditioned long video-LLM**](https://arxiv.org/abs/2404.04346) |   Koala   | 04/2024 |      [project page](https://cs-people.bu.edu/rxtan/projects/Koala)       | CVPR |
| [**LongVLM efficient long video understanding via large language models**](https://arxiv.org/abs/2404.03384) |   LongVLM   | 04/2024 |      [code](https://github.com/ziplab/LongVLM)       | ECCV |
| [**MA-LMM memory-augmented large multimodal model for long-term video understanding**](https://arxiv.org/abs/2404.05726) |   MA-LMM   | 04/2024 |      [code](https://boheumd.github.io/MA-LMM/)       | CVPR |
| [**MiniGPT4-video advancing multimodal LLMs for video understanding with interleaved visual-textual tokens**](https://arxiv.org/abs/2404.03413) |   MiniGPT4-Video   | 04/2024 |      [code](https://vision-cair.github.io/MiniGPT4-video/)       | arXiv |
| [**Pegasus-v1 technical report**](https://arxiv.org/abs/2404.14687) |   Pegasus-v1   | 04/2024 |      -       | arXiv |
| [**PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning**](https://arxiv.org/abs/2404.16994) |   PLLaVA   | 04/2024 |      [code](https://pllava.github.io/)       | arXiv |
| [**ST-LLM: Large Language Models Are Effective Temporal Learners**](https://arxiv.org/abs/2404.00308) |   ST-LLM   | 04/2024 |      [code](https://github.com/TencentARC/ST-LLM)       | arXiv |
| [**Tarsier recipes for training and evaluating large video description models**](https://arxiv.org/abs/2407.00634) |   Tarsier   | 07/2024 |      [code](https://github.com/bytedance/tarsier)       | arXiv |
| [**X-VARS introducing explainability in football refereeing with multi-modal large language model**](https://arxiv.org/abs/2404.06332) |   X-VARS   | 04/2024 |      -       | arXiv |
| [**CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios**](https://arxiv.org/abs/2403.04640) |   CAT   | 03/2024 |      [code](https://github.com/rikeilong/Bay-CAT)       | arXiv |
| [**InternVideo2 scaling video foundation models for multimodal video understanding**](https://arxiv.org/abs/2403.15377) |   InternVideo2   | 03/2024 |      [code](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2)       | ECCV |
| [**MovieLLM enhancing long video understanding with AI-generated movies**](https://arxiv.org/abs/2403.01422) |   MovieLLM   | 03/2024 |      [code](https://deaddawn.github.io/MovieLLM/)       | arXiv |
| [**LLMs meet long video advancing long video comprehension with an interactive visual adapter in LLMs**](https://arxiv.org/abs/2402.13546) |   IVAwithLLM   | 02/2024 |      -       | arXiv |
| [**LSTP language-guided spatial-temporal prompt learning for long-form video-text understanding**](https://arxiv.org/abs/2402.16050) |   LSTP   | 02/2024 |      [code](https://github.com/bigai-nlco/VideoTGB)       | EMNLP |
| [**LVCHAT facilitating long video comprehension**](https://arxiv.org/abs/2402.12079) |   LVCHAT   | 02/2024 |      [code](https://github.com/wangyu-ustc/LVChat)       | arXiv |
| [**OSCaR: Object State Captioning and State Change Representation**](https://arxiv.org/abs/2402.17128) |   OSCaR   | 02/2024 |      [code](https://github.com/nguyennm1024/OSCaR)       | NAACL |
| [**Slot-VLM SlowFast slots for video-language modeling**](https://arxiv.org/abs/2402.13088) |   Slot-VLM   | 02/2024 |      -       | arXiv |
| [**COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training**](https://arxiv.org/abs/2401.00849) |   COSMO   | 01/2024 |      [code](http://fingerrec.github.io/cosmo)       | arXiv |
| [**Weakly supervised gaussian contrastive grounding with large multimodal models for video question answering**](https://arxiv.org/abs/2401.10711) |   GCG   | 01/2024 |      [code](https://github.com/WHB139426/GCG)       | ACMMM |
| [**Audio-Visual LLM for Video Understanding**](https://arxiv.org/abs/2312.06720) |   AV-LLM   | 12/2023 |      -       | arXiv |
| [**Generative Multimodal Models are In-Context Learners**](https://arxiv.org/abs/2312.13286) |   Emu2   | 12/2023 |      [project page](https://baaivision.github.io/emu2)       | CVPR |
| [**MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples**](https://arxiv.org/abs/2312.06363) |   MMICT   | 12/2023 |      [code](https://github.com/KDEGroup/MMICT)       | TOMM |
| [**VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding**](https://arxiv.org/abs/2312.02310) |   VaQuitA   | 12/2023 |      -       | arXiv |
| [**VILA: On Pre-training for Visual Language Models**](https://arxiv.org/abs/2312.07533) |   VILA   | 12/2023 |      [code](https://github.com/NVlabs/VILA)       | CVPR |
| [**Vista-LLaMA reliable video narrator via equal distance to visual tokens**](https://arxiv.org/abs/2312.08870) |   Vista-LLaMA   | 12/2023 |      [project page](https://jinxxian.github.io/Vista-LLaMA)       | arXiv |
| [**Chat-UniVi unified visual representation empowers large language models with image and video understanding**](https://arxiv.org/abs/2311.08046) |   Chat-UniVi   | 11/2023 |      [code](https://github.com/PKU-YuanGroup/Chat-UniVi)       | CVPR |
| [**LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models**](https://arxiv.org/abs/2311.17043) |   LLaMA-VID   | 11/2023 |      [code](https://github.com/dvlab-research/LLaMA-VID)       | arXiv |
| [**Video-LLaVA learning united visual representation by alignment before projection**](https://arxiv.org/abs/2311.10122) |   Video-LLaVA   | 11/2023 |      [code](https://github.com/PKU-YuanGroup/Video-LLaVA)       | arXiv |
| [**Large Language Models are Temporal and Causal Reasoners for Video Question Answering**](https://arxiv.org/abs/2310.15747) |   LLaMA-VQA   | 10/2023 |      [code](https://github.com/mlvlab/Flipped-VQA)       | EMNLP |
| [**MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**](https://arxiv.org/abs/2307.16449) |   MovieChat   | 07/2023 |      [code](https://rese1f.github.io/MovieChat/)       | CVPR |
| [**LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning**](https://arxiv.org/abs/2306.10354) |   LLMVA-GEBC   | 06/2023 |      [code](https://github.com/zjr2000/LLMVA-GEBC)       | CVPR |
| [**Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration**](https://arxiv.org/abs/2306.09093) |   Macaw-LLM   | 06/2023 |      [project page](https://github.com/lyuchenyang/Macaw-LLM)       | arXiv |
| [**Valley: Video Assistant with Large Language model Enhanced abilitY**](https://arxiv.org/abs/2306.07207) |   VALLEY   | 06/2023 |      -       | arXiv |
| [**Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models**](https://arxiv.org/abs/2306.05424) |   Video-ChatGPT   | 06/2023 |      [code](https://github.com/mbzuai-oryx/Video-ChatGPT)       | ACL |
| [**Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding**](https://arxiv.org/abs/2306.02858) |   Video-LLaMA   | 06/2023 |      [code](https://github.com/DAMO-NLP-SG/Video-LLaMA)       | EMNLP |
| [**Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks**](https://arxiv.org/abs/2306.04362) |   mPLUG-video   | 06/2023 |      [code](https://github.com/X-PLUG/Youku-mPLUG)       | arXiv |
| [**ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst**](https://arxiv.org/abs/2305.16103) |   ChatBridge   | 05/2023 |      [code](https://iva-chatbridge.github.io)       | arXiv |
| [**Otter: A Multi-Modal Model with In-Context Instruction Tuning**](https://arxiv.org/abs/2305.03726) |   Otter   | 05/2023 |      [code](https://github.com/Luodian/Otter)       | arXiv |
| [**VideoLLM: Modeling Video Sequence with Large Language Models**](https://arxiv.org/abs/2305.13292) |   VideoLLM   | 05/2023 |      [code](https://github.com/cg1177/VideoLLM)       | arXiv |
| [**One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory**](https://arxiv.org/abs/2505.23617) |   -   | 05/2025 |      [code](https://github.com/RAIVNLab/trajvit)       | ICCV |


##### LLM as Regressor

<!-- 
| [**title**](link) |   model   | date |      [code](link)       | venue |
 -->
| Title                                                        |        Model        |  Date   |                             Code                             | Venue |
| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |
| [**LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval**](https://arxiv.org/pdf/2411.14505) |   LLaVA-MR   | 11/2024 |      -       | arXiv |
| [**Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM**](https://arxiv.org/abs/2406.12235) |   Holmes-VAD   | 06/2024 |      [code](https://holmesvad.github.io/)       | arXiv |
| [**VideoLLM-online online video large language model for streaming video**](https://arxiv.org/abs/2406.11816) |   VideoLLM-online   | 06/2024 |      [code](https://showlab.github.io/videollm-online)       | CVPR |
| [**HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision**](https://arxiv.org/abs/2404.09933) |   VLM4HOI   | 04/2024 |      [project page](https://sid2697.github.io/hoi-ref/)       | arXiv |
| [**V2Xum-LLM cross-modal video summarization with temporal prompt instruction tuning**](https://arxiv.org/abs/2404.12353) |   V2Xum-LLaMA   | 04/2024 |      [code](https://hanghuacs.github.io/v2xum/)       | arXiv |
| [**AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue**](https://arxiv.org/abs/2403.16276) |   AVicuna   | 03/2024 |      -       | arXiv |
| [**Elysium exploring object-level perception in videos via MLLM**](https://arxiv.org/abs/2403.16558) |   Elysium   | 03/2024 |      [code](https://github.com/Hon-Wong/Elysium)       | arXiv |
| [**HawkEye training video-text LLMs for grounding text in videos**](https://arxiv.org/abs/2403.10228) |   HawkEye   | 03/2024 |      [code](https://github.com/yellow-binary-tree/HawkEye)       | arXiv |
| [**LITA language instructed temporal-localization assistant**](https://arxiv.org/abs/2403.19046) |   LITA   | 03/2024 |      [code](https://github.com/NVlabs/LITA)       | arXiv |
| [**OmniViD: A Generative Framework for Universal Video Understanding**](https://arxiv.org/abs/2403.17935) |   OmniViD   | 03/2024 |      [code](https://github.com/wangjk666/OmniVid)       | CVPR |
| [**GroundingGPT: Language Enhanced Multi-modal Grounding Model**](https://arxiv.org/abs/2401.06071) |   GroundingGPT   | 01/2024 |      [code](https://github.com/lzw-lzw/GroundingGPT)       | arXiv |
| [**TimeChat a time-sensitive multimodal large language model for long video understanding**](https://arxiv.org/abs/2312.02051) |   TimeChat   | 12/2023 |      [code](https://github.com/RenShuhuai-Andy/TimeChat)       | CVPR |
| [**Self-Chained Image-Language Model for Video Localization and Question Answering**](https://arxiv.org/abs/2305.06988) |   SeViLA   | 11/2023 |      [code](https://github.com/Yui010206/SeViLA)       | NeurIPS |
| [**VTimeLLM: Empower LLM to Grasp Video Moments**](https://arxiv.org/abs/2311.18445) |   VTimeLLM   | 11/2023 |      [code](https://github.com/huangb23/VTimeLLM)       | arXiv |

##### LLM as Hidden Layer

| Title                                                        |        Model        |  Date   |                             Code                             | Venue |
| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |
| [**VTG-LLM integrating timestamp knowledge into video LLMs for enhanced video temporal grounding**](https://arxiv.org/abs/2405.13382) |   VTG-LLM   | 05/2024 |      [code](https://github.com/gyxxyg/VTG-LLM)       | arXiv |
| [**VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing**](https://haofei.vip/downloads/papers/Skywork_Vitron_2024.pdf) |   VITRON   | 04/2024 |      [project page](https://vitron-llm.github.io/)       | NeurIPS |
| [**VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT**](https://arxiv.org/abs/2403.02076) |   VTG-GPT   | 03/2024 |      [code](https://github.com/YoucanBaby/VTG-GPT)       | arXiv |
| [**Momentor advancing video large language model with fine-grained temporal reasoning**](https://arxiv.org/abs/2402.11435) |   Momentor   | 02/2024 |      [code](https://github.com/DCDmllm/Momentor)       | ICML |
| [**Detours for navigating instructional videos**](https://arxiv.org/abs/2401.01823) |   VidDetours   | 01/2024 |      -       | CVPR |
| [**OneLLM: One Framework to Align All Modalities with Language**](https://arxiv.org/abs/2312.03700) |   OneLLM   | 12/2023 |      [code](https://github.com/csuhan/OneLLM)       | arXiv |
| [**GPT4Video a unified multimodal large language model for instruction-followed understanding and safety-aware generation**](https://arxiv.org/abs/2311.16511) |   GPT4Video   | 11/2023 |      [code](https://gpt4video.github.io)       | ACMMM |

#### 🧭 (Analyzer + Embedder) × LLM

##### LLM as Manager

| Title                                                        |        Model        |  Date   |                             Code                             | Venue |
| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |
| [**MM-VID: Advancing Video Understanding with GPT-4V(ision)**](https://arxiv.org/abs/2310.19773) |       MM-VID        | 10/2023 |                              -                               | arXiv |

##### LLM as Summarizer

| Title                                                        |        Model        |  Date   |                             Code                             | Venue |
| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |
| [**Shot2Story20K a new benchmark for comprehensive understanding of multi-shot videos**](https://arxiv.org/abs/2312.10300) |   SUM-shot   | 12/2023 |      [code](https://mingfei.info/shot2story/)       | arXiv |

##### LLM as Regressor

| Title                                                        |        Model        |  Date   |                             Code                             | Venue |
| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |
| [**Vript: A Video Is Worth Thousands of Words**](https://arxiv.org/abs/2406.06040) |   Vriptor   | 06/2024 |      [code](https://github.com/mutonix/Vript)       | NeurIPS |
| [**Merlin: Empowering Multimodal LLMs with Foresight Minds**](https://arxiv.org/abs/2312.00589) |   Merlin   | 12/2023 |      [project page](https://ahnsun.github.io/merlin)       | ECCV |
| [**VideoChat: Chat-Centric Video Understanding**](https://arxiv.org/abs/2305.06355) |   VideoChat   | 05/2023 |      [code](https://github.com/OpenGVLab/Ask-Anything)       | arXiv |
| [**Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning**](https://arxiv.org/abs/2302.14115) |   Vid2Seq   | 02/2023 |      [code](https://antoyang.github.io/vid2seq.html)       | CVPR |

##### LLM as Text Decoder

| Title                                                        |        Model        |  Date   |                             Code                             | Venue |
| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |
| [**Contextual AD Narration with Interleaved Multimodal Sequence**](https://arxiv.org/abs/2403.12922) |   Uni-AD   | 03/2024 |      [code](https://github.com/MCG-NJU/Uni-AD)       | arXiv |
| [**MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning**](https://arxiv.org/abs/2311.17435) |   MM-Narrator   | 11/2023 |      [project page](https://mm-narrator.github.io/)       | arXiv |
| [**Vamos: Versatile Action Models for Video Understanding**](https://arxiv.org/abs/2311.13627) |   Vamos   | 11/2023 |      [project page](https://brown-palm.github.io/Vamos/)       | ECCV |
| [**AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description**](https://arxiv.org/abs/2310.06838) |   AutoAD II   | 10/2023 |      [project page](https://www.robots.ox.ac.uk/vgg/research/autoad/)       | ICCV |

##### LLM as Hidden Layer

| Title                                                        |        Model        |  Date   |                             Code                             | Venue |
| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |
| [**PG-Video-LLaVA: Pixel Grounding Large Video-Language Models**](https://arxiv.org/abs/2311.13435v2)[![Star](https://img.shields.io/github/stars/mbzuai-oryx/video-llava.svg?style=social&label=Star)](https://github.com/mbzuai-oryx/video-llava) |   PG-Video-LLaVA    | 11/2023 |      [code](https://github.com/mbzuai-oryx/video-llava)      | arXiv |

### 🗒️ Taxonomy 2

#### 🤖 LLM-based Video Agents

| Title                                                        |        Model        |  Date   |                             Code                             | Venue |
| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |
| [**Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language**](https://arxiv.org/abs/2204.00598) |   Socratic Models   | 04/2022 |      [project page](https://socraticmodels.github.io/)       | arXiv |
| [**Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions**](https://arxiv.org/abs/2304.04227)[![Star](https://img.shields.io/github/stars/Vision-CAIR/ChatCaptioner.svg?style=social&label=Star)](https://github.com/Vision-CAIR/ChatCaptioner/tree/main/Video_ChatCaptioner) | Video ChatCaptioner | 04/2023 | [code](https://github.com/Vision-CAIR/ChatCaptioner/tree/main/Video_ChatCaptioner) | arXiv |
| [**VLog: Video as a Long Document**](https://github.com/showlab/VLog)[![Star](https://img.shields.io/github/stars/showlab/VLog.svg?style=social&label=Star)](https://github.com/showlab/VLog) |        VLog         | 04/2023 |    [code](https://huggingface.co/spaces/TencentARC/VLog)     |   -   |
| [**ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System**](https://arxiv.org/abs/2304.14407) |      ChatVideo      | 04/2023 |    [project page](https://www.wangjunke.info/ChatVideo/)     | arXiv |
| [**MM-VID: Advancing Video Understanding with GPT-4V(ision)**](https://arxiv.org/abs/2310.19773) |       MM-VID        | 10/2023 |                              -                               | arXiv |
| [**MISAR: A Multimodal Instructional System with Augmented Reality**](https://arxiv.org/abs/2310.11699v1)[![Star](https://img.shields.io/github/stars/nguyennm1024/misar.svg?style=social&label=Star)](https://github.com/nguyennm1024/misar) |        MISAR        | 10/2023 |    [project page](https://github.com/nguyennm1024/misar)     | ICCV  |
| [**Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos**](https://arxiv.org/abs/2312.17117) | Grounding-Prompter  | 12/2023 |                              -                               | arXiv |
| [**NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation**](https://arxiv.org/pdf/2402.15852) | NaVid  | 02/2024 |     [project page](https://pku-epic.github.io/NaVid/)      | RSS |
| [**VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding**](https://arxiv.org/abs/2403.11481) | VideoAgent  | 03/2024 |     [project page](https://videoagent.github.io/)      | arXiv |
| [**VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs**](https://arxiv.org/pdf/2409.20365) |   VideoINSTA   | 09/2024 |      [code](https://github.com/mayhugotong/VideoINSTA)       | EMNLP |
| [**Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning**](https://arxiv.org/pdf/2506.13654) [![Star](https://img.shields.io/github/stars/egolife-ai/Ego-R1.svg?style=social&label=Star)](https://github.com/egolife-ai/Ego-R1) |   Ego-R1 Agent   | 06/2025 |      [code](https://github.com/egolife-ai/Ego-R1)       | arXiv |

#### 🎥 Vid-LLM Pretraining

| Title                                                        |  Model  |  Date   |                        Code                        |  Venue  |
| :----------------------------------------------------------- | :-----: | :-----: | :------------------------------------------------: | :-----: |
| [**Learning Video Representations from Large Language Models**](https://arxiv.org/abs/2212.04501)[![Star](https://img.shields.io/github/stars/facebookresearch/lavila?style=social&label=Star)](https://github.com/facebookresearch/lavila) | LaViLa  | 12/2022 | [code](https://github.com/facebookresearch/lavila) |  CVPR   |
| [**Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning**](https://arxiv.org/abs/2302.14115) | Vid2Seq | 02/2023 |  [code](https://antoyang.github.io/vid2seq.html)   |  CVPR   |
| [**VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset**](https://arxiv.org/abs/2305.18500v1)[![Star](https://img.shields.io/github/stars/txh-mercury/vast?style=social&label=Star)](https://github.com/txh-mercury/vast) |  VAST   | 05/2023 |    [code](https://github.com/txh-mercury/vast)     | NeurIPS |
| [**Merlin: Empowering Multimodal LLMs with Foresight Minds**](https://arxiv.org/abs/2312.00589v1) | Merlin  | 12/2023 |                         -                          |  arXiv  |

#### 👀 Vid-LLM Instruction Tuning

##### Fine-tuning with Connective Adapters

| Title                                                        |     Model     |  Date   |                         Code                         | Venue |
| :----------------------------------------------------------- | :-----------: | :-----: | :--------------------------------------------------: | :---: |
| [**Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding**](https://arxiv.org/abs/2306.02858) [![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/Video-LLaMA.svg?style=social&label=Star)](https://github.com/DAMO-NLP-SG/Video-LLaMA) |  Video-LLaMA  | 06/2023 |  [code](https://github.com/DAMO-NLP-SG/Video-LLaMA)  | arXiv |
| [**VALLEY: Video Assistant with Large Language model Enhanced abilitY**](https://arxiv.org/abs/2306.07207)[![Star](https://img.shields.io/github/stars/RupertLuo/Valley.svg?style=social&label=Star)](https://github.com/RupertLuo/Valley) |    VALLEY     | 06/2023 |     [code](https://github.com/RupertLuo/Valley)      |   -   |
| [**Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models**](https://arxiv.org/abs/2306.05424)[![Star](https://img.shields.io/github/stars/mbzuai-oryx/Video-ChatGPT.svg?style=social&label=Star)](https://github.com/mbzuai-oryx/Video-ChatGPT) | Video-ChatGPT | 06/2023 | [code](https://github.com/mbzuai-oryx/Video-ChatGPT) | arXiv |
| [**Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration**](https://arxiv.org/abs/2306.09093)[![Star](https://img.shields.io/github/stars/lyuchenyang/macaw-llm.svg?style=social&label=Star)](https://github.com/lyuchenyang/macaw-llm) |   Macaw-LLM   | 06/2023 |   [code](https://github.com/lyuchenyang/macaw-llm)   | arXiv |
| [**LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning**](https://arxiv.org/abs/2306.10354) [![Star](https://img.shields.io/github/stars/zjr2000/llmva-gebc.svg?style=social&label=Star)](https://github.com/zjr2000/llmva-gebc) |  LLMVA-GEBC   | 06/2023 |    [code](https://github.com/zjr2000/llmva-gebc)     | CVPR  |
| [**Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks**](https://arxiv.org/abs/2306.04362) [![Star](https://img.shields.io/github/stars/x-plug/youku-mplug.svg?style=social&label=Star)](https://github.com/x-plug/youku-mplug) |  mPLUG-video  | 06/2023 |    [code](https://github.com/x-plug/youku-mplug)     | arXiv |
| [**MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**](https://arxiv.org/abs/2307.16449)[![Star](https://img.shields.io/github/stars/rese1f/MovieChat.svg?style=social&label=Star)](https://github.com/rese1f/MovieChat) |   MovieChat   | 07/2023 |     [code](https://github.com/rese1f/MovieChat)      | arXiv |
| [**Large Language Models are Temporal and Causal Reasoners for Video Question Answering**](https://arxiv.org/abs/2310.15747)[![Star](https://img.shields.io/github/stars/mlvlab/Flipped-VQA.svg?style=social&label=Star)](https://github.com/mlvlab/Flipped-VQA) |   LLaMA-VQA   | 10/2023 |    [code](https://github.com/mlvlab/Flipped-VQA)     | EMNLP |
| [**Video-LLaVA: Learning United Visual Representation by Alignment Before Projection**](https://arxiv.org/pdf/2311.10122v2.pdf)[![Star](https://img.shields.io/github/stars/PKU-YuanGroup/Video-LLaVA.svg?style=social&label=Star)](https://github.com/PKU-YuanGroup/Video-LLaVA) |  Video-LLaVA  | 11/2023 | [code](https://github.com/PKU-YuanGroup/Video-LLaVA) | arXiv |
| [**Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding**](https://arxiv.org/abs/2311.08046)[![Star](https://img.shields.io/github/stars/pku-yuangroup/chat-univi.svg?style=social&label=Star)](https://github.com/pku-yuangroup/chat-univi) |  Chat-UniVi   | 11/2023 | [code](https://github.com/pku-yuangroup/chat-univi)  | arXiv |
| [**LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models**](https://arxiv.org/abs/2311.17043)[![Star](https://img.shields.io/github/stars/dvlab-research/LLaMA-VID)](https://github.com/dvlab-research/LLaMA-VID) |  LLaMA-VID   | 11/2023 | [code](https://github.com/dvlab-research/LLaMA-VID)  | arXiv |
| [**VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens**](https://arxiv.org/abs/2312.08870) |  VISTA-LLAMA   | 12/2023 | - | arXiv |
| [**Audio-Visual LLM for Video Understanding**](https://arxiv.org/abs/2312.06720) | - | 12/2023 | - | arXiv |
| [**AutoAD: Movie Description in Context**](https://openaccess.thecvf.com/content/CVPR2023/papers/Han_AutoAD_Movie_Description_in_Context_CVPR_2023_paper.pdf) |    AutoAD     | 06/2023 |     [code](https://github.com/TengdaHan/AutoAD)      | CVPR  |
| [**AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description**](https://openaccess.thecvf.com/content/ICCV2023/papers/Han_AutoAD_II_The_Sequel_-_Who_When_and_What_in_ICCV_2023_paper.pdf) |   AutoAD II   | 10/2023 |                          -                           | ICCV  |
| [**AutoAD III: The Prequel -- Back to the Pixels**](https://www.robots.ox.ac.uk/~vgg/publications/2024/Han24/han24.pdf) |   AutoAD III   | 04/2024 |                          -                           | CVPR  |
| [**Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models**](https://arxiv.org/abs/2310.05863v2)[![Star](https://img.shields.io/github/stars/the-anonymous-bs/favor.svg?style=social&label=Star)](https://github.com/the-anonymous-bs/favor) |     FAVOR     | 10/2023 |  [code](https://github.com/the-anonymous-bs/favor)   | arXiv |
| [**VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https://arxiv.org/abs/2406.07476)[![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/VideoLLaMA2.svg?style=social&label=Star)](https://github.com/DAMO-NLP-SG/VideoLLaMA2) |     VideoLLaMA2     | 06/2024 |  [code](https://github.com/DAMO-NLP-SG/VideoLLaMA2)   | arXiv |
| [**PAVE: Patching and Adapting Video Large Language Models**](https://arxiv.org/abs/2503.19794) |   PAVE   | 03/2025 | [code](https://github.com/dragonlzm/PAVE) | CVPR |
| [**Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding**](https://arxiv.org/abs/2505.12605) |     Temporal Recipe     | 05/2025 |  [code](https://github.com/nguyentthong/temporal_recipe)   | arXiv |

##### Fine-tuning with Insertive Adapters

| Title                                                        |  Model   |  Date   |                    Code                    | Venue |
| :----------------------------------------------------------- | :------: | :-----: | :----------------------------------------: | :---: |
| [**Otter: A Multi-Modal Model with In-Context Instruction Tuning**](https://arxiv.org/abs/2305.03726v1)[![Star](https://img.shields.io/github/stars/luodian/otter.svg?style=social&label=Star)](https://github.com/luodian/otter) |  Otter   | 05/2023 |  [code](https://github.com/luodian/otter)  | arXiv |
| [**VideoLLM: Modeling Video Sequence with Large Language Models**](https://arxiv.org/abs/2305.13292)[![Star](https://img.shields.io/github/stars/cg1177/videollm.svg?style=social&label=Star)](https://github.com/cg1177/videollm) | VideoLLM | 05/2023 | [code](https://github.com/cg1177/videollm) | arXiv |

##### Fine-tuning with Hybrid Adapters

| Title                                                        |   Model   |  Date   |                     Code                     | Venue |
| :----------------------------------------------------------- | :-------: | :-----: | :------------------------------------------: | :---: |
| [**VTimeLLM: Empower LLM to Grasp Video Moments**](https://arxiv.org/abs/2311.18445v1)[![Star](https://img.shields.io/github/stars/huangb23/vtimellm.svg?style=social&label=Star)](https://github.com/huangb23/vtimellm) | VTimeLLM  | 11/2023 | [code](https://github.com/huangb23/vtimellm) | arXiv |
| [**GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation**](https://arxiv.org/abs/2311.16511v1) | GPT4Video | 11/2023 |                      -                       | arXiv |

#### 🦾 Hybrid Methods

| Title                                                        |        Model        |  Date   |                             Code                             | Venue |
| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |
| [**VideoChat: Chat-Centric Video Understanding**](https://arxiv.org/abs/2305.06355)[![Star](https://img.shields.io/github/stars/OpenGVLab/Ask-Anything.svg?style=social&label=Star)](https://github.com/OpenGVLab/Ask-Anything) |      VideoChat      | 05/2023 | [code](https://github.com/OpenGVLab/Ask-Anything)  [demo](https://huggingface.co/spaces/ynhe/AskAnything) | arXiv |
| [**PG-Video-LLaVA: Pixel Grounding Large Video-Language Models**](https://arxiv.org/abs/2311.13435v2)[![Star](https://img.shields.io/github/stars/mbzuai-oryx/video-llava.svg?style=social&label=Star)](https://github.com/mbzuai-oryx/video-llava) |   PG-Video-LLaVA    | 11/2023 |      [code](https://github.com/mbzuai-oryx/video-llava)      | arXiv |
| [**TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**](https://arxiv.org/abs/2312.02051)[![Star](https://img.shields.io/github/stars/RenShuhuai-Andy/TimeChat.svg?style=social&label=Star)](https://github.com/RenShuhuai-Andy/TimeChat) |      TimeChat       | 12/2023 |     [code](https://github.com/RenShuhuai-Andy/TimeChat)      | CVPR |
| [**Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding**](https://arxiv.org/pdf/2401.00901.pdf)[![Star](https://img.shields.io/github/stars/TalalWasim/Video-GroundingDINO.svg?style=social&label=Star)](https://github.com/TalalWasim/Video-GroundingDINO) | Video-GroundingDINO | 12/2023 |  [code](https://github.com/TalalWasim/Video-GroundingDINO)   | arXiv
|\n| [**A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot**](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.608\u002F) | Video4096 | 05\u002F2023 | - | EMNLP |\n\n#### 💎 Training-free Methods\n\n| Title                                                        |        Model        |  Date   | Code | Venue |\n| :----------------------------------------------------------- | :-----------------: | :-----: | :--: | :---: |\n| [**Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14401) |   DyTo    | 11\u002F2024 |  [code](https:\u002F\u002Fgithub.com\u002FJam1ezhang\u002FDYTO)   | ICCV |\n| [**SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.15841) |   SlowFast-LLaVA    | 07\u002F2024 |  -   | arXiv |\n| [**TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11066) | TS-LLaVA | 11\u002F2024 | [code](https:\u002F\u002Fgithub.com\u002Ftingyu215\u002FTS-LLaVA) | arXiv |\n| [**Can Sound Replace Vision in LLaVA With Token Substitution?**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.10416) | SoundCLIP | 08\u002F2025 | [code](https:\u002F\u002Fgithub.com\u002Fali-vosoughi\u002FSoundCLIP) | arXiv |\n| [**D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition**](https:\u002F\u002Faclanthology.org\u002F2025.emnlp-main.597\u002F) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhukcc\u002FD-CoDe.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhukcc\u002FD-CoDe) | D-CoDe | 08\u002F2025 | [code](https:\u002F\u002Fgithub.com\u002Fhukcc\u002FD-CoDe) [project page](https:\u002F\u002Fhukcc.github.io\u002FD-CoDe\u002F) | EMNLP |\n\n---\n\n## Tasks, Datasets, and Benchmarks\n\n#### Recognition and 
Anticipation\n\n| Name               |                            Paper                             | Date |                            Link                             |  Venue  |\n| :----------------- | :----------------------------------------------------------: | :--: | :---------------------------------------------------------: | :-----: |\n| **Charades**       | [**Hollywood in homes: Crowdsourcing data collection for activity understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.01753v3) | 2016 |        [Link](http:\u002F\u002Fvuchallenge.org\u002Fcharades.html)         |  ECCV   |\n| **YouTube8M**      | [**YouTube-8M: A Large-Scale Video Classification Benchmark**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1609.08675v1) | 2016 | [Link](https:\u002F\u002Fresearch.google.com\u002Fyoutube8m\u002Fdownload.html) |    -    |\n| **ActivityNet**    | [**ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_cvpr_2015\u002Fpapers\u002FHeilbron_ActivityNet_A_Large-Scale_2015_CVPR_paper.pdf) | 2015 |              [Link](http:\u002F\u002Factivity-net.org\u002F)               |  CVPR   |\n| **Kinetics-GEBC**  | [**GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.00486v4) | 2022 |         [Link](https:\u002F\u002Fgithub.com\u002Fshowlab\u002Fgeb-plus)         |  ECCV   |\n| **Kinetics-400**   | [**The Kinetics Human Action Video Dataset**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.06950) | 2017 |  [Link](https:\u002F\u002Fpaperswithcode.com\u002Fdataset\u002Fkinetics-400-1)  |    -    |\n| **VidChapters-7M** | [**VidChapters-7M: Video Chapters at Scale**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.13952) | 2023 |     [Link](https:\u002F\u002Fantoyang.github.io\u002Fvidchapters.html)     | NeurIPS |\n| **BlackSwanSuite** | [**Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable 
Events**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05725) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsahithyaravi\u002FBlackSwan.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsahithyaravi\u002FBlackSwan) | 2025 |[Link](https:\u002F\u002Fblackswan.cs.ubc.ca) |                CVPR |\n\n#### Captioning and Description\n| Name               |                            Paper                             | Date |                            Link                             |  Venue  |\n| :----------------- | :----------------------------------------------------------: | :--: | :---------------------------------------------------------: | :-----: |\n|**Microsoft Research Video Description Corpus (MSVD)**|[**Collecting Highly Parallel Data for Paraphrase Evaluation**](https:\u002F\u002Faclanthology.org\u002FP11-1020.pdf)|2011|[Link](https:\u002F\u002Fwww.cs.utexas.edu\u002Fusers\u002Fml\u002Fclamp\u002FvideoDescription\u002F#data)|ACL|\n|**Microsoft Research Video-to-Text (MSR-VTT)**|[**MSR-VTT: A Large Video Description Dataset for Bridging Video and Language**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_cvpr_2016\u002Fpapers\u002FXu_MSR-VTT_A_Large_CVPR_2016_paper.pdf)|2016|[Link](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpublication\u002Fmsr-vtt-a-large-video-description-dataset-for-bridging-video-and-language\u002F)|CVPR|\n|**Tumblr GIF (TGIF)**|[**TGIF: A New Dataset and Benchmark on Animated GIF Description**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.02748v2)|2016|[Link](https:\u002F\u002Fgithub.com\u002Fraingo\u002FTGIF-Release)|CVPR|\n|**Charades**|[**Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.01753v3)|2016|[Link](https:\u002F\u002Fprior.allenai.org\u002Fprojects\u002Fcharades)|ECCV|\n|**Charades-Ego**|[**Actor and Observer: Joint Modeling of First and Third-Person 
Videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1804.09626)|2018|[Link](https:\u002F\u002Fprior.allenai.org\u002Fprojects\u002Fcharades-ego)|CVPR|\n|**ActivityNet Captions**|[**Dense-Captioning Events in Videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.00754)|2017|[Link](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Franjaykrishna\u002Fdensevid\u002F)|ICCV|\n|**HowTo100M**|[**HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.03327)|2019|[Link](https:\u002F\u002Fwww.di.ens.fr\u002Fwillow\u002Fresearch\u002Fhowto100m\u002F)|ICCV|\n|**Movie Audio Descriptions (MAD)**|[**MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.00431)|2021|[Link](https:\u002F\u002Fgithub.com\u002FSoldelli\u002FMAD)|CVPR|\n|**YouCook2**|[**Towards Automatic Learning of Procedures from Web Instructional Videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1703.09788)|2017|[Link](http:\u002F\u002Fyoucook2.eecs.umich.edu\u002F)|AAAI|\n|**MovieNet**|[**MovieNet: A Holistic Dataset for Movie Understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.10937)|2020|[Link](https:\u002F\u002Fmovienet.github.io\u002F)|ECCV|\n|**Youku-mPLUG**|[**Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.04362)|2023|[Link](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FYouku-mPLUG)|arXiv|\n|**Video Timeline Tags (ViTT)**|[**Multimodal Pretraining for Dense Video Captioning**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.11760)|2020|[Link](https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002FVideo-Timeline-Tags-ViTT)|AACL-IJCNLP|\n|**TVSum**|[**TVSum: Summarizing web videos using 
titles**](https:\u002F\u002Fwww.cv-foundation.org\u002Fopenaccess\u002Fcontent_cvpr_2015\u002Fpapers\u002FSong_TVSum_Summarizing_Web_2015_CVPR_paper.pdf)|2015|[Link](https:\u002F\u002Fgithub.com\u002Fyalesong\u002Ftvsum)|CVPR|\n|**SumMe**|[**Creating Summaries from User Videos**](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FCreating-Summaries-from-User-Videos-Gygli-Grabner\u002F799bf307438ec2171e6f0bd5b8040f678d5b28da)|2014|[Link](http:\u002F\u002Fwww.vision.ee.ethz.ch\u002F~gyglim\u002Fvsum\u002F)|ECCV|\n|**VideoXum**|[**VideoXum: Cross-modal Visual and Textural Summarization of Videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12060)|2023|[Link](https:\u002F\u002Fvideoxum.github.io\u002F)|IEEE Trans Multimedia|\n|**Multi-Source Video Captioning (MSVC)**|[**VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07476)|2024|[Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDAMO-NLP-SG\u002FMulti-Source-Video-Captioning)|arXiv|\n\n#### Grounding and Retrieval\n| Name               |                            Paper                             | Date |                            Link                             |  Venue  |\n| :----------------- | :----------------------------------------------------------: | :--: | :---------------------------------------------------------: | :-----: |\n|**Epic-Kitchens-100**|[**Rescaling Egocentric Vision**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.13256v4)|2021|[Link](https:\u002F\u002Fepic-kitchens.github.io\u002F2021)|IJCV|\n|**VCR (Visual Commonsense Reasoning)**|[**From Recognition to Cognition: Visual Commonsense Reasoning**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.10830v2)|2019|[Link](https:\u002F\u002Fvisualcommonsense.com\u002F)|CVPR|\n|**Ego4D-MQ and Ego4D-NLQ**|[**Ego4D: Around the World in 3,000 Hours of Egocentric 
Video**](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Fego4d-unscripted-first-person-video-from-around-the-world-and-a-benchmark-suite-for-egocentric-perception\u002F)|2021|[Link](https:\u002F\u002Fego4d-data.org\u002F)|CVPR|\n|**Vid-STG**|[**Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2001.06891)|2020|[Link](https:\u002F\u002Fgithub.com\u002FGuaranteer\u002FVidSTG-Dataset)|CVPR|\n|**Charades-STA**|[**TALL: Temporal Activity Localization via Language Query**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.02101)|2017|[Link](https:\u002F\u002Fgithub.com\u002Fjiyanggao\u002FTALL)|ICCV|\n|**DiDeMo**|[**Localizing Moments in Video with Natural Language**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1708.01641)|2017|[Link](https:\u002F\u002Fgithub.com\u002FLisaAnne\u002FTemporalLanguageRelease)|ICCV|\n\n#### Question Answering\n| Name               |                            Paper                             | Date |                            Link                             |  Venue  |\n| :----------------- | :----------------------------------------------------------: | :--: | :---------------------------------------------------------: | :-----: |\n|**MSVD-QA**|[**Video Question Answering via Gradually Refined Attention over Appearance and Motion**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3123266.3123427)|2017|[Link](https:\u002F\u002Fgithub.com\u002Fxudejing\u002Fvideo-question-answering)|ACM Multimedia|\n|**MSRVTT-QA**|[**Video Question Answering via Gradually Refined Attention over Appearance and Motion**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3123266.3123427)|2017|[Link](https:\u002F\u002Fgithub.com\u002Fxudejing\u002Fvideo-question-answering)|ACM Multimedia|\n|**TGIF-QA**|[**TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question 
Answering**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1704.04497)|2017|[Link](https:\u002F\u002Fgithub.com\u002FYunseokJANG\u002Ftgif-qa)|CVPR|\n|**ActivityNet-QA**|[**ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.02467)|2019|[Link](https:\u002F\u002Fgithub.com\u002FMILVLG\u002Factivitynet-qa)|AAAI|\n|**Pororo-QA**|[**DeepStory: Video Story QA by Deep Embedded Memory Networks**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.00836)|2017|[Link](https:\u002F\u002Fgithub.com\u002FKyung-Min\u002FPororoQA)|IJCAI|\n|**TVQA**|[**TVQA: Localized, Compositional Video Question Answering**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1809.01696)|2018|[Link](https:\u002F\u002Ftvqa.cs.unc.edu\u002F)|EMNLP|\n|**MAD-QA**|[**Encoding and Controlling Global Semantics for Long-form Video Question Answering**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.19723)|2024|[Link](https:\u002F\u002Fgithub.com\u002Fzhiyuanhubj\u002Flong_form_videoqa)|EMNLP|\n|**Ego-QA**|[**Encoding and Controlling Global Semantics for Long-form Video Question Answering**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.19723)|2024|[Link](https:\u002F\u002Fgithub.com\u002Fzhiyuanhubj\u002Flong_form_videoqa)|EMNLP|\n| **BlackSwanSuite** | [**Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05725) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsahithyaravi\u002FBlackSwan.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsahithyaravi\u002FBlackSwan) | 2025 | [Link](https:\u002F\u002Fblackswan.cs.ubc.ca) | CVPR |\n| **CrossVid** | [**CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.12263) | 2025 | [Link](https:\u002F\u002Fgithub.com\u002Fchuntianli666\u002FCrossVid) | AAAI |\n\n#### Video Instruction 
Tuning\n##### Pretraining Dataset\n| Name               |                            Paper                             | Date |                            Link                             |  Venue  |\n| :----------------- | :----------------------------------------------------------: | :--: | :---------------------------------------------------------: | :-----: |\n|**VidChapters-7M**|[**VidChapters-7M: Video Chapters at Scale**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.13952)|2023|[Link](https:\u002F\u002Fantoyang.github.io\u002Fvidchapters.html)|NeurIPS|\n|**VALOR-1M**|[**VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08345)|2023|[Link](https:\u002F\u002Fgithub.com\u002FTXH-mercury\u002FVALOR)|arXiv|\n|**Youku-mPLUG**|[**Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.04362)|2023|[Link](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FYouku-mPLUG)|arXiv|\n|**InternVid**|[**InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06942)|2023|[Link](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVideo\u002Ftree\u002Fmain\u002FData\u002FInternVid)|arXiv|\n|**VAST-27M**|[**VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18500)|2023|[Link](https:\u002F\u002Fgithub.com\u002FTXH-mercury\u002FVAST)|NeurIPS|\n\n##### Fine-tuning Dataset\n| Name               |                            Paper                             | Date |                            Link                             |  Venue  |\n| :----------------- | :----------------------------------------------------------: | :--: | :---------------------------------------------------------: | :-----: |\n|**MIMIC-IT**|[**MIMIC-IT: Multi-Modal In-Context Instruction 
Tuning**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05425)|2023|[Link](https:\u002F\u002Fgithub.com\u002Fluodian\u002Fotter)|arXiv|\n|**VideoInstruct100K**|[**Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05424)|2023|[Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FVideoInstruct-100K)|arXiv|\n|**TimeIT**|[**TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.02051)|2023|[Link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FShuhuaiRen\u002FTimeIT)|CVPR|\n\n#### Video-based Large Language Models Benchmark\n\n| Title                                                        |  Date   |                            Code                            |              Venue               |\n| :----------------------------------------------------------- | :-----: | :--------------------------------------------------------: | :------------------------------: |\n| [**LVBench: An Extreme Long Video Understanding Benchmark**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.08035) | 06\u002F2024 |    [code](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FLVBench)    |                -                 |\n| [**Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16103) | 11\u002F2023 |    [code](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-Bench)    |                -                 |\n| [**Perception Test: A Diagnostic Benchmark for Multimodal Video Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13786) | 05\u002F2023 | [code](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Fperception_test) | NeurIPS 2023, ICCV 2023 Workshop |\n| [**Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and 
Benchmarks**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.04362v1) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fx-plug\u002Fyouku-mplug.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fx-plug\u002Fyouku-mplug) | 07\u002F2023 |       [code](https:\u002F\u002Fgithub.com\u002Fx-plug\u002Fyouku-mplug)        |                -                 |\n| [**FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.01813) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fllyx97\u002Ffetv.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fllyx97\u002Ffetv) | 11\u002F2023 |           [code](https:\u002F\u002Fgithub.com\u002Fllyx97\u002Ffetv)           |                NeurIPS 2023                 |\n| [**MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04817) | 12\u002F2023 |         [code](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FMoVQA)         |                -                 |\n| [**MVBench: A Comprehensive Multi-modal Video Understanding Benchmark**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17005) | 12\u002F2023 |     [code](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything)      |                -                 |\n| [**TempCompass: Do Video LLMs Really Understand Videos?**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.00476) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fllyx97\u002FTempCompass.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fllyx97\u002FTempCompass) | 03\u002F2024 |           [code](https:\u002F\u002Fgithub.com\u002Fllyx97\u002FTempCompass)           |                ACL 2024                 |\n| [**Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.21075) 
[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FVideo-MME.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME) | 06\u002F2024 |           [code](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME)           |                -                 |\n| [**VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16338) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpatrick-tssn\u002FVideoHallucer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fpatrick-tssn\u002FVideoHallucer) | 06\u002F2024 |           [code](https:\u002F\u002Fgithub.com\u002Fpatrick-tssn\u002FVideoHallucer)           |                -                 |\n| [**Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05725) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsahithyaravi\u002FBlackSwan.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsahithyaravi\u002FBlackSwan) | 06\u002F2025 |           [code](https:\u002F\u002Fgithub.com\u002Fsahithyaravi\u002FBlackSwan)           |                CVPR 2025             |\n| [**Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.22385) | 08\u002F2025 |          -          |                -             |\n| [**CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.12263) | 11\u002F2025 |           [code](https:\u002F\u002Fgithub.com\u002Fchuntianli666\u002FCrossVid)           |                AAAI 2026             |\n| [**MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal 
LLMs**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.07250) | 11\u002F2025 | [code](https:\u002F\u002Fgithub.com\u002FNJU-LINK\u002FMVU-Eval) | NeurIPS DB |\n| [**OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.10689) | 10\u002F2025 | [code](https:\u002F\u002Fgithub.com\u002FNJU-LINK\u002FOmniVideoBench) | arXiv |\n| [**IF-VidCap: Can Video Caption Models Follow Instructions?**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.18726) | 10\u002F2025 | [code](https:\u002F\u002Fgithub.com\u002FNJU-LINK\u002FIF-VidCap) | arXiv |\n\n## Contributing\n\nWe welcome everyone to contribute to this repository and help improve it. You can submit pull requests to add new papers, projects, and helpful materials, or to correct any errors that you may find. Please make sure that your pull requests follow the \"Title|Model|Date|Code|Venue\" format. Thank you for your valuable contributions!\n\n\n### 🌟 Star History\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyunlong10_Awesome-LLMs-for-Video-Understanding_readme_a5c6fbbc7e64.png)](https:\u002F\u002Fstar-history.com\u002F#yunlong10\u002FAwesome-LLMs-for-Video-Understanding&Date)\n\n### ♥️ Contributors\n\nOur project wouldn't be possible without the contributions of these amazing people! Thank you all for making this project better.\n\n[Yolo Y. 
Tang](https:\u002F\u002Fyunlong10.github.io) @ University of Rochester \\\n[Jing Bi](https:\u002F\u002Fjing-bi.github.io) @ University of Rochester \\\n[Siting Xu](https:\u002F\u002Fsai-01.github.io) @ Southern University of Science and Technology \\\n[Luchuan Song](https:\u002F\u002Fsongluchuan.github.io) @ University of Rochester \\\n[Susan Liang](https:\u002F\u002Fliangsusan-git.github.io) @ University of Rochester \\\n[Teng Wang](http:\u002F\u002Fttengwang.com\u002F) @ The University of Hong Kong \\\n[Daoan Zhang](https:\u002F\u002Fdwan.ch) @ University of Rochester \\\n[Jie An](https:\u002F\u002Fpkuanjie.com) @ University of Rochester \\\n[Jingyang Lin](https:\u002F\u002Fjylin.me) @ University of Rochester \\\n[Rongyi Zhu](https:\u002F\u002Frongyizhu.github.io) @ University of Rochester \\\n[Ali Vosoughi](https:\u002F\u002Falivosoughi.com) @ University of Rochester \\\n[Chao Huang](https:\u002F\u002Fwikichao.github.io) @ University of Rochester \\\n[Zeliang Zhang](https:\u002F\u002Fzhangaipi.github.io) @ University of Rochester \\\n[Pinxin Liu](https:\u002F\u002Fandypinxinliu.github.io) @ University of Rochester \\\n[Mingqian Feng](https:\u002F\u002Ffmmarkmq.github.io) @ University of Rochester \\\n[Feng Zheng](https:\u002F\u002Fwww.sustech.edu.cn\u002Fen\u002Ffaculties\u002Fzhengfeng.html) @ Southern University of Science and Technology \\\n[Jianguo Zhang](https:\u002F\u002Ffaculty.sustech.edu.cn\u002F?tagid=zhangjg&iscss=1&snapid=1&orderby=date&go=2&lang=en) @ Southern University of Science and Technology \\\n[Ping Luo](http:\u002F\u002Fluoping.me) @ University of Hong Kong \\\n[Jiebo Luo](https:\u002F\u002Fwww.cs.rochester.edu\u002Fu\u002Fjluo\u002F) @ University of Rochester \\\n[Chenliang Xu](https:\u002F\u002Fwww.cs.rochester.edu\u002F~cxu22\u002Findex.html) @ University of Rochester \n\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyunlong10_Awesome-LLMs-for-Video-Understanding_readme_cba89f61381f.png\" \u002F>\n\u003C\u002Fa>\n\n","# 用于视频理解的优秀大语言模型 [![Awesome](https:\u002F\u002Fawesome.re\u002Fbadge.svg)](https:\u002F\u002Fawesome.re)\n\n### 🔥🔥🔥 [利用大型语言模型进行视频理解：综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17432)\n\n> *[唐云龙](https:\u002F\u002Fyunlong10.github.io\u002F)\u003Csup>1\u003C\u002Fsup>, [毕静](https:\u002F\u002Fjing-bi.github.io\u002F)\u003Csup>1\u003C\u002Fsup>, [徐思婷](https:\u002F\u002Fsai-01.github.io\u002F)\u003Csup>2\u003C\u002Fsup>, [宋陆川](https:\u002F\u002Fsongluchuan.github.io\u002F)\u003Csup>1\u003C\u002Fsup>, [梁苏珊](https:\u002F\u002Fliangsusan-git.github.io\u002F)\u003Csup>1\u003C\u002Fsup> , [王腾](http:\u002F\u002Fttengwang.com\u002F)\u003Csup>2,3\u003C\u002Fsup> , [张道安](https:\u002F\u002Fdwan.ch\u002F)\u003Csup>1\u003C\u002Fsup> , [安杰](https:\u002F\u002Fpkuanjie.com\u002F)\u003Csup>1\u003C\u002Fsup> , [林景阳](https:\u002F\u002Fjylin.me\u002F)\u003Csup>1\u003C\u002Fsup> , [朱荣毅](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Frongyi-zhu-41b15124a\u002F)\u003Csup>1\u003C\u002Fsup> , [阿里·沃索吉](https:\u002F\u002Falivosoughi.com\u002F)\u003Csup>1\u003C\u002Fsup> , [黄超](https:\u002F\u002Fwikichao.github.io\u002F)\u003Csup>1\u003C\u002Fsup> , [张泽良](https:\u002F\u002Fzhangaipi.github.io\u002F)\u003Csup>1\u003C\u002Fsup> , [刘品欣](https:\u002F\u002Fandypinxinliu.github.io\u002F)\u003Csup>1\u003C\u002Fsup> , [冯明谦](https:\u002F\u002Ffmmarkmq.github.io\u002F)\u003Csup>1\u003C\u002Fsup> , [郑峰](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=PcmyXHMAAAAJ)\u003Csup>2\u003C\u002Fsup> , [张建国](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=ypSmZtIAAAAJ)\u003Csup>2\u003C\u002Fsup> , [罗平](http:\u002F\u002Fluoping.me\u002F)\u003Csup>3\u003C\u002Fsup> , [罗杰博](https:\u002F\u002Fwww.cs.rochester.edu\u002Fu\u002Fjluo\u002F)\u003Csup>1\u003C\u002Fsup>, 
[许晨亮](https:\u002F\u002Fwww.cs.rochester.edu\u002F~cxu22\u002Findex.html)\u003Csup>1\u003C\u002Fsup>.*\n\n> *\u003Csup>1\u003C\u002Fsup>罗切斯特大学, \u003Csup>2\u003C\u002Fsup>南方科技大学, \u003Csup>3\u003C\u002Fsup>香港大学*\n\n\u003Ch5 align=\"center\">  \n\n **[论文](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10982110)** | **[arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17432)** | **[项目页面](https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding)**\n\n\u003C\u002Fh5>\n\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyunlong10_Awesome-LLMs-for-Video-Understanding_readme_d95a7c20c08f.png)\n\n\n## 📢 新闻\n[10\u002F06\u002F2025]\n\n🔥 我们的后续工作——[视频-LMM 后训练：深入探讨大型多模态模型的视频推理](https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-Video-LMM-Post-Training)—现已在 [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.05034) 和 [Hugging Face Papers](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2510.05034) 上发布！\n\n\n[05\u002F04\u002F2025]\n\n🌟 我们的 Vid-LLM 综述已被 IEEE 视频技术电路与系统汇刊 (TCSVT) 接受！\n👉 [IEEE Xplore](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10982110) \\| [GitHub](https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding)\n\n[07\u002F23\u002F2024]\n\n📢 我们最近更新了我们的综述：“利用大型语言模型进行视频理解：综述”！\n\n✨ 这份全面的综述涵盖了由大型语言模型驱动的视频理解技术、训练策略、相关任务、数据集、基准测试和评估方法，并讨论了 Vid-LLMs 在各个领域的应用。\n\n🚀 **本次更新的新内容**：\n\u003Cbr>✅ 更新至包含截至2024年6月约100个额外的Vid-LLMs和15个新基准。\n\u003Cbr>✅ 基于视频表示和LLM功能提出了Vid-LLMs的新分类法。\n\u003Cbr>✅ 增加了初步章节，从粒度和语言参与的角度重新分类了视频理解任务，并增强了LLM背景部分。\n\u003Cbr>✅ 增加了新的训练策略章节，移除了适配器作为模型分类的因素。\n\u003Cbr>✅ 所有图表均已重新设计。\n\n在这次重大更新之后，还将进行多次小幅更新。GitHub仓库也将很快逐步更新。我们欢迎您的阅读和反馈 ❤️\n\n\u003Cfont size=5>\u003Ccenter>\u003Cb> 目录 \u003C\u002Fb> \u003C\u002Fcenter>\u003C\u002Ffont>\n- [用于视频理解的优秀大语言模型 ](#awesome-llms-for-video-understanding-)\n    - [🔥🔥🔥 利用大型语言模型进行视频理解：综述](#-video-understanding-with-large-language-models-a-survey)\n  - [我们为什么需要 Vid-LLMs？](#why-we-need-vid-llms)\n  - [😎 
Vid-LLMs：模型](#-vid-llms-models)\n    - [📑 引用](#-citation)\n      - [🗒️ 分类法1](#️-taxonomy-1)\n        - [🕹️ 视频分析器 × LLM](#️-video-analyzer--llm)\n          - [LLM作为总结者](#llm-as-summarizer)\n          - [LLM作为管理者](#llm-as-manager)\n        - [👾 视频嵌入器 × LLM](#-analyzer--embedder--llm)\n          - [LLM作为文本解码器](#llm-as-text-decoder)\n          - [LLM作为回归器](#llm-as-regressor)\n          - [LLM作为隐藏层](#llm-as-hidden-layer)\n        - [🧭 （分析器+嵌入器）× LLM](#-analyzer--embedder--llm)\n          - [LLM作为管理者](#llm-as-manager-1)\n          - [LLM作为总结者](#llm-as-summarizer-1)\n          - [LLM作为文本解码器](#llm-as-text-decoder-1)\n          - [LLM作为回归器](#llm-as-regressor-1)\n          - [LLM作为隐藏层](#llm-as-hidden-layer-1)\n      - [🗒️ 分类法2](#️-taxonomy-2)\n        - [🤖 基于LLM的视频智能体](#-llm-based-video-agents)\n        - [🎥 Vid-LLM预训练](#-vid-llm-pretraining)\n        - [👀 Vid-LLM指令微调](#-vid-llm-instruction-tuning)\n          - [使用连接型适配器进行微调](#fine-tuning-with-connective-adapters)\n          - [使用插入型适配器进行微调](#fine-tuning-with-insertive-adapters)\n          - [使用混合型适配器进行微调](#fine-tuning-with-hybrid-adapters)\n        - [🦾 混合方法](#-hybrid-methods)\n        - [无需训练的方法](#-training-free-methods)\n  - [任务、数据集和基准测试](#tasks-datasets-and-benchmarks)\n      - [识别与预测](#recognition-and-anticipation)\n      - [字幕与描述](#captioning-and-description)\n      - [定位与检索](#grounding-and-retrieval)\n      - [问答](#question-answering)\n      - [视频指令微调](#video-instruction-tuning)\n        - [预训练数据集](#pretraining-dataset)\n        - [微调数据集](#fine-tuning-dataset)\n      - [基于视频的大语言模型基准](#video-based-large-language-models-benchmark)\n  - [贡献](#contributing)\n    - [🌟 星标历史](#-star-history)\n    - [♥️ 贡献者](#️-contributors)\n\n## 我们为什么需要 Vid-LLMs？\n\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyunlong10_Awesome-LLMs-for-Video-Understanding_readme_4ce7b1bfa126.png)\n\n\n## 😎 Vid-LLMs：模型 

![image](https://oss.gittoolsai.com/images/yunlong10_Awesome-LLMs-for-Video-Understanding_readme_b30792fe303a.png)

### 📑 Citation

If you find our survey helpful for your research, please cite the following paper:

```bibtex
@article{vidllmsurvey,
  author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Liu, Pinxin and Feng, Mingqian and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology}, 
  title={Video Understanding with Large Language Models: A Survey}, 
  year={2025},
  doi={10.1109/TCSVT.2025.3566695}
}
```

### 🗒️ Taxonomy 1

#### 🕹️ Video Analyzer × LLM

##### LLM as Summarizer
| Title | Model | Date | Code | Venue |
| :---- | :----: | :----: | :----: | :----: |
| [**Seeing the Unseen: Visual Metaphor Captioning for Videos**](https://arxiv.org/html/2406.04886v1) | GIT-LLaVA | 06/2024 | [Code]() | arXiv |
| [**Zero-Shot Long Video Understanding through Screenplay**](https://arxiv.org/abs/2406.17309) | MM-Screenplayer | 06/2024 | [Project Page]() | CVPR |
| [**MoReVQA: Exploring Modular Reasoning Models for Video Question Answering**](https://arxiv.org/abs/2404.06511) | MoReVQA | 04/2024 | [Project Page]() | CVPR |
| [**An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM**](https://arxiv.org/abs/2403.18406) | IG-VLM | 03/2024 | [Code](https://github.com/imagegridworth/IG-VLM) | arXiv |
| [**Language Repository for Long Video Understanding**](https://arxiv.org/abs/2403.14622) | LangRepo | 03/2024 | [Code](https://github.com/kkahatapitiya/LangRepo) | arXiv |
| [**Understanding Long Videos in One Multimodal Language Model Pass**](https://arxiv.org/abs/2403.16998) | MVU | 03/2024 | [Code](https://github.com/kahnchana/mvu) | arXiv |
| [**Video ReCap: Recursive Captioning of Hour-Long Videos**](https://arxiv.org/abs/2402.13250) | Video ReCap | 02/2024 | [Code](https://sites.google.com/view/vidrecap) | CVPR |
| [**A Simple LLM Framework for Long-Range Video Question-Answering**](https://arxiv.org/abs/2312.17235) | LLoVi | 12/2023 | [Code](https://github.com/CeeZh/LLoVi) | arXiv |
| [**Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos**](https://arxiv.org/abs/2312.17117) | Grounding-Prompter | 12/2023 | [Code]() | arXiv |
| [**Learning Object State Changes in Videos: An Open-World Perspective**](https://arxiv.org/abs/2312.11782) | VIDOSC | 12/2023 | [Code]() | CVPR |
| [**AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?**](https://arxiv.org/abs/2307.16368) | AntGPT | 07/2023 | [Code](https://brown-palm.github.io/AntGPT) | ICLR |
| [**VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset**](https://arxiv.org/abs/2305.18500v1)[![Star](https://img.shields.io/github/stars/txh-mercury/vast?style=social&label=Star)](https://github.com/txh-mercury/vast) | VAST | 05/2023 | [Code](https://github.com/txh-mercury/vast) | NeurIPS |
| [**VLog: Video as a Long Document**](https://github.com/showlab/VLog)[![Star](https://img.shields.io/github/stars/showlab/VLog.svg?style=social&label=Star)](https://github.com/showlab/VLog) | VLog | 04/2023 | [Code](https://huggingface.co/spaces/TencentARC/VLog) | - |
| [**Learning Video Representations from Large Language Models**](https://arxiv.org/abs/2212.04501)[![Star](https://img.shields.io/github/stars/facebookresearch/lavila?style=social&label=Star)](https://github.com/facebookresearch/lavila) | LaViLa | 12/2022 | [Code](https://github.com/facebookresearch/lavila) | CVPR |

##### LLM as Manager
| Title | Model | Date | Code | Venue |
| :---- | :----: | :----: | :----: | :----: |
| [**DrVideo: Document Retrieval Based Long Video Understanding**](https://arxiv.org/abs/2406.12846) | DrVideo | 06/2024 | [code]() | arXiv |
| [**OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer**](https://arxiv.org/abs/2406.16620) | OmAgent | 06/2024 | [code]() | arXiv |
| [**Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA**](https://arxiv.org/abs/2406.09396) | LVNet | 06/2024 | [code](https://github.com/jongwoopark7978/LVNet) | arXiv |
| [**VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos**](https://arxiv.org/abs/2405.19209) | VideoTree | 05/2024 | [code](https://videotree2024.github.io/) | arXiv |
| [**Harnessing Large Language Models for Training-free Video Anomaly Detection**](https://arxiv.org/abs/2404.01014) | LAVAD | 04/2024 | [code](https://lucazanella.github.io/lavad/) | CVPR |
| [**TraveLER: A Multi-LMM Agent Framework for Video Question-Answering**](https://arxiv.org/abs/2404.01476) | TraveLER | 04/2024 | [code]() | arXiv |
| [**GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features**](https://arxiv.org/abs/2403.01437) | GPTSee | 03/2024 | [code]() | arXiv |
| [**Reframe Anything: LLM Agent for Open World Video Reframing**](https://arxiv.org/abs/2403.06070) | RAVA | 03/2024 | [code]() | arXiv |
| [**SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos**](https://arxiv.org/abs/2403.01599) | SCHEMA | 03/2024 | [code]() | ICLR |
| [**TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning**](https://arxiv.org/abs/2402.19467) | TV-TREES | 02/2024 | [code]() | arXiv |
| [**VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding**](https://arxiv.org/abs/2403.11481) | VideoAgent | 03/2024 | [Project Page](https://videoagent.github.io/) | arXiv |
| [**VideoAgent: Long-form Video Understanding with Large Language Model as Agent**](https://arxiv.org/abs/2403.10517) | VideoAgent | 03/2024 | [code]() | arXiv |
| [**VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding**](https://arxiv.org/abs/2403.14743) | VURF | 03/2024 | [code]() | arXiv |
| [**Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos**](https://arxiv.org/abs/2403.02782) | KEPP | 03/2024 | [code]() | CVPR |
| [**DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models**](https://arxiv.org/abs/2401.08392) | DoraemonGPT | 01/2024 | [code](https://github.com/z-x-yang/DoraemonGPT) | arXiv |
| [**LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos**](https://arxiv.org/abs/2312.05269) | LifelongMemory | 12/2023 | [code](https://github.com/Agentic-Learning-AI-Lab/lifelong-memory) | arXiv |
| [**Zero-Shot Video Question Answering with Procedural Programs**](https://arxiv.org/abs/2312.00937) | ProViQ | 12/2023 | [code](https://rccchoudhury.github.io/proviq2023) | arXiv |
| [**AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn**](https://arxiv.org/abs/2306.08640) | AssistGPT | 06/2023 | [code](https://showlab.github.io/assistgpt/) | arXiv |
| [**ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System**](https://arxiv.org/abs/2304.14407) | ChatVideo | 04/2023 | [Project Page](https://www.wangjunke.info/ChatVideo/) | arXiv |
| [**Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions**](https://arxiv.org/abs/2304.04227)[![Star](https://img.shields.io/github/stars/Vision-CAIR/ChatCaptioner.svg?style=social&label=Star)](https://github.com/Vision-CAIR/ChatCaptioner/tree/main/Video_ChatCaptioner) | Video ChatCaptioner | 04/2023 | [code](https://github.com/Vision-CAIR/ChatCaptioner/tree/main/Video_ChatCaptioner) | arXiv |
| [**ViperGPT: Visual Inference via Python Execution for Reasoning**](https://arxiv.org/abs/2303.08128) | ViperGPT | 03/2023 | [code](https://viper.cs.columbia.edu/) | arXiv |
| [**Hawk: Learning to Understand Open-World Video Anomalies**](https://arxiv.org/abs/2405.16886) | Hawk | 05/2024 | [code](https://github.com/jqtangust/hawk) | arXiv |

#### 👾 Video Embedder × LLM

##### LLM as Text Decoder

| Title | Model | Date | Code | Venue |
| :---- | :----: | :----: | :----: | :----: |
| [**AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark**](https://arxiv.org/abs/2410.03051) | AuroraCap | 10/2024 | [Project Page](https://rese1f.github.io/aurora-web/) | arXiv |
| [**Artemis: Towards Referential Understanding in Complex Videos**](https://arxiv.org/abs/2406.00258) | Artemis | 06/2024 | [Code](https://github.com/qiujihao19/Artemis) | arXiv |
| [**EmoLLM: Multimodal Emotional Understanding Meets Large Language Models**](https://arxiv.org/abs/2406.16442) | EmoLLM | 06/2024 | [Code](https://github.com/yan9qu/EmoLLM) | arXiv |
| [**Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models**](https://arxiv.org/abs/2406.08024) | FTFV-LLM | 06/2024 | - | arXiv |
| [**Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams**](https://arxiv.org/abs/2406.08085) | Flash-VStream | 06/2024 | [Code](https://invinciblewyq.github.io/vstream-page/) | arXiv |
| [**LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living**](https://arxiv.org/abs/2406.09390) | LLAVIDAL | 06/2024 | [Code](https://adl-x.github.io/) | arXiv |
| [**Long Context Transfer from Language to Vision**](https://arxiv.org/abs/2406.16852) | LongVA | 06/2024 | [Code](https://github.com/EvolvingLMMs-Lab/LongVA) | arXiv |
| [**ShareGPT4Video: Improving Video Understanding and Generation with Better Captions**](https://arxiv.org/abs/2406.04325) | ShareGPT4Video | 06/2024 | [Code](https://sharegpt4video.github.io/) | arXiv |
| [**Towards Event-oriented Long Video Understanding**](https://arxiv.org/abs/2406.14129) | VIM | 06/2024 | [Code](https://github.com/RUCAIBox/Event-Bench) | arXiv |
| [**Video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models**](https://arxiv.org/abs/2406.15704) | Video-SALMONN | 06/2024 | [Code](https://github.com/bytedance/SALMONN/) | ICML |
| [**VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding**](https://arxiv.org/abs/2406.09418) | VideoGPT+ | 06/2024 | [Code](https://github.com/mbzuai-oryx/VideoGPT-plus) | arXiv |
| [**VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https://arxiv.org/abs/2406.07476) | VideoLLaMA 2 | 06/2024 | [Code](https://github.com/DAMO-NLP-SG/VideoLLaMA2) | arXiv |
| [**MotionLLM: Understanding Human Behaviors from Human Motions and Videos**](https://arxiv.org/abs/2405.20340) | MotionLLM | 05/2024 | [Project Page](https://lhchen.top/MotionLLM) | arXiv |
| [**MVBench: A Comprehensive Multi-modal Video Understanding Benchmark**](https://arxiv.org/abs/2311.17005) | VideoChat2 | 11/2023 | [Code](https://github.com/OpenGVLab/Ask-Anything) | CVPR |
| [**Shotluck Holmes: A Family of Efficient Small-Scale Vision-Language Models for Video Captioning and Summarization**](https://arxiv.org/abs/2405.20648) | Shotluck Holmes | 05/2024 | - | arXiv |
| [**Streaming Long Video Understanding with Large Language Models**](https://www.arxiv.org/abs/2405.16009) | VideoStreaming | 05/2024 | - | arXiv |
| [**Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline**](https://arxiv.org/html/2405.14040v1) | VideoNarrator | 05/2024 | - | arXiv |
| [**TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment**](https://arxiv.org/abs/2405.13911) | TOPA | 05/2024 | [Code](https://github.com/dhg-wei/TOPA) | NeurIPS |
| [**MovieChat+: Question-aware Sparse Memory for Long Video Question Answering**](https://arxiv.org/abs/2404.17176) | MovieChat+ | 04/2024 | [Code](https://github.com/rese1f/MovieChat) | arXiv |
| [**AutoAD III: The Prequel - Back to the Pixels**](https://arxiv.org/abs/2404.14412) | AutoAD III | 04/2024 | [Project Page](https://www.robots.ox.ac.uk/~vgg/research/autoad/) | CVPR |
| [**Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward**](https://arxiv.org/abs/2404.01258) | LLaVA-Hound-DPO | 04/2024 | [Code](https://github.com/RifleZhang/LLaVA-Hound-DPO) | arXiv |
| [**From Image to Video, What Do We Need in Multimodal LLMs?**](https://arxiv.org/abs/2404.11865) | RED-VILLM | 04/2024 | - | arXiv |
| [**Koala: Key Frame-conditioned Long Video-LLM**](https://arxiv.org/abs/2404.04346) | Koala | 04/2024 | [Project Page](https://cs-people.bu.edu/rxtan/projects/Koala) | CVPR |
| [**LongVLM: Efficient Long Video Understanding via Large Language Models**](https://arxiv.org/abs/2404.03384) | LongVLM | 04/2024 | [Code](https://github.com/ziplab/LongVLM) | ECCV |
| [**MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding**](https://arxiv.org/abs/2404.05726) | MA-LMM | 04/2024 | [Code](https://boheumd.github.io/MA-LMM/) | CVPR |
| [**MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens**](https://arxiv.org/abs/2404.03413) | MiniGPT4-Video | 04/2024 | [Code](https://vision-cair.github.io/MiniGPT4-video/) | arXiv |
| [**Pegasus-v1 Technical Report**](https://arxiv.org/abs/2404.14687) | Pegasus-v1 | 04/2024 | [Code]() | arXiv |
| [**PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning**](https://arxiv.org/abs/2404.16994) | PLLaVA | 04/2024 | [Code](https://pllava.github.io/) | arXiv |
| [**ST-LLM: Large Language Models Are Effective Temporal Learners**](https://arxiv.org/abs/2404.00308) | ST-LLM | 04/2024 | [Code](https://github.com/TencentARC/ST-LLM) | arXiv |
| [**Tarsier: Recipes for Training and Evaluating Large Video Description Models**](https://arxiv.org/abs/2407.00634) | Tarsier | 07/2024 | [Code](https://github.com/bytedance/tarsier) | arXiv |
| [**X-VARS: Introducing Explainability in Football Refereeing with Multi-modal Large Language Models**](https://arxiv.org/abs/2404.06332) | X-VARS | 04/2024 | [Code]() | arXiv |
| [**CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios**](https://arxiv.org/abs/2403.04640) | CAT | 03/2024 | [Code](https://github.com/rikeilong/Bay-CAT) | arXiv |
| [**InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding**](https://arxiv.org/abs/2403.15377) | InternVideo2 | 03/2024 | [Code](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2) | ECCV |
| [**MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies**](https://arxiv.org/abs/2403.01422) | MovieLLM | 03/2024 | [Code](https://deaddawn.github.io/MovieLLM/) | arXiv |
| [**LLMs Meet Long Video: Advancing Long Video Understanding with an Interactive Visual Adapter in LLMs**](https://arxiv.org/abs/2402.13546) | IVAwithLLM | 02/2024 | [Code]() | arXiv |
| [**LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding**](https://arxiv.org/abs/2402.16050) | LSTP | 02/2024 | [Code](https://github.com/bigai-nlco/VideoTGB) | EMNLP |
| [**LVCHAT: Facilitating Long Video Comprehension**](https://arxiv.org/abs/2402.12079) | LVCHAT | 02/2024 | [Code](https://github.com/wangyu-ustc/LVChat) | arXiv |
| [**OSCaR: Object State Captioning and State Change Representation**](https://arxiv.org/abs/2402.17128) | OSCaR | 02/2024 | [Code](https://github.com/nguyennm1024/OSCaR) | NAACL |
| [**Slot-VLM: SlowFast Slots for Video-Language Modeling**](https://arxiv.org/abs/2402.13088) | Slot-VLM | 02/2024 | [Code]() | arXiv |
| [**COSMO: Contrastive Streamlined Multimodal Model with Interleaved Pre-Training**](https://arxiv.org/abs/2401.00849) | COSMO | 01/2024 | [Code](http://fingerrec.github.io/cosmo) | arXiv |
| [**Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering**](https://arxiv.org/abs/2401.10711) | GCG | 01/2024 | [Code](https://github.com/WHB139426/GCG) | ACMMM |
| [**Audio-Visual LLM for Video Understanding**](https://arxiv.org/abs/2312.06720) | AV-LLM | 12/2023 | [Code]() | arXiv |
| [**Generative Multimodal Models are In-Context Learners**](https://arxiv.org/abs/2312.13286) | Emu2 | 12/2023 | [Project Page](https://baaivision.github.io/emu2) | CVPR |
| [**MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples**](https://arxiv.org/abs/2312.06363) | MMICT | 12/2023 | [Code](https://github.com/KDEGroup/MMICT) | TOMM |
| [**VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding**](https://arxiv.org/abs/2312.02310) | VaQuitA | 12/2023 | [Code]() | arXiv |
| [**VILA: On Pre-training for Visual Language Models**](https://arxiv.org/abs/2312.07533) | VILA | 12/2023 | [Code](https://github.com/NVlabs/VILA) | CVPR |
| [**Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens**](https://arxiv.org/abs/2312.08870) | Vista-LLaMA | 12/2023 | [Project Page](https://jinxxian.github.io/Vista-LLaMA) | arXiv |
| [**Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding**](https://arxiv.org/abs/2311.08046) | Chat-UniVi | 11/2023 | [Code](https://github.com/PKU-YuanGroup/Chat-UniVi) | CVPR |
| [**LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models**](https://arxiv.org/abs/2311.17043) | LLaMA-VID | 11/2023 | [Code](https://github.com/dvlab-research/LLaMA-VID) | arXiv |
| [**Video-LLaVA: Learning United Visual Representation by Alignment Before Projection**](https://arxiv.org/abs/2311.10122) | Video-LLaVA | 11/2023 | [Code](https://github.com/PKU-YuanGroup/Video-LLaVA) | arXiv |
| [**Large Language Models are Temporal and Causal Reasoners for Video Question Answering**](https://arxiv.org/abs/2310.15747) | LLaMA-VQA | 10/2023 | [Code](https://github.com/mlvlab/Flipped-VQA) | EMNLP |
| [**MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**](https://arxiv.org/abs/2307.16449) | MovieChat | 07/2023 | [Code](https://rese1f.github.io/MovieChat/) | CVPR |
| [**LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning**](https://arxiv.org/abs/2306.10354) | LLMVA-GEBC | 06/2023 | [Code](https://github.com/zjr2000/LLMVA-GEBC) | CVPR |
| [**Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration**](https://arxiv.org/abs/2306.09093) | Macaw-LLM | 06/2023 | [Project Page](https://github.com/lyuchenyang/Macaw-LLM) | arXiv |
| [**Valley: Video Assistant with Large Language Model Enhanced Ability**](https://arxiv.org/abs/2306.07207) | VALLEY | 06/2023 | [Code]() | arXiv |
| [**Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models**](https://arxiv.org/abs/2306.05424) | Video-ChatGPT | 06/2023 | [Code](https://github.com/mbzuai-oryx/Video-ChatGPT) | ACL |
| [**Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding**](https://arxiv.org/abs/2306.02858) | Video-LLaMA | 06/2023 | [Code](https://github.com/DAMO-NLP-SG/Video-LLaMA) | EMNLP |
| [**Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks**](https://arxiv.org/abs/2306.04362) | mPLUG-video | 06/2023 | [Code](https://github.com/X-PLUG/Youku-mPLUG) | arXiv |
| [**ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst**](https://arxiv.org/abs/2305.16103) | ChatBridge | 05/2023 | [Code](https://iva-chatbridge.github.io) | arXiv |
| [**Otter: A Multi-Modal Model with In-Context Instruction Tuning**](https://arxiv.org/abs/2305.03726) | Otter | 05/2023 | [Code](https://github.com/Luodian/Otter) | arXiv |
| [**VideoLLM: Modeling Video Sequence with Large Language Models**](https://arxiv.org/abs/2305.13292) | VideoLLM | 05/2023 | [Code](https://github.com/cg1177/VideoLLM) | arXiv |
| [**One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory**](https://arxiv.org/abs/2505.23617) | - | 05/2025 | [Code](https://github.com/RAIVNLab/trajvit) | ICCV 2025 |

##### LLM as Regressor

<!--
| [**Title**](Link) | Model | Date | [Code](Link) | Venue |
-->
| Title | Model | Date | Code | Venue |
| :---- | :----: | :----: | :----: | :----: |
| [**LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval**](https://arxiv.org/pdf/2411.14505) | LLaVA-MR | 11/2024 | [code]() | arXiv |
| [**Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM**](https://arxiv.org/abs/2406.12235) | Holmes-VAD | 06/2024 | [code](https://holmesvad.github.io/) | arXiv |
| [**VideoLLM-online: Online Video Large Language Model for Streaming Video**](https://arxiv.org/abs/2406.11816) | VideoLLM-online | 06/2024 | [code](https://showlab.github.io/videollm-online) | CVPR |
| [**Hand-Object Interaction Referral in Egocentric Vision**](https://arxiv.org/abs/2404.09933) | VLM4HOI | 04/2024 | [Project Page](https://sid2697.github.io/hoi-ref/) | arXiv |
| [**V2Xum-LLaMA: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning**](https://arxiv.org/abs/2404.12353) | V2Xum-LLaMA | 04/2024 | [code](https://hanghuacs.github.io/v2xum/) | arXiv |
| [**AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue**](https://arxiv.org/abs/2403.16276) | AVicuna | 03/2024 | [code]() | arXiv |
| [**Elysium: Exploring Object-level Perception in Videos via MLLM**](https://arxiv.org/abs/2403.16558) | Elysium | 03/2024 | [code](https://github.com/Hon-Wong/Elysium) | arXiv |
| [**HawkEye: Training Video-Text LLMs for Grounding Text in Videos**](https://arxiv.org/abs/2403.10228) | HawkEye | 03/2024 | [code](https://github.com/yellow-binary-tree/HawkEye) | arXiv |
| [**LITA: Language Instructed Temporal-Localization Assistant**](https://arxiv.org/abs/2403.19046) | LITA | 03/2024 | [code](https://github.com/NVlabs/LITA) | arXiv |
| [**OmniViD: A Generative Framework for Universal Video Understanding**](https://arxiv.org/abs/2403.17935) | OmniViD | 03/2024 | [code](https://github.com/wangjk666/OmniVid) | CVPR |
| [**GroundingGPT: Language Enhanced Multi-modal Grounding Model**](https://arxiv.org/abs/2401.06071) | GroundingGPT | 01/2024 | [code](https://github.com/lzw-lzw/GroundingGPT) | arXiv |
| [**TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**](https://arxiv.org/abs/2312.02051) | TimeChat | 12/2023 | [code](https://github.com/RenShuhuai-Andy/TimeChat) | CVPR |
| [**Self-Chained Image-Language Model for Video Localization and Question Answering**](https://arxiv.org/abs/2305.06988) | SeViLA | 11/2023 | [code](https://github.com/Yui010206/SeViLA) | NeurIPS |
| [**VTimeLLM: Empower LLM to Grasp Video Moments**](https://arxiv.org/abs/2311.18445) | VTimeLLM | 11/2023 | [code](https://github.com/huangb23/VTimeLLM) | arXiv |

##### LLM as Hidden Layer
| Title | Model | Date | Code | Venue |
| :---- | :----: | :----: | :----: | :----: |
| [**VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding**](https://arxiv.org/abs/2405.13382) | VTG-LLM | 05/2024 | [code](https://github.com/gyxxyg/VTG-LLM) | arXiv |
| [**VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing**](https://haofei.vip/downloads/papers/Skywork_Vitron_2024.pdf) | VITRON | 04/2024 | [Project Page](https://vitron-llm.github.io/) | NeurIPS |
| [**VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT**](https://arxiv.org/abs/2403.02076) | VTG-GPT | 03/2024 | [code](https://github.com/YoucanBaby/VTG-GPT) | arXiv |
| [**Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning**](https://arxiv.org/abs/2402.11435) | Momentor | 02/2024 | [code](https://github.com/DCDmllm/Momentor) | ICML |
| [**Detours for Navigating Instructional Videos**](https://arxiv.org/abs/2401.01823) | VidDetours | 01/2024 | [code]() | CVPR |
| [**OneLLM: One Framework to Align All Modalities with Language**](https://arxiv.org/abs/2312.03700) | OneLLM | 12/2023 | [code](https://github.com/csuhan/OneLLM) | arXiv |
| [**GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation**](https://arxiv.org/abs/2311.16511) | GPT4Video | 11/2023 | [code](https://gpt4video.github.io) | ACMMM |

#### 🧭 (Analyzer + Embedder) × LLM

##### LLM as Manager
| Title | Model | Date | Code | Venue |
| :---- | :----: | :----: | :----: | :----: |
| [**MM-VID: Advancing Video Understanding with GPT-4V(ision)**](https://arxiv.org/abs/2310.19773) | MM-VID | 10/2023 | - | arXiv |
##### LLM as Summarizer
| Title | Model | Date | Code | Venue |
| :---- | :----: | :----: | :----: | :----: |
| [**Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos**](https://arxiv.org/abs/2312.10300) | SUM-shot | 12/2023 | [Code](https://mingfei.info/shot2story/) | arXiv |
##### LLM as Regressor
| Title | Model | Date | Code | Venue |
| :---- | :----: | :----: | :----: | :----: |
| [**Vript: A Video Is Worth Thousands of Words**](https://arxiv.org/abs/2406.06040) | Vriptor | 06/2024 | [Code](https://github.com/mutonix/Vript) | NeurIPS |
| [**Merlin: Empowering Multimodal LLMs with Foresight Minds**](https://arxiv.org/abs/2312.00589) | Merlin | 12/2023 | [Project Page](https://ahnsun.github.io/merlin) | ECCV |
| [**VideoChat: Chat-Centric Video Understanding**](https://arxiv.org/abs/2305.06355) | VideoChat | 05/2023 | [Code](https://github.com/OpenGVLab/Ask-Anything) | arXiv |
| [**Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning**](https://arxiv.org/abs/2302.14115) | Vid2Seq | 02/2023 | [Code](https://antoyang.github.io/vid2seq.html) | CVPR |
##### LLM as Text Decoder
| Title | Model | Date | Code | Venue |
| :---- | :----: | :----: | :----: | :----: |
| [**Contextual AD Narration with Interleaved Multimodal Sequence**](https://arxiv.org/abs/2403.12922) | Uni-AD | 03/2024 | [Code](https://github.com/MCG-NJU/Uni-AD) | arXiv |
| [**MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning**](https://arxiv.org/abs/2311.17435) | MM-Narrator | 11/2023 | [Project Page](https://mm-narrator.github.io/) | arXiv |
| [**Vamos: Versatile Action Models for Video Understanding**](https://arxiv.org/abs/2311.13627) | Vamos | 11/2023 | [Project Page](https://brown-palm.github.io/Vamos/) | ECCV |
| [**AutoAD II: The Sequel - Who, When, and What in Movie Audio Description**](https://arxiv.org/abs/2310.06838) | AutoAD II | 10/2023 | [Project Page](https://www.robots.ox.ac.uk/vgg/research/autoad/) | ICCV |
##### LLM as Hidden Layer
| Title | Model | Date | Code | Venue |
| :---- | :----: | :----: | :----: | :----: |
| [**PG-Video-LLaVA: Pixel Grounding Large Video-Language Models**](https://arxiv.org/abs/2311.13435v2)[![Star](https://img.shields.io/github/stars/mbzuai-oryx/video-llava.svg?style=social&label=Star)](https://github.com/mbzuai-oryx/video-llava) | PG-Video-LLaVA | 11/2023 | [Code](https://github.com/mbzuai-oryx/video-llava) | arXiv |


### 🗒️ Taxonomy 2

#### 🤖 LLM-based Video Agents

| Title | Model | Date | Code | Venue |
| :---- | :----: | :----: | :----: | :----: |
| [**Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language**](https://arxiv.org/abs/2204.00598) | Socratic Models | 04/2022 | [Project Page](https://socraticmodels.github.io/) | arXiv |
| [**Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions**](https://arxiv.org/abs/2304.04227)[![Star](https://img.shields.io/github/stars/Vision-CAIR/ChatCaptioner.svg?style=social&label=Star)](https://github.com/Vision-CAIR/ChatCaptioner/tree/main/Video_ChatCaptioner) | Video ChatCaptioner | 04/2023 | [Code](https://github.com/Vision-CAIR/ChatCaptioner/tree/main/Video_ChatCaptioner) | arXiv |
| [**VLog: Video as a Long Document**](https://github.com/showlab/VLog)[![Star](https://img.shields.io/github/stars/showlab/VLog.svg?style=social&label=Star)](https://github.com/showlab/VLog) | VLog | 04/2023 | [Code](https://huggingface.co/spaces/TencentARC/VLog) | - |
| [**ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System**](https://arxiv.org/abs/2304.14407) | ChatVideo | 04/2023 | [Project Page](https://www.wangjunke.info/ChatVideo/) | arXiv |
| [**MM-VID: Advancing Video Understanding with GPT-4V(ision)**](https://arxiv.org/abs/2310.19773) | MM-VID | 10/2023 | - | arXiv |
| [**MISAR: A Multimodal Instructional System with Augmented Reality**](https://arxiv.org/abs/2310.11699v1)[![Star](https://img.shields.io/github/stars/nguyennm1024/misar.svg?style=social&label=Star)](https://github.com/nguyennm1024/misar) | MISAR | 10/2023 | [Project Page](https://github.com/nguyennm1024/misar) | ICCV |
| [**Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos**](https://arxiv.org/abs/2312.17117) | Grounding-Prompter | 12/2023 | - | arXiv |
| [**NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation**](https://arxiv.org/pdf/2402.15852) | NaVid | 02/2024 | [Project Page](https://pku-epic.github.io/NaVid/) | RSS |
| [**VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding**](https://arxiv.org/abs/2403.11481) | VideoAgent | 03/2024 | [Project Page](https://videoagent.github.io/) | arXiv |
[**VideoINSTA：通过LLM进行信息丰富的时空推理，实现零样本长视频理解**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.20365) |   VideoINSTA   | 09\u002F2024 |      [代码](https:\u002F\u002Fgithub.com\u002Fmayhugotong\u002FVideoINSTA)       | EMNLP |\n| [**Ego-R1：用于超长第一人称视角视频推理的工具链思维**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.13654) [![星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fegolife-ai\u002FEgo-R1.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fegolife-ai\u002FEgo-R1) |   Ego-R1代理   | 06\u002F2025 |      [代码](https:\u002F\u002Fgithub.com\u002Fegolife-ai\u002FEgo-R1)       | arXiv |\n\n\n#### 🎥 视频-LLM预训练\n\n| 标题                                                        |  模型  |  日期   |                        代码                        |  场所  |\n| :----------------------------------------------------------- | :-----: | :-----: | :------------------------------------------------: | :-----: |\n| [**从大型语言模型中学习视频表征**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.04501)[![星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Flavila?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flavila) | LaViLa  | 12\u002F2022 | [代码](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flavila) |  CVPR   |\n| [**Vid2Seq：用于密集视频字幕生成的大规模视觉语言模型预训练**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.14115) | Vid2Seq | 02\u002F2023 |  [代码](https:\u002F\u002Fantoyang.github.io\u002Fvid2seq.html)   |  CVPR   |\n| [**VAST：一个视觉-音频-字幕-文本全模态基础模型及数据集**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18500v1)[![星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftxh-mercury\u002Fvast?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ftxh-mercury\u002Fvast) |  VAST   | 05\u002F2023 |    [代码](https:\u002F\u002Fgithub.com\u002Ftxh-mercury\u002Fvast)     | NeurIPS |\n| [**Merlin：用预见性思维赋能多模态LLM**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.00589v1) | Merlin  | 12\u002F2023 |                         
-                          |  arXiv  |\n\n#### 👀 视频-LLM指令微调\n\n##### 使用连接适配器进行微调\n\n| 标题                                                        |     模型     | 日期   |                         代码                         | 场所 |\n| :----------------------------------------------------------- | :-----------: | :-----: | :--------------------------------------------------: | :---: |\n| [**Video-LLaMA：用于视频理解的指令微调视觉语言模型**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.02858) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideo-LLaMA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideo-LLaMA) |  Video-LLaMA  | 06\u002F2023 |  [code](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideo-LLaMA)  | arXiv |\n| [**VALLEY：具有大语言模型增强能力的视频助手**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.07207)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRupertLuo\u002FValley.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FRupertLuo\u002FValley) |    VALLEY     | 06\u002F2023 |     [code](https:\u002F\u002Fgithub.com\u002FRupertLuo\u002FValley)      |   -   |\n| [**Video-ChatGPT：通过大型视觉和语言模型实现详细视频理解**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05424)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FVideo-ChatGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT) | Video-ChatGPT | 06\u002F2023 | [code](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT) | arXiv |\n| [**Macaw-LLM：融合图像、音频、视频和文本的多模态语言建模**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.09093)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flyuchenyang\u002Fmacaw-llm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Flyuchenyang\u002Fmacaw-llm) |   Macaw-LLM   | 06\u002F2023 |   [code](https:\u002F\u002Fgithub.com\u002Flyuchenyang\u002Fmacaw-llm)   | arXiv |\n| 
[**LLMVA-GEBC：用于通用事件边界字幕生成的大语言模型与视频适配器**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.10354) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzjr2000\u002Fllmva-gebc.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzjr2000\u002Fllmva-gebc) |  LLMVA-GEBC   | 06\u002F2023 |    [code](https:\u002F\u002Fgithub.com\u002Fzjr2000\u002Fllmva-gebc)     | CVPR  |\n| [**Youku-mPLUG：用于预训练和基准测试的1000万规模中文视频-语言数据集**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.04362) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fx-plug\u002Fyouku-mplug.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fx-plug\u002Fyouku-mplug) |  mPLUG-video  | 06\u002F2023 |    [code](https:\u002F\u002Fgithub.com\u002Fx-plug\u002Fyouku-mplug)     | arXiv |\n| [**MovieChat：从密集标记到稀疏记忆，用于长视频理解**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16449)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frese1f\u002FMovieChat.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat) |   MovieChat   | 07\u002F2023 |     [code](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat)      | arXiv |\n| [**大语言模型是视频问答任务中的时序与因果推理者**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.15747)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmlvlab\u002FFlipped-VQA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmlvlab\u002FFlipped-VQA) |   LLaMA-VQA   | 10\u002F2023 |    [code](https:\u002F\u002Fgithub.com\u002Fmlvlab\u002FFlipped-VQA)     | EMNLP |\n| [**Video-LLaVA：通过投影前对齐学习统一视觉表征**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.10122v2.pdf)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FVideo-LLaVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA) |  Video-LLaVA  | 11\u002F2023 | [code](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA) | arXiv |\n| 
[**Chat-UniVi：统一视觉表征赋予大语言模型图像和视频理解能力**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17043v1.pdf)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpku-yuangroup\u002Fchat-univi.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fpku-yuangroup\u002Fchat-univi) |  Chat-UniVi   | 11\u002F2023 | [code](https:\u002F\u002Fgithub.com\u002Fpku-yuangroup\u002Fchat-univi)  | arXiv |\n| [**LLaMA-VID：在大语言模型中，一张图像胜过两个标记**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.08046v1)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLaMA-VID.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID) |  LLaMA-VID   | 11\u002F2023 | [code](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID)  | arXiv |\n| [**VISTA-LLAMA：通过与视觉标记等距实现可靠的视频解说员**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08870) |  VISTA-LLAMA   | 12\u002F2023 | - | arXiv |\n| [**用于视频理解的视听大语言模型**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06720) | - | 12\u002F2023 | - | arXiv |\n| [**AutoAD：上下文中的电影描述**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FHan_AutoAD_Movie_Description_in_Context_CVPR_2023_paper.pdf) |    AutoAD     | 06\u002F2023 |     [code](https:\u002F\u002Fgithub.com\u002FTengdaHan\u002FAutoAD)      | CVPR  |\n| [**AutoAD II：续集——电影音频描述中的谁、何时、何事**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fpapers\u002FHan_AutoAD_II_The_Sequel_-_Who_When_and_What_in_ICCV_2023_paper.pdf) |   AutoAD II   | 10\u002F2023 |                          -                           | ICCV  |\n| [**AutoAD III：前传——回到像素**](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fpublications\u002F2024\u002FHan24\u002Fhan24.pdf) |   AutoAD III   | 04\u002F2024 |                          -                           | CVPR  |\n| 
[**面向多模态大语言模型的细粒度视听联合表征**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.05863v2)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthe-anonymous-bs\u002Ffavor.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthe-anonymous-bs\u002Ffavor) |     FAVOR     | 10\u002F2023 |  [code](https:\u002F\u002Fgithub.com\u002Fthe-anonymous-bs\u002Ffavor)   | arXiv |\n| [**VideoLLaMA2：推进视频大语言模型中的时空建模和音频理解**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07476)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideoLLaMA2.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2) |     VideoLLaMA2     | 06\u002F2024 |  [code](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2)   | arXiv |\n| [**PAVE：修补和适配视频大语言模型**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.19794) |   PAVE   | 03\u002F2025 | [code](https:\u002F\u002Fgithub.com\u002Fdragonlzm\u002FPAVE) | CVPR |\n| [**将大型视觉-语言模型迁移到视频理解中的时间导向配方**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12605)|     时间配方     | 05\u002F2025 |  [code](https:\u002F\u002Fgithub.com\u002Fnguyentthong\u002Ftemporal_recipe)   | arXiv |\n\n\n##### 使用插入式适配器进行微调\n\n| 标题                                                        |  模型   |  日期   |                    代码                    | 场所 |\n| :----------------------------------------------------------- | :------: | :-----: | :----------------------------------------: | :---: |\n| [**Otter：一种具有上下文指令微调的多模态模型**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.03726v1)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fluodian\u002Fotter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fluodian\u002Fotter) |  Otter   | 06\u002F2023 |  [code](https:\u002F\u002Fgithub.com\u002Fluodian\u002Fotter)  | arXiv |\n| 
[**VideoLLM：利用大型语言模型建模视频序列**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13292)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcg1177\u002Fvideollm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fcg1177\u002Fvideollm) | VideoLLM | 05\u002F2023 | [code](https:\u002F\u002Fgithub.com\u002Fcg1177\u002Fvideollm) | arXiv |\n\n##### 使用混合适配器进行微调\n\n| 标题                                                        |   模型   |  日期   |                     代码                     | 场所 |\n| :----------------------------------------------------------- | :-------: | :-----: | :------------------------------------------: | :---: |\n| [**VTimeLLM：让大语言模型掌握视频瞬间**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.18445v1)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuangb23\u002Fvtimellm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhuangb23\u002Fvtimellm) | VTimeLLM  | 11\u002F2023 | [code](https:\u002F\u002Fgithub.com\u002Fhuangb23\u002Fvtimellm) | arXiv |\n| [**GPT4Video：用于遵循指令理解和安全生成的统一多模态大型语言模型**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16511v1) | GPT4Video | 11\u002F2023 |                      -                       | arXiv |\n\n#### 🦾 混合方法\n\n| 标题                                                        |        模型        |  日期   |                             代码                             | 场所 |\n| :----------------------------------------------------------- | :-----------------: | :-----: | :----------------------------------------------------------: | :---: |\n| [**VideoChat：以聊天为中心的视频理解**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06355)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FAsk-Anything.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything) |      VideoChat      | 05\u002F2023 | [code](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything)  
[demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fynhe\u002FAskAnything) | arXiv |\n| [**PG-Video-LLaVA：像素对齐的大规模视频-语言模型**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.13435v2)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002Fvideo-llava.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002Fvideo-llava) |   PG-Video-LLaVA    | 11\u002F2023 |      [code](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002Fvideo-llava)      | arXiv |\n| [**TimeChat：一种面向长时间视频理解的时间敏感型多模态大型语言模型**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.02051)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRenShuhuai-Andy\u002FTimeChat.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FRenShuhuai-Andy\u002FTimeChat) |      TimeChat       | 12\u002F2023 |     [code](https:\u002F\u002Fgithub.com\u002FRenShuhuai-Andy\u002FTimeChat)      | CVPR |\n| [**Video-GroundingDINO：迈向开放词汇的时空视频定位**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.00901.pdf)[![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTalalWasim\u002FVideo-GroundingDINO.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTalalWasim\u002FVideo-GroundingDINO) | Video-GroundingDINO | 12\u002F2023 |  [code](https:\u002F\u002Fgithub.com\u002FTalalWasim\u002FVideo-GroundingDINO)   | arXiv |\n| [**一段视频值4096个token：零样本下将视频转化为文本以实现理解**](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.608\u002F) | Video4096 | 05\u002F2023 | - | EMNLP |\n\n#### 💎 无需训练的方法\n\n| 标题                                                        |        模型        |  日期   | 代码 | 场所 |\n| :----------------------------------------------------------- | :-----------------: | :-----: | :--: | :---: |\n| [**超越训练：用于零样本视频理解的动态token合并**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14401) |   DyTo    | 11\u002F2024 |  [code](https:\u002F\u002Fgithub.com\u002FJam1ezhang\u002FDYTO)   | ICCV 2025 |\n| 
[**SlowFast-LLaVA：视频大型语言模型的强大无训练基线**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.15841) |   SlowFast-LLaVA    | 07\u002F2024 |  -   | arXiv |\n| [**TS-LLaVA：通过缩略图采样构建视觉token，用于无训练的视频大型语言模型**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11066) | TS-LLaVA | 11\u002F2024 | [code](https:\u002F\u002Fgithub.com\u002Ftingyu215\u002FTS-LLaVA) | arXiv |\n| [**声音能否通过token替换在LLaVA中替代视觉？**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.10416) | SoundCLIP | 08\u002F2025 | [code](https:\u002F\u002Fgithub.com\u002Fali-vosoughi\u002FSoundCLIP) | arXiv |\n| [**D-CoDe：通过动态压缩和问题分解，将图像预训练的VLM扩展到视频领域**](https:\u002F\u002Faclanthology.org\u002F2025.emnlp-main.597\u002F) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhukcc\u002FD-CoDe.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhukcc\u002FD-CoDe) | D-CoDe | 08\u002F2025 | [code](https:\u002F\u002Fgithub.com\u002Fhukcc\u002FD-CoDe) [项目页面](https:\u002F\u002Fhukcc.github.io\u002FD-CoDe\u002F) | EMNLP |\n\n---\n\n\n\n## 任务、数据集和基准测试\n\n#### 识别与预测\n\n| 名称               |                            论文                             | 日期 |                            链接                             |  场所  |\n| :----------------- | :----------------------------------------------------------: | :--: | :---------------------------------------------------------: | :-----: |\n| **Charades**       | [**Hollywood in homes: Crowdsourcing data collection for activity understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.01753v3) | 2016 |        [链接](http:\u002F\u002Fvuchallenge.org\u002Fcharades.html)         |  ECCV   |\n| **YouTube8M**      | [**YouTube-8M: A Large-Scale Video Classification Benchmark**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1609.08675v1) | 2016 | [链接](https:\u002F\u002Fresearch.google.com\u002Fyoutube8m\u002Fdownload.html) |    -    |\n| **ActivityNet**    | [**ActivityNet: A Large-Scale Video Benchmark for Human Activity 
Understanding**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_cvpr_2015\u002Fpapers\u002FHeilbron_ActivityNet_A_Large-Scale_2015_CVPR_paper.pdf) | 2015 |              [链接](http:\u002F\u002Factivity-net.org\u002F)               |  CVPR   |\n| **Kinetics-GEBC**  | [**GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.00486v4) | 2022 |         [链接](https:\u002F\u002Fgithub.com\u002Fshowlab\u002Fgeb-plus)         |  ECCV   |\n| **Kinetics-400**   | [**The Kinetics Human Action Video Dataset**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.06950) | 2017 |  [链接](https:\u002F\u002Fpaperswithcode.com\u002Fdataset\u002Fkinetics-400-1)  |    -    |\n| **VidChapters-7M** | [**VidChapters-7M: Video Chapters at Scale**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.13952) | 2023 |     [链接](https:\u002F\u002Fantoyang.github.io\u002Fvidchapters.html)     | NeurIPS |\n| **BlackSwanSuite** | [**Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05725) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsahithyaravi\u002FBlackSwan.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsahithyaravi\u002FBlackSwan) | 2025 |[链接](https:\u002F\u002Fblackswan.cs.ubc.ca) |                CVPR |\n\n#### 字幕与描述\n| 名称               |                            论文                             | 日期 |                            链接                             |  场所  |\n| :----------------- | :----------------------------------------------------------: | :--: | :---------------------------------------------------------: | :-----: |\n|**Microsoft Research Video Description Corpus (MSVD)**|[**Collecting Highly Parallel Data for Paraphrase 
Evaluation**](https:\u002F\u002Faclanthology.org\u002FP11-1020.pdf)|2011|[链接](https:\u002F\u002Fwww.cs.utexas.edu\u002Fusers\u002Fml\u002Fclamp\u002FvideoDescription\u002F#data)|ACL|\n|**Microsoft Research Video-to-Text (MSR-VTT)**|[**MSR-VTT: A Large Video Description Dataset for Bridging Video and Language**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_cvpr_2016\u002Fpapers\u002FXu_MSR-VTT_A_Large_CVPR_2016_paper.pdf)|2016|[链接](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpublication\u002Fmsr-vtt-a-large-video-description-dataset-for-bridging-video-and-language\u002F)|CVPR|\n|**Tumblr GIF (TGIF)**|[**TGIF: A New Dataset and Benchmark on Animated GIF Description**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.02748v2)|2016|[链接](https:\u002F\u002Fgithub.com\u002Fraingo\u002FTGIF-Release)|CVPR|\n|**Charades**|[**Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.01753v3)|2016|[链接](https:\u002F\u002Fprior.allenai.org\u002Fprojects\u002Fcharades)|ECCV|\n|**Charades-Ego**|[**Actor and Observer: Joint Modeling of First and Third-Person Videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1804.0962)|2018|[链接](https:\u002F\u002Fprior.allenai.org\u002Fprojects\u002Fcharades-ego)|CVPR|\n|**ActivityNet Captions**|[**Dense-Captioning Events in Videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.00754)|2017|[链接](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Franjaykrishna\u002Fdensevid\u002F)|ICCV|\n|**HowTo100m**|[**HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.03327)|2019|[链接](https:\u002F\u002Fwww.di.ens.fr\u002Fwillow\u002Fresearch\u002Fhowto100m\u002F)|ICCV|\n|**Movie Audio Descriptions (MAD)**|[**MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio 
Descriptions**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.00431)|2021|[链接](https:\u002F\u002Fgithub.com\u002FSoldelli\u002FMAD)|CVPR|\n|**YouCook2**|[**Towards Automatic Learning of Procedures from Web Instructional Videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1703.09788)|2017|[链接](http:\u002F\u002Fyoucook2.eecs.umich.edu\u002F)|AAAI|\n|**MovieNet**|[**MovieNet: A Holistic Dataset for Movie Understanding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.10937)|2020|[链接](https:\u002F\u002Fmovienet.github.io\u002F)|ECCV|\n|**Youku-mPLUG**|[**Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.04362)|2023|[链接](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FYouku-mPLUG)|arXiv|\n|**Video Timeline Tags (ViTT)**|[**Multimodal Pretraining for Dense Video Captioning**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.11760)|2020|[链接](https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002FVideo-Timeline-Tags-ViTT)|AACL-IJCNLP|\n|**TVSum**|[**TVSum: Summarizing web videos using titles**](https:\u002F\u002Fwww.cv-foundation.org\u002Fopenaccess\u002Fcontent_cvpr_2015\u002Fpapers\u002FSong_TVSum_Summarizing_Web_2015_CVPR_paper.pdf)|2015|[链接](https:\u002F\u002Fgithub.com\u002Fyalesong\u002Ftvsum)|CVPR|\n|**SumMe**|[**Creating Summaries from User Videos**](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FCreating-Summaries-from-User-Videos-Gygli-Grabner\u002F799bf307438ec2171e6f0bd5b8040f678d5b28da)|2014|[链接](http:\u002F\u002Fwww.vision.ee.ethz.ch\u002F~gyglim\u002Fvsum\u002F)|ECCV|\n|**VideoXum**|[**VideoXum: Cross-modal Visual and Textural Summarization of Videos**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12060)|2023|[链接](https:\u002F\u002Fvideoxum.github.io\u002F)|IEEE Trans Multimedia|\n|**Multi-Source Video Captioning (MSVC)**|[**VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in 
Video-LLMs**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07476)|2024|[链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDAMO-NLP-SG\u002FMulti-Source-Video-Captioning)|arXiv|\n\n#### 现实场景理解与检索\n| 名称               |                            论文                             | 日期 |                            链接                             | 会议   |\n| :----------------- | :----------------------------------------------------------: | :--: | :---------------------------------------------------------: | :-----: |\n|**Epic-Kitchens-100**|[**重新定义第一人称视角视觉任务**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.13256v4)|2021|[链接](https:\u002F\u002Fepic-kitchens.github.io\u002F2021)|IJCV|\n|**VCR（视觉常识推理）**|[**从识别到认知：视觉常识推理**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.10830v2)|2019|[链接](https:\u002F\u002Fvisualcommonsense.com\u002F)|CVPR|\n|**Ego4D-MQ 和 Ego4D-NLQ**|[**Ego4D：全球3000小时的第一人称视频及其第一人称感知基准套件**](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Fego4d-unscripted-first-person-video-from-around-the-world-and-a-benchmark-suite-for-egocentric-perception\u002F)|2021|[链接](https:\u002F\u002Fego4d-data.org\u002F)|CVPR|\n|**Vid-STG**|[**它在哪里存在？面向多形式句子的时空视频定位**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2001.06891)|2020|[链接](https:\u002F\u002Fgithub.com\u002FGuaranteer\u002FVidSTG-Dataset)|CVPR|\n|**Charades-STA**|[**TALL：基于语言查询的时序动作定位**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.02101)|2017|[链接](https:\u002F\u002Fgithub.com\u002Fjiyanggao\u002FTALL)|ICCV|\n|**DiDeMo**|[**利用自然语言在视频中定位时刻**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1708.01641)|2017|[链接](https:\u002F\u002Fgithub.com\u002FLisaAnne\u002FTemporalLanguageRelease)|ICCV|\n\n#### 问答任务\n| 名称               |                            论文                             | 日期 |                            链接                             | 会议   |\n| :----------------- | :----------------------------------------------------------: | :--: | :---------------------------------------------------------: | 
:-----: |\n|**MSVD-QA**|[**通过逐步细化的外观与运动注意力进行视频问答**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3123266.3123427)|2017|[链接](https:\u002F\u002Fgithub.com\u002Fxudejing\u002Fvideo-question-answering)|ACM Multimedia|\n|**MSRVTT-QA**|[**通过逐步细化的外观与运动注意力进行视频问答**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3123266.3123427)|2017|[链接](https:\u002F\u002Fgithub.com\u002Fxudejing\u002Fvideo-question-answering)|ACM Multimedia|\n|**TGIF-QA**|[**TGIF-QA：迈向视觉问答中的时空推理**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1704.04497)|2017|[链接](https:\u002F\u002Fgithub.com\u002FYunseokJANG\u002Ftgif-qa)|CVPR|\n|**ActivityNet-QA**|[**ActivityNet-QA：通过问答理解复杂网络视频的数据集**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.02467)|2019|[链接](https:\u002F\u002Fgithub.com\u002FMILVLG\u002Factivitynet-qa)|AAAI|\n|**Pororo-QA**|[**DeepStory：基于深度嵌入记忆网络的视频故事问答**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.00836)|2017|[链接](https:\u002F\u002Fgithub.com\u002FKyung-Min\u002FPororoQA)|IJCAI|\n|**TVQA**|[**TVQA：局部化、组合式的视频问答**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1809.01696)|2018|[链接](https:\u002F\u002Ftvqa.cs.unc.edu\u002F)|EMNLP|\n|**MAD-QA**|[**为长视频问答编码与控制全局语义**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.19723)|2024|[链接](https:\u002F\u002Fgithub.com\u002Fzhiyuanhubj\u002Flong_form_videoqa)|EMNLP|\n|**Ego-QA**|[**为长视频问答编码与控制全局语义**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.19723)|2024|[链接](https:\u002F\u002Fgithub.com\u002Fzhiyuanhubj\u002Flong_form_videoqa)|EMNLP|\n| **BlackSwanSuite** | [**黑天鹅：不可预测事件中的溯因与可废止视频推理**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05725) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsahithyaravi\u002FBlackSwan.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsahithyaravi\u002FBlackSwan) | 2025 | [链接](https:\u002F\u002Fblackswan.cs.ubc.ca) | CVPR |\n| **CrossVid** | [**CrossVid：用于评估多模态大型语言模型跨视频推理能力的综合基准**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.12263) | 2025 | 
[链接](https:\u002F\u002Fgithub.com\u002Fchuntianli666\u002FCrossVid) | AAAI |\n\n#### 视频指令微调\n##### 预训练数据集\n| 名称               |                            论文                             | 日期 |                            链接                             | 会议   |\n| :----------------- | :----------------------------------------------------------: | :--: | :---------------------------------------------------------: | :-----: |\n|**VidChapters-7M**|[**VidChapters-7M：大规模视频章节数据集**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.13952)|2023|[链接](https:\u002F\u002Fantoyang.github.io\u002Fvidchapters.html)|NeurIPS|\n|**VALOR-1M**|[**VALOR：视觉-音频-语言全感知预训练模型及数据集**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08345)|2023|[链接](https:\u002F\u002Fgithub.com\u002FTXH-mercury\u002FVALOR)|arXiv|\n|**Youku-mPLUG**|[**Youku-mPLUG：用于预训练和基准测试的1000万规模中文视频-语言数据集**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.04362)|2023|[链接](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FYouku-mPLUG)|arXiv|\n|**InternVid**|[**InternVid：用于多模态理解和生成的大规模视频-文本数据集**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06942)|2023|[链接](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVideo\u002Ftree\u002Fmain\u002FData\u002FInternVid)|arXiv|\n|**VAST-27M**|[**VAST：视觉-音频-字幕-文本全模态基础模型及数据集**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18500)|2023|[链接](https:\u002F\u002Fgithub.com\u002FTXH-mercury\u002FVAST)|NeurIPS|\n\n##### 微调数据集\n| 名称               |                            论文                             | 日期 |                            链接                             | 会议   |\n| :----------------- | :----------------------------------------------------------: | :--: | :---------------------------------------------------------: | :-----: 
|\n|**MIMIC-IT**|[**MIMIC-IT：多模态上下文指令微调**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05425)|2023|[链接](https:\u002F\u002Fgithub.com\u002Fluodian\u002Fotter)|arXiv|\n|**VideoInstruct100K**|[**Video-ChatGPT：借助大型视觉和语言模型实现精细视频理解**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05424)|2023|[链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FVideoInstruct-100K)|arXiv|\n|**TimeIT**|[**TimeChat：用于长视频理解的时敏型多模态大型语言模型**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.02051)|2023|[链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FShuhuaiRen\u002FTimeIT)|CVPR|\n\n#### 基于视频的大型语言模型基准\n\n| 标题                                                        | 日期   |                            代码                            |              场所               |\n| :----------------------------------------------------------- | :-----: | :--------------------------------------------------------: | :------------------------------: |\n| [**LVBench：极端长视频理解基准**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.08035) | 2024年6月 |    [代码](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FLVBench)    |                -                 |\n| [**Video-Bench：评估基于视频的大语言模型的综合基准与工具包**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16103) | 2023年11月 |    [代码](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-Bench)    |                -                 |\n| [**Perception Test：多模态视频模型的诊断性基准测试**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13786) | 2023年5月 | [代码](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Fperception_test) | NeurIPS 2023、ICCV 2023研讨会 |\n| [**Youku-mPLUG：用于预训练和基准测试的1000万规模中文视频-语言数据集**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.04362v1) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fx-plug\u002Fyouku-mplug.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fx-plug\u002Fyouku-mplug) | 2023年7月 |       [代码](https:\u002F\u002Fgithub.com\u002Fx-plug\u002Fyouku-mplug)        |                -                 |\n| 
[**FETV：开放域文本到视频生成的细粒度评估基准**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.01813) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fllyx97\u002Ffetv.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fllyx97\u002Ffetv) | 2023年11月 |           [代码](https:\u002F\u002Fgithub.com\u002Fllyx97\u002Ffetv)           |                NeurIPS 2023                 |\n| [**MoVQA：面向长篇电影理解的多功能问答基准**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04817) | 2023年12月 |         [代码](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FMoVQA)         |                -                 |\n| [**MVBench：全面的多模态视频理解基准**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17005) | 2023年12月 |     [代码](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything)      |                -                 |\n| [**TempCompass：视频大语言模型真的能理解视频吗？**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.00476) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fllyx97\u002FTempCompass.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fllyx97\u002FTempCompass) | 2024年3月 |           [代码](https:\u002F\u002Fgithub.com\u002Fllyx97\u002FTempCompass)           |                ACL 2024                 |\n| [**Video-MME：首个针对多模态大语言模型在视频分析中进行全面评估的基准**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.21075) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FVideo-MME.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME) | 2024年6月 |           [代码](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME)           |                -                 |\n| [**VideoHallucer：评估大型视频-语言模型中的内在与外在幻觉**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16338) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpatrick-tssn\u002FVideoHallucer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fpatrick-tssn\u002FVideoHallucer) | 2024年6月 |           
[代码](https:\u002F\u002Fgithub.com\u002Fpatrick-tssn\u002FVideoHallucer)           |                -                 |\n| [**Black Swan：不可预测事件中的溯因与可废止视频推理**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05725) [![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsahithyaravi\u002FBlackSwan.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsahithyaravi\u002FBlackSwan) | 2025年6月 |           [代码](https:\u002F\u002Fgithub.com\u002Fsahithyaravi\u002FBlackSwan)           |                CVPR 2025             |\n| [**能否让视频多模态模型像怀疑论者一样思考——或加倍下注：关于可废止视频蕴涵的研究**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.22385) | 2025年8月 |          -          |                -             |\n| [**CrossVid：评估多模态大语言模型跨视频推理的综合基准**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.12263) | 2025年11月 |           [代码](https:\u002F\u002Fgithub.com\u002Fchuntianli666\u002FCrossVid)           |                AAAI 2026             |\n| [**MVU-Eval：迈向多视频理解的多模态大语言模型评估**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.07250) | 2025年11月 | [代码](https:\u002F\u002Fgithub.com\u002FNJU-LINK\u002FMVU-Eval) | NeurIPS DB |\n| [**OmniVideoBench：迈向全模态大语言模型的视听理解评估**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.10689) | 2025年10月 | [代码](https:\u002F\u002Fgithub.com\u002FNJU-LINK\u002FOmniVideoBench) | arXiv |\n| [**IF-VidCap：视频字幕模型能否遵循指令？**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.18726) | 2025年10月 | [代码](https:\u002F\u002Fgithub.com\u002FNJU-LINK\u002FIF-VidCap) | arXiv |\n\n\n\n## 贡献\n\n我们欢迎所有人参与本仓库的贡献，共同提升其质量。您可以提交拉取请求，以添加新的论文、项目及有用资料，或更正您发现的任何错误。请确保您的拉取请求遵循“标题|模型|日期|代码|场所”的格式。感谢您的宝贵贡献！\n\n\n### 🌟 星标历史\n\n[![星标历史图表](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyunlong10_Awesome-LLMs-for-Video-Understanding_readme_a5c6fbbc7e64.png)](https:\u002F\u002Fstar-history.com\u002F#yunlong10\u002FAwesome-LLMs-for-Video-Understanding&Date)\n\n### ♥️ 贡献者\n\n没有这些了不起的人的贡献，我们的项目根本不可能实现！感谢大家让这个项目变得更好。\n\n[Yolo Y. 
Tang](https:\u002F\u002Fyunlong10.github.io) @ 罗切斯特大学 \\\n[Jing Bi](https:\u002F\u002Fjing-bi.github.io) @ 罗切斯特大学 \\\n[Siting Xu](https:\u002F\u002Fsai-01.github.io) @ 南方科技大学 \\\n[Luchuan Song](https:\u002F\u002Fsongluchuan.github.io) @ 罗切斯特大学 \\\n[Susan Liang](https:\u002F\u002Fliangsusan-git.github.io) @ 罗切斯特大学 \\\n[Teng Wang](http:\u002F\u002Fttengwang.com\u002F) @ 香港大学 \\\n[Daoan Zhang](https:\u002F\u002Fdwan.ch) @ 罗切斯特大学 \\\n[Jie An](https:\u002F\u002Fpkuanjie.com) @ 罗切斯特大学 \\\n[Jingyang Lin](https:\u002F\u002Fjylin.me) @ 罗切斯特大学 \\\n[Rongyi Zhu](https:\u002F\u002Frongyizhu.github.io) @ 罗切斯特大学 \\\n[Ali Vosoughi](https:\u002F\u002Falivosoughi.com) @ 罗切斯特大学 \\\n[Chao Huang](https:\u002F\u002Fwikichao.github.io) @ 罗切斯特大学 \\\n[Zeliang Zhang](https:\u002F\u002Fzhangaipi.github.io) @ 罗切斯特大学 \\\n[Pinxin Liu](https:\u002F\u002Fandypinxinliu.github.io) @ 罗切斯特大学 \\\n[Mingqian Feng](https:\u002F\u002Ffmmarkmq.github.io) @ 罗切斯特大学 \\\n[Feng Zheng](https:\u002F\u002Fwww.sustech.edu.cn\u002Fen\u002Ffaculties\u002Fzhengfeng.html) @ 南方科技大学 \\\n[Jianguo Zhang](https:\u002F\u002Ffaculty.sustech.edu.cn\u002F?tagid=zhangjg&iscss=1&snapid=1&orderby=date&go=2&lang=en) @ 南方科技大学 \\\n[Ping Luo](http:\u002F\u002Fluoping.me) @ 香港大学 \\\n[Jiebo Luo](https:\u002F\u002Fwww.cs.rochester.edu\u002Fu\u002Fjluo\u002F) @ 罗切斯特大学 \\\n[Chenliang Xu](https:\u002F\u002Fwww.cs.rochester.edu\u002F~cxu22\u002Findex.html) @ 罗切斯特大学 \n\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyunlong10_Awesome-LLMs-for-Video-Understanding_readme_cba89f61381f.png\" \u002F>\n\u003C\u002Fa>","# Awesome-LLMs-for-Video-Understanding 快速上手指南\n\n本项目并非单一的可执行软件，而是一个**精选资源列表（Awesome List）**，汇集了基于大语言模型（LLM）的视频理解（Vid-LLMs）领域的最新论文、模型代码、数据集和基准测试。本指南将帮助您快速定位所需模型并运行示例代码。\n\n## 环境准备\n\n由于列表中包含了数十个不同的模型（如 LLaVA, VideoTree, AntGPT 等），具体依赖因模型而异。但大多数 Vid-LLM 项目共享以下基础环境要求：\n\n*   
**操作系统**: Linux (推荐 Ubuntu 20.04+) 或 macOS\n*   **Python**: 3.8 或更高版本 (推荐 3.10)\n*   **GPU**: NVIDIA GPU (显存建议 16GB+ 以运行大型多模态模型)，需安装 CUDA (11.8 或 12.1)\n*   **包管理器**: `pip` 或 `conda`\n\n**通用前置依赖安装：**\n在克隆具体模型仓库前，建议先配置基础深度学习环境。\n\n```bash\n# 创建虚拟环境\nconda create -n vid-llm python=3.10 -y\nconda activate vid-llm\n\n# 安装 PyTorch (根据官方源选择对应 CUDA 版本)\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n\n# 安装通用依赖 (transformers, accelerate 等)\npip install transformers accelerate sentencepiece protobuf einops opencv-python\n```\n\n> 💡 **国内加速建议**：\n> 推荐使用清华或阿里镜像源加速 Python 包安装：\n> `pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple \u003Cpackage_name>`\n\n## 安装步骤\n\n由于本项目是资源索引，您需要根据需求选择具体的模型进行安装。以下以列表中热门的 **LLaVA** (作为基础架构代表) 和 **VideoTree** (长视频理解代表) 为例。\n\n### 1. 获取资源列表\n首先克隆本 Awesome 列表仓库，查阅最新的模型清单：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding.git\ncd Awesome-LLMs-for-Video-Understanding\n```\n*请查阅目录中的 `Taxonomy` 表格，找到您感兴趣的模型及其对应的 Code 链接。*\n\n### 2. 安装具体模型示例 (以 LLaVA 为例)\n大多数模型提供独立的 GitHub 仓库。\n\n```bash\n# 克隆模型仓库 (示例为 LLaVA)\ngit clone https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA.git\ncd LLaVA\n\n# 安装模型特定依赖\npip install -e .\n```\n\n### 3. 下载预训练权重\n大多数模型权重托管在 Hugging Face。国内用户建议使用 **ModelScope (魔搭社区)** 或配置 Hugging Face 镜像。\n\n```bash\n# 使用 huggingface-cli 下载 (需配置 HF_ENDPOINT)\nexport HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\nhuggingface-cli download --resume-download liuhaotian\u002Fllava-v1.5-7b --local-dir .\u002Fcheckpoints\u002Fllava-v1.5-7b\n```\n\n## 基本使用\n\n不同模型的调用方式略有不同，通常分为 **命令行推理** 和 **Python API 调用** 两种方式。以下是基于典型 Vid-LLM 结构的通用使用流程。\n\n### 场景一：视频内容问答 (Video QA)\n\n假设您已准备好视频文件 `input_video.mp4` 和模型权重。\n\n**Python 脚本示例：**\n\n```python\nimport torch\nfrom transformers import AutoProcessor, LlavaForConditionalGeneration\nfrom PIL import Image\nimport cv2\n\n# 1. 
加载模型 (路径替换为您下载的实际模型路径)\nmodel_path = \".\u002Fcheckpoints\u002Fllava-v1.5-7b\"\nmodel = LlavaForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16, device_map=\"auto\")\nprocessor = AutoProcessor.from_pretrained(model_path)\n\n# 2. 预处理视频 (简化示例：抽取关键帧)\n# 实际项目中请使用该模型推荐的采样策略 (如均匀采样或动作检测采样)\ncap = cv2.VideoCapture(\"input_video.mp4\")\nframes = []\nfor _ in range(8): # 抽取 8 帧\n    ret, frame = cap.read()\n    if not ret: break\n    frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))\ncap.release()\n\n# 3. 构建提示词 (注意：\u003Cimage> 标记的数量必须与传入的帧数一一对应，否则 transformers 会报错)\nprompt = \"USER: \" + \"\u003Cimage>\\n\" * len(frames) + \"Describe the main actions happening in this video in detail.\\nASSISTANT:\"\n\n# 4. 生成回答\ninputs = processor(text=prompt, images=frames, return_tensors=\"pt\").to(model.device, torch.float16)\n\nwith torch.inference_mode():\n    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)\n\nresult = processor.decode(output_ids[0], skip_special_tokens=True)\nprint(result)\n```\n\n### 场景二：长视频分析 (Long-Form Understanding)\n\n对于列表中提到的 **VideoTree** 或 **DrVideo** 等专门处理长视频的模型，通常需要运行其提供的专用脚本。\n\n```bash\n# 进入具体模型目录 (示例)\ncd ..\u002FVideoTree\n\n# 运行推理脚本 (参数参考各模型 README)\npython inference.py \\\n    --video_path .\u002Flong_video.mp4 \\\n    --model_name videotree-7b \\\n    --question \"What is the plot twist in the second half of the video?\" \\\n    --output_dir .\u002Fresults\n```\n\n### 下一步建议\n1.  **查阅 Survey 论文**：阅读项目首页链接的 [arXiv 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17432)，了解不同架构（如 `Video Analyzer × LLM` vs `Video Embedder × LLM`）的适用场景。\n2.  **浏览 Benchmark**：在仓库的 `Tasks, Datasets, and Benchmarks` 章节查找适合您任务的数据集（如 ActivityNet-QA, MSVD-QA）。\n3.  
**贡献代码**：如果您复现了新模型，欢迎通过 Pull Request 更新此 Awesome 列表。","某视频内容审核团队正试图构建一个能理解复杂长视频情节、自动识别违规行为的智能系统，但面对飞速发展的多模态大模型技术感到无从下手。\n\n### 没有 Awesome-LLMs-for-Video-Understanding 时\n- **文献检索如大海捞针**：团队成员需分散在 arXiv、GitHub 和各大学术会议网站手动搜索，难以及时获取最新的 Vid-LLM 论文与代码，导致技术选型滞后。\n- **模型分类混乱不清**：面对数百个新模型，缺乏统一的分类标准（如基于视频表征或 LLM 功能），难以判断哪些架构适合处理长时序依赖或细粒度动作识别。\n- **数据与基准匹配困难**：不清楚哪些数据集支持特定的推理任务，也找不到权威的评测基准来验证自研模型的性能，重复造轮子现象严重。\n- **训练策略盲目试错**：缺乏对适配器微调、全量训练等策略的系统性总结，团队在资源有限的情况下浪费大量算力进行无效实验。\n\n### 使用 Awesome-LLMs-for-Video-Understanding 后\n- **一站式资源聚合**：直接查阅该仓库整理的最新综述、百余个模型代码链接及 15+ 新基准，半天内即可完成从技术调研到方案选定的全过程。\n- **清晰的技术导航**：利用其提出的新颖分类体系，快速锁定适合“长视频逻辑推理”任务的模型架构，大幅缩短技术验证周期。\n- **精准的数据与评测对接**：通过关联的任务 - 数据集 - 基准映射表，迅速找到适配的监控视频数据集和评估指标，确保实验结果具有可比性。\n- **高效的训练路径规划**：参考仓库中关于训练策略的深度章节，直接复用成熟的微调方案，避免了盲目的超参数搜索，显著降低研发成本。\n\nAwesome-LLMs-for-Video-Understanding 将碎片化的前沿研究转化为结构化的工程指南，让视频理解大模型的开发从“盲目探索”转向“高效落地”。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyunlong10_Awesome-LLMs-for-Video-Understanding_dbfaa4df.png","yunlong10","Yolo Y. Tang","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fyunlong10_816c9ce6.jpg","Ph.D. 
Student @ UR CS",null,"yoloytang","yunlong10.github.io","https:\u002F\u002Fgithub.com\u002Fyunlong10",3153,142,"2026-04-18T11:26:48",5,"","未说明",{"notes":90,"python":88,"dependencies":91},"该仓库是一个综述列表（Awesome List），主要整理了用于视频理解的大语言模型（Vid-LLMs）相关的论文、数据集和基准测试，并非一个可直接运行的单一软件工具。因此，README 中未提供具体的运行环境需求。用户若需运行列表中提到的具体模型（如 LLoVi, VideoTree, AntGPT 等），需前往各模型对应的独立代码仓库查阅其特定的环境配置要求。",[],[15,35,62],"2026-03-27T02:49:30.150509","2026-04-19T03:17:41.638729",[96,101,106,111,116,121],{"id":97,"question_zh":98,"answer_zh":99,"source_url":100},41454,"如何向该仓库提交新的论文或代码资源？","维护者建议用户直接发起拉取请求（Pull Request），而不是仅通过 Issue 提出。您可以将新资源添加到您认为最合适的章节（例如“免训练方法”或“微调”部分），以便团队快速审查和合并。","https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding\u002Fissues\u002F34",{"id":102,"question_zh":103,"answer_zh":104,"source_url":105},41455,"提交的综述论文会被收录到哪些地方？","相关的综述论文会被添加到项目维护的综述论文（survey paper）的“相关综述”部分，并包含在后续的 arXiv 版本更新中。但请注意，如果期刊版本（如 IEEE TCSVT）已经正式接收，则无法再修改该已发表版本。","https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding\u002Fissues\u002F22",{"id":107,"question_zh":108,"answer_zh":109,"source_url":110},41456,"为什么我在 GitHub 仓库列表中还没看到刚被确认收录的论文？","最新的调查论文版本可能已经包含了您的工作，但 GitHub 仓库的文件列表更新会有延迟。维护团队会逐步更新仓库内容，请耐心等待同步。","https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding\u002Fissues\u002F11",{"id":112,"question_zh":113,"answer_zh":114,"source_url":115},41457,"该项目主要收录哪些类型的视频理解资源？","该项目主要收录与视频大语言模型（Video-LLM）相关的资源，涵盖多个类别，包括：免训练方法（Training-free Methods）、带连接适配器的微调方法（Fine-tuning with Connective Adapters）、基于轨迹令牌的视频编码器、视频理解基准测试（Benchmarks）以及多视频\u002F音视频评估工具。","https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding\u002Fissues\u002F32",{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},41458,"如何提交关于视频异常检测或特定领域的新研究？","您可以直接通过 Issue 分享您的论文链接（如开放世界视频异常检测研究）。只要工作与视频 LLM 
评估或理解高度相关，维护者确认后就会将其添加到列表中。","https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding\u002Fissues\u002F27",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},41459,"是否接受关于视频评估基准（Benchmark）的投稿？","是的，项目欢迎添加新的视频理解基准测试论文。例如针对多视频理解评估（MVU-Eval）、音视频全能模型评估（OmniVideoBench）或指令遵循能力评估（IF-VidCap）的工作，均可被收录。","https:\u002F\u002Fgithub.com\u002Fyunlong10\u002FAwesome-LLMs-for-Video-Understanding\u002Fissues\u002F31",[]]