[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Vision-CAIR--MiniGPT4-video":3,"tool-Vision-CAIR--MiniGPT4-video":65},[4,17,27,36,44,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",153609,2,"2026-04-13T11:34:59",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 
是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[13,26,14,35],"视频",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":10,"last_commit_at":42,"category_tags":43,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 
将是理想的起点。",85092,"2026-04-10T11:13:16",[26,52,35,53,14,54,15,13,55],"数据工具","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,52,54],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":83,"stars":96,"forks":97,"last_commit_at":98,"license":99,"difficulty_score":100,"env_os":101,"env_gpu":102,"env_ram":103,"env_deps":104,"category_tags":116,"github_topics":117,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":122,"updated_at":123,"faqs":124,"releases":150},7151,"Vision-CAIR\u002FMiniGPT4-video","MiniGPT4-video","Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding ","MiniGPT4-video 是一款专注于短视频理解的多模态大语言模型，同时也是处理长视频任务的金鱼（Goldfish）模型的重要组件。它主要解决了现有 AI 在处理视频时面临的两大难题：一是难以高效分析长达数分钟甚至数小时的视频内容，二是无法有效过滤视频中的冗余信息并精准定位关键片段。\n\n通过引入“交错视觉 - 文本令牌”技术，MiniGPT4-video 能够为视频片段生成详尽的描述，不仅自身在短视频问答任务中表现卓越，超越了多个主流基准测试的最优结果，还作为核心检索机制助力 Goldfish 
模型实现对任意长度视频（如整部电影或剧集）的高效理解。系统会先利用该技术快速筛选出与用户指令最相关的视频片段，再进行深度分析，从而大幅降低计算成本并提升回答准确率。\n\n这款工具非常适合人工智能研究人员、开发者以及需要处理海量视频数据的企业团队使用。无论是希望探索长视频理解前沿算法的学者，还是致力于开发智能视频摘要、剧情问答应用的工程师，都能从中获得强大的技术支持。目前，相关代码、论文及在线演示均已开源，欢迎各界人士体验与交流。","# [ECCV 2024 Accepted]Goldfish: Vision-Language Understanding of Arbitrarily Long Videos\n# [CVPR2024W]MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens\n**This repo contains the codes for MiniGPT4-video for short video understanding and Goldfish for long video understanding.**\n\u003Ch3 style=\"text-align: center;\">Online Demos\u003C\u002Fh3>\n\u003Cdiv style=\"display: flex; justify-content: center; gap: 40px;\">\n    \u003Cdiv style=\"text-align: center;\">\n        \u003Ca href='https:\u002F\u002Fgoldfishdemo.loophole.site'>\n            \u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_bc9b6f94c8d9.png' width=200 height=200>\n        \u003C\u002Fa>\n        \u003Cdiv>\n            \u003Cfont size=3>\n                \u003Cdiv>\n                    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_bc9b6f94c8d9.png\" width=18>\n                    \u003Ca href=\"https:\u002F\u002Fvision-cair.github.io\u002FGoldfish_website\u002F\">Project Page\u003C\u002Fa>\n                    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.12679\">📝 arXiv Paper\u003C\u002Fa>\n                    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FVision-CAIR\u002FTVQA-Long\u002Ftree\u002Fmain\">🤗 TVQA-Long Dataset\u003C\u002Fa>\n                \u003C\u002Fdiv>\n            \u003C\u002Ffont>\n        \u003C\u002Fdiv>\n    \u003C\u002Fdiv>\n    \u003Cdiv style=\"text-align: center;\">\n        \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVision-CAIR\u002FMiniGPT4-video'>\n            \u003Cimg 
src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_b1185abc61d3.png' width=200 height=200>\n        \u003C\u002Fa>\n        \u003Cdiv>\n            \u003Cfont size=3>\n                \u003Cdiv>\n                    \u003Ca href=\"https:\u002F\u002Fvision-cair.github.io\u002FMiniGPT4-video\u002F\">🎞️ Project Page\u003C\u002Fa>\n                    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03413\">📝 arXiv Paper\u003C\u002Fa>\n                \u003C\u002Fdiv>\n            \u003C\u002Ffont>\n        \u003C\u002Fdiv>\n    \u003C\u002Fdiv>\n\u003C\u002Fdiv>\n\n\n![Goldfish_teaser_fig](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_ddb77f93540e.jpg)\n## Overview\nMost current LLM-based models for video understanding can\nprocess videos within minutes but struggle with processing lengthy videos\ndue to the “noise and redundancy challenge” and “memory and compu-\ntation” challenges. In this paper, we present Goldfish, a methodology\ntailored for comprehending videos of arbitrary lengths. We also introduce\nthe TVQA-long benchmark, specifically designed to evaluate models’\ncapabilities in understanding long videos with questions in both vision\nand text content. Goldfish approaches these challenges with an efficient\nretrieval mechanism that initially gathers the top-k video clips relevant to\nthe instruction before proceeding to provide the desired response. This de-\nsign of the retrieval mechanism enables the Goldfish to efficiently process\narbitrarily long video sequences, facilitating its application in contexts\nsuch as movies or television series. To facilitate the retrieval process, we\ndeveloped MiniGPT4-Video that generates detailed descriptions for the\nvideo clips. 
In addressing the scarcity of benchmarks for long video\nevaluation, we adapted the TVQA short video benchmark for extended content\nanalysis by aggregating questions from entire episodes, thereby shifting\nthe evaluation from partial to full episode comprehension. We attained a\n41.78% accuracy rate on the TVQA-long benchmark, surpassing previous\nmethods by 14.94%. Our MiniGPT4-Video also shows exceptional\nperformance in short video comprehension, exceeding existing state-of-the-art\nmethods by 3.23%, 2.03%, 16.5% and 23.59% on the MSVD, MSRVTT,\nTGIF, and TVQA short video benchmarks, respectively. These results\nindicate that our models have significant improvements in both long and\nshort-video understanding.\n### Goldfish framework (Long videos)\n![methodology](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_931b1e5a17f8.jpg)\u003Cbr>\n![Goldfish demo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_261d68770b64.jpg)\n### MiniGPT4-Video (Short 
videos)\n![methodology](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_e8017bafc20b.jpg)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fzeroshot-video-question-answer-on-tgif-qa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-tgif-qa?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fzero-shot-video-question-answer-on-tvqa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzero-shot-video-question-answer-on-tvqa?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fvideo-based-generative-performance-1)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fvideo-based-generative-performance-1?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fvideo-based-generative-performance-3)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fvideo-based-generative-performance-3?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fvideo-based-generative-performance-4)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fvideo-based-generative-performance-4?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-v
ideo-advancing-multimodal-llms-for\u002Fvideo-based-generative-performance-5)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fvideo-based-generative-performance-5?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fvideo-based-generative-performance-2)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fvideo-based-generative-performance-2?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fzeroshot-video-question-answer-on-msvd-qa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-msvd-qa?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fzeroshot-video-question-answer-on-msrvtt-qa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-msrvtt-qa?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fzeroshot-video-question-answer-on-activitynet)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-activitynet?p=minigpt4-video-advancing-multimodal-llms-for)\n\n![demo_1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_146617407452.gif)\n![demo_2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_a9321ca8d399.gif)\n![demo_3](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_e9a462c0ae5b.gif) \n## 
:rocket: Demo\n**1. Clone the repository** \u003Cbr>\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT4-video.git\ncd MiniGPT4-video\n```\n\n**2. Set up the environment** \u003Cbr>\n```bash\nconda env create -f environment.yml\n```\n**3. Download the checkpoints**\n\n| MiniGPT4-Video (Llama2 Chat 7B) | MiniGPT4-Video (Mistral 7B) |\n:------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:\n| [Download](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_llama_checkpoint_last.pth) | [Download](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_mistral_checkpoint_last.pth) |\n\n**4. Run the demo** \u003Cbr>\nGoldfish demo \n```bash\n# For recommended performance, add the parameter --use_openai_embedding True to the command below and set the API key in the environment variable OPENAI_API_KEY otherwise the model will use the default embeddings.\nexport OPENAI_API_KEY=\"your_openai_key\" \n# Llama2\npython goldfish_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs\u002Fllama2_test_config.yaml \n# Mistral\npython goldfish_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs\u002Fmistral_test_config.yaml\n```\nMiniGPT4-Video demo\n```bash\n# Llama2\npython minigpt4_video_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs\u002Fllama2_test_config.yaml\n# Mistral\npython minigpt4_video_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs\u002Fmistral_test_config.yaml\n```\n### Inference\nDo the previous steps and replace step 4 with this step \u003Cbr>\nGoldfish inference\n```bash\n# For recommended performance, add the parameter --use_openai_embedding True to the command below and set the API key in the environment variable 
OPENAI_API_KEY; otherwise the model will use the default embeddings.\nexport OPENAI_API_KEY=\"your_openai_key\"\n# Llama2\npython goldfish_inference.py --ckpt path_to_llama2_checkpoint --cfg-path test_configs\u002Fllama2_test_config.yaml --video_path path_to_video --question \"Your question here\"\n# Mistral\npython goldfish_inference.py --ckpt path_to_mistral_checkpoint --cfg-path test_configs\u002Fmistral_test_config.yaml --video_path path_to_video --question \"Your question here\"\n```\nMiniGPT4-Video inference\n```bash\n# Llama2\npython minigpt4_video_inference.py --ckpt path_to_llama2_checkpoint --cfg-path test_configs\u002Fllama2_test_config.yaml --video_path path_to_video --question \"Your question here\"\n# Mistral\npython minigpt4_video_inference.py --ckpt path_to_mistral_checkpoint --cfg-path test_configs\u002Fmistral_test_config.yaml --video_path path_to_video --question \"Your question here\"\n```\n## :fire: Training\nFor both Goldfish and MiniGPT4-Video, the only trained component is the MiniGPT4-Video model. 
\u003Cbr>\n### To customize MiniGPT4-Video for your own video-text dataset\n\u003C!-- point to file here Custom_training.md -->\nYou can find the steps to customize MiniGPT4-Video for your own video-text dataset in [Custom_training.md](Custom_training.md)\n### Training datasets\nAfter downloading the datasets below, **go to the dataset configuration folder `minigpt4\u002Fconfigs\u002Fdatasets` and set the paths for each dataset there.**\u003Cbr>\nImage-text training:\u003Cbr>\nYou can find the steps to download the datasets in [MiniGPT4](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4\u002Ftree\u002Fmain\u002Fdataset)\u003Cbr>\n+ LAION \u003Cbr>\n+ Conceptual Captions \u003Cbr>\n+ SBU \u003Cbr>\n\nVideo-text training:\u003Cbr>\n\n+ [CMD](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fdata\u002Fcondensed-movies\u002F) \u003Cbr>\n+ [Webvid](https:\u002F\u002Fgithub.com\u002Fm-bain\u002Fwebvid\u002F) \u003Cbr> \u003C!-- + [Webvid](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FTempoFunk\u002Fwebvid-10M?row=2) \u003Cbr> -->\n+ [Video Instructional Dataset 100K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FVideoInstruct-100K) \u003Cbr>\n\nYou can find the annotation files for the video-text datasets here: [download](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Ftree\u002Fmain\u002Fdatasets\u002Ftraining_datasets) \u003Cbr>\n\n\n### Model training: \nYou can edit the number of GPUs in each training script below.\u003Cbr>\n#### Stage 1 (image text pretraining)\n\nYou can directly download the pretrained MiniGPT4 [checkpoint](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F11nAPjEok8eAGGEG1N2vXo3kBLCg0WgUk\u002Fview?usp=sharing) aligned with Llama2. 
\u003Cbr>\n\nOr train by yourself:\n\n```bash\n# pretrain\n# Llama2\ntorchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs\u002F224_minigpt4_llama2_image.yaml\n# Mistral\ntorchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs\u002F224_minigpt4_mistral_image.yaml\n\n# align\n# To launch the second stage alignment, first specify the path to the checkpoint file trained in pretrain stage.\n# Llama2\ntorchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs\u002F224_minigpt4_llama2_image_align.yaml\n# Mistral\ntorchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs\u002F224_minigpt4_mistral_image_align.yaml\n```\nYou can download our trained weights for this stage from here [Llama2](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fimage_llama2_checkpoint.pth) [Mistral](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fimage_mistral_checkpoint.pth)\u003Cbr>\n#### Stage 2 (video captioning pretraining)\n\nFor **Llama2** \u003Cbr>\nset the cfg-path in the script to `train_configs\u002F224_v2_llama2_video_stage_2.yaml` \u003Cbr>\nset the model name here `minigpt4\u002Fconfigs\u002Fdatasets\u002Fcmd_video\u002Fdefault.yaml` and `minigpt4\u002Fconfigs\u002Fdatasets\u002Fwebvid\u002Fdefault.yaml` to llama2\u003Cbr>\nFor **Mistral**\u003Cbr> \nset the cfg-path in the script to `train_configs\u002F224_v2_mistral_video_stage_2.yaml` \u003Cbr>\nset the model name here `minigpt4\u002Fconfigs\u002Fdatasets\u002Fcmd_video\u002Fdefault.yaml` and `minigpt4\u002Fconfigs\u002Fdatasets\u002Fwebvid\u002Fdefault.yaml` to mistral\u003Cbr>\n\n```bash\nbash training_scripts\u002Fstage_2.sh\n```\nYou can download our trained weights for this stage from here [Llama2](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_captioning_llama_checkpoint_last.pth) 
[Mistral](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_captioning_mistral_checkpoint_last.pth)\u003Cbr>\n\n#### Stage 3 (video Instruction finetuning)\n\nFor **Llama2** \u003Cbr>\nset the cfg-path in the script to `train_configs\u002F224_v2_llama2_video_stage_3.yaml` \u003Cbr>\nset the model name here `minigpt4\u002Fconfigs\u002Fdatasets\u002Fvideo_chatgpt\u002Fdefault.yaml` to llama2\u003Cbr>\n\nFor **Mistral**\u003Cbr> \nset the cfg-path in the script to `train_configs\u002F224_v2_mistral_video_stage_3.yaml` \u003Cbr>\nset the model name here `minigpt4\u002Fconfigs\u002Fdatasets\u002Fvideo_chatgpt\u002Fdefault.yaml` to mistral\u003Cbr>\n\n```bash\nbash training_scripts\u002Fstage_3.sh\n```\nYou can download our trained weights for this stage from here [Llama2](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_llama_checkpoint_last.pth) [Mistral](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_mistral_checkpoint_last.pth)\u003Cbr>\n\n## :zap: MiniGPT4-Video Evaluation\nTo reproduce the results use the best checkpoints for each model \u003Cbr>\n[Llama2](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_llama_checkpoint_best.pth) [Mistral](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_mistral_checkpoint_best.pth)\u003Cbr>\nWe used the same evaluation as [Video-ChatGPT](https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-ChatGPT\u002F)\u003Cbr>\n\n|Method| Using Subtitles | Information Correctness | Detailed Orientation | Contextual Understanding | Temporal Understanding | Consistency 
|\n|:--------------------:|:----:|:------------------------:|:---------------------:|:-------------------------:|:-----------------------:|:------------:|\n| LLaMA Adapter | :x:| 2.03 | 2.32| 2.30| 1.98| 2.15 |\n| Video LLaMA| :x:| 1.96 | 2.18| 2.16| 1.82| 1.79 |\n| Video Chat| :x:| 2.23 | 2.50| 2.53| 1.94| 2.24 |\n| Video-ChatGPT | :x:| 2.40 | 2.52| 2.62| 1.98| 2.37 |\n| BT-Adapter-7B | :x:| 2.68 | 2.69| 3.27| 2.34| 2.46 |\n| LLaMA-VID-7B| :x:| 2.96 | 3.00| 3.53| 2.46| 2.51 |\n| **Ours-7B Llama2**| :x:| 2.93 | 2.97| 3.45| **2.47**| **2.60**|\n| **Ours-7B Llama2**| :white_check_mark:| **3.08** | **3.02**| **3.57**| **2.65**| **2.67**|\n| **Ours-7B Mistral** | :x:| 2.83|2.52 |3.01 |2.32 |2.40 |\n| **Ours-7B Mistral**| :white_check_mark:| 2.91 | 2.57| 3.11|2.33 | 2.39|\n\n\n\n|Method| Using Subtitles | MSVD Acc.↑ | MSVD Score↑ | MSRVTT Acc.↑ | MSRVTT Score↑ | TGIF Acc.↑ | TGIF Score↑ | ActivityNet Acc.↑ | ActivityNet Score↑ | TVQA Acc.↑ |\n|:---------------------------------------:|:----------------:|:-----------:|:------------:|:--------------:|:---------------:|:-----------:|:------------:|:-------------------:|:--------------------:|:------------:|\n| FrozenBiLM|:x:|32.2| --|16.8 |--| 41 |-- |24.7|--|29.7 |\n| LLaMA Adapter|:x:|54.9| 3.1 |43.8 |2.7| -- |-- |34.2| 2.7| --|\n| Video LLaMA|:x:|51.6| 2.5 |29|1.8| -- |-- |12.4| 1.1| --|\n| Video Chat|:x:|56.3| 2.8 |45|2.5|34.4| 2.3 |26.5| 2.2|--|\n| Video-ChatGPT|:x:|64.9| 3.3 |49.3 |2.8|51.4| 3.0 |35.2| 2.7|23.35|\n| BT-Adapter-7B|:x:|67.7| 3.7 |57|3.2| -- |-- |45.7| 3.2| --|\n| LLaMA-VID-7B |:x:|69.7| 3.7 |57.7 |3.2| -- |-- |**47.4**| **3.3**| --|\n| **Ours-7B LLama2**|:x:|72.93|3.84|58.83|3.29|67.9|3.71| 45.85 |3.23|36.45|\n| **Ours-7B Llama2**|:white_check_mark:|72.93|3.84|**59.73**|**3.3** |67.9|3.71| 46.3|3.4 |46.94|\n| **Ours-7B Mistral**|:x:|**73.92**|**4.06**|58.26|3.52|**72.22**|**4.08**|44.25 |3.35|33.90|\n| **Ours-7B Mistral**|:white_check_mark:|**73.92**|**4.06**|58.68|3.53 |**72.22**|**4.08**| 44.38|3.36 
|**54.21** |\n\n### Download datasets for evaluation\n+ [MSVD](https:\u002F\u002Fwww.cs.utexas.edu\u002Fusers\u002Fml\u002Fclamp\u002FvideoDescription\u002F) \u003Cbr>\n+ [MSRVTT](https:\u002F\u002Fcove.thecvf.com\u002Fdatasets\u002F839) \u003Cbr>\n+ [TGIF](https:\u002F\u002Fgithub.com\u002FYunseokJANG\u002Ftgif-qa\u002Fblob\u002Fmaster\u002Fdataset\u002FREADME.md) \u003Cbr>\n+ [ActivityNet](https:\u002F\u002Fmbzuaiac-my.sharepoint.com\u002F:u:\u002Fg\u002Fpersonal\u002Fhanoona_bangalath_mbzuai_ac_ae\u002FESa302OCJMNHsMk7wuBbQc8BZH5CqlcdCWiSpXynQZDfAQ?e=CrOPbm) \u003Cbr>\n+ [TVQA](https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fjielei\u002Ftvqa\u002Ftvqa_public_html\u002Fdownload_tvqa.html) \u003Cbr>\n+ [Video-ChatGPT benchmark](https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-ChatGPT\u002F) \u003Cbr>\n\nYou can find the annotation files for the evaluation datasets here: [download](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Ftree\u002Fmain\u002Fdatasets\u002Fevaluation_datasets) \u003Cbr>\n\nSubtitles for MSR-VTT and ActivityNet are available here: [download](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fresolve\u002Fmain\u002Fdatasets\u002Fevaluation_subtitles.zip)\nNote that these subtitles were generated using the \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper\">Whisper model\u003C\u002Fa>.\u003Cbr>\nTVQA subtitles can be downloaded from [here](https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fjielei\u002Ftvqa\u002Ftvqa_public_html\u002Fdownload_tvqa.html)\n### Run evaluation script\nSet the evaluation parameters in the script \u003Cbr>\n```\nNAME=\"\" # Name of the experiment\nBATCH_SIZE=8 # batch size\nCKPT_PATH=\"\" # path to the checkpoint\nDATASET=\"msvd\" # dataset name, available datasets: tvqa, msrvtt, msvd, activitynet, tgif, video_chatgpt_generic, video_chatgpt_temporal, video_chatgpt_consistency\n# set the paths to the dataset files\nvideos_path=\"\" # path to the videos file\nsubtitles_path=\"\" # 
path to the subtitles file if the dataset is msrvtt, activitynet or tvqa else set it to \"\"\nann_path=\"\" # path to the annotations file\ncfg_path=\"\" # path to the config file\n```\n\u003Cbr> \n\n```bash\nbash evaluation\u002Fminigpt4_video_eval\u002Fminigpt4_video_evalualtion.sh\n```\nThen use GPT-3.5 Turbo to compare the predictions with the ground truth and generate the accuracy and scores \u003Cbr>\nSet these variables in both evaluate_benchmark.sh and evaluate_zeroshot.sh \u003Cbr>\n```bash\nPRED=\"path_to_predictions\"\nOUTPUT_DIR=\"path_to_output_dir\"\nAPI_KEY=\"openAI_key\"\nNUM_TASKS=128\n```\nThen, to evaluate on the Video-ChatGPT benchmark, run the following script \u003Cbr>\n```bash\nbash GPT_evaluation\u002Fevaluate_benchmark.sh\n```\nTo evaluate open-ended questions, run the following script \u003Cbr>\n```bash\nbash GPT_evaluation\u002Fevaluate_zeroshot.sh\n```\n\n## :zap: Goldfish Evaluation\n**Long video benchmarking results on four benchmarks: LLama-Vid, MovieChat, Movie QA, and our proposed TVQA-Long. 
The \"V\" modality indicates the use of video frames only, while \"V+T\" indicates the use of both video frames and subtitles.**\n\n\u003C!-- ![Goldfish results](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_1a6d09e1a2c2.jpg) -->\n| Method      | Modalities | LLama-Vid Acc.↑ | LLama-Vid Score↑ | MovieChat Acc.↑ | MovieChat Score↑ | Movie QA Acc.↑ | Movie QA Score↑ | TVQA-Long Acc.↑ | TVQA-Long Score↑ |\n|-------------|------------|-----------------|------------------|-----------------|------------------|----------------|-----------------|------------|-------------|\n| LLAMA-VID   | V          | 20.68           | 2.41             | 53.2            | 3.81             | 24.42          | 2.19            | 24.63      | 2.16        |\n| MovieChat   | V          | 11.71           | 1.45             | NA              | NA               | 16.18          | 1.68            | 5.0        | 0.86        |\n| Ours        | V          | **23.09**       | 2.19             | **67.6**        | **4.23**         | **28.49**      | **2.8**         | **28.61**  | **2.78**    |\n| LLAMA-VID   | V+T        | 41.4†           | 3.07†            | NA              | NA               | 37.65†         | 3.03†           | 26.86      | 2.21        |\n| Ours        | V+T        | 31.49           | 2.48             | NA              | NA               | 35.24          | **3.1**         | **41.78**  | **3.21**    |\n\n**Note: The dagger † symbol indicates that the method used the benchmark during training, which makes the comparison unfair.**\n\nTo reproduce the results, use the `checkpoints\u002Fvideo_llama_checkpoint_last.pth` checkpoint and enable OpenAI embeddings with `--use_openai_embedding=True`\u003Cbr>\n### Download datasets for evaluation\nFor **Llama-vid** and **MovieQA** \u003Cbr>\nDownload the original MovieNet data with movies and annotations from [here](https:\u002F\u002Fopendatalab.com\u002FOpenDataLab\u002FMovieNet\u002Ftree\u002Fmain\u002Fraw)\u003Cbr>\nThis will be 
the source videos for Llama-vid and MovieQA \u003Cbr>\n#### Filtered annotations, as described in the paper and used for evaluation\n[Llama-vid](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Ftree\u002Fmain\u002Fdatasets\u002Fgoldfish_eval_datasets\u002Fllama_vid)\u003Cbr>\n[MovieQA](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Ftree\u002Fmain\u002Fdatasets\u002Fgoldfish_eval_datasets\u002Fmovie_qa)\u003Cbr>\nFor **MovieChat**, only 10% of the training videos were available while this work was being implemented. That subset is what we used for evaluation, and it is listed [here](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fdatasets\u002Fgoldfish_eval_datasets\u002Fmovie_chat\u002Favailable_movies_list.txt) \u003Cbr>\nThe full dataset can be found [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEnxin\u002FMovieChat-1K_train\u002Ftree\u002Fmain) \u003Cbr>\nFor **TVQA-Long** \u003Cbr>\nIf you want to use TVQA-Long with another model (Llama-vid), both videos and annotations can be found here: [TVQA-Long](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FVision-CAIR\u002FTVQA-Long\u002Ftree\u002Fmain). 
\nFor Goldfish evaluation, we use the separate clips from the original TVQA dataset \u003Cbr>\n### Run the evaluation scripts \n```bash\n# Llama-vid evaluation\n# set these parameters in the script\nvideos_path=\"path to the videos\"\nsubtitle_path=\"path to the subtitles\"\nvideo_clips_saving_path=\"path to save the video clips\"\nannotation_file=\"path to the annotation file\"\nmovienet_annotations_dir=\"path to the movienet annotations directory\"\nNEIGHBOURS=3\nuse_openai_embedding=\"whether to use openai embeddings or not\"\n# then run the script\nbash evaluation\u002FGoldfish_eval\u002Fmovies\u002Feval_model_summary_llama_vid.sh\n\n# MovieQA evaluation\n# same as above but set the parameters in the script to the MovieQA paths\nbash evaluation\u002FGoldfish_eval\u002Fmovies\u002Feval_model_summary_movie_qa.sh\n\n# MovieChat evaluation\n# set these parameters in the script\ndataset_path=\"path to the movies folder\"\nannotation_json_folder=\"path to the jsons folder\"\n# then run the script\nbash evaluation\u002FGoldfish_eval\u002Fmovies\u002Feval_model_summary_movie_chat.sh\n```\n### TVQA-Long\nFor Goldfish evaluation, we can use the separate clips from the original TVQA dataset \u003Cbr>\nDownload the original TVQA videos and the clip subtitles for short videos from [here](https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fjielei\u002Ftvqa\u002Ftvqa_public_html\u002Fdownload_tvqa.html)\u003Cbr>\ntvqa_long_annotation [here](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Ftree\u002Fmain\u002Fdatasets\u002Fgoldfish_eval_datasets\u002Ftvqa\u002Ftvqa_val_edited.json) \u003Cbr>\ntvqa_json_subtitles [here](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Ftree\u002Fmain\u002Fdatasets\u002Fgoldfish_eval_datasets\u002Ftvqa\u002Ftvqa_preprocessed_subtitles.json)\u003Cbr>\n\n```bash \n# set these parameters in the script\ntvqa_json_subtitles=\"path to the tvqa json subtitles 
file\"\ntvqa_clips_subtitles=\"path to the tvqa clips subtitles\"\nvideos_frames=\"path to the video frames\"\ntvqa_long_annotation=\"path to the TVQA-Long annotation file\"\nNEIGHBOURS=3\nuse_openai_embedding=\"whether to use openai embeddings or not\"\n# then run the script\nbash evaluation\u002FGoldfish_eval\u002Ftvqa_eval\u002Feval_model_summary.sh\n```\n\nThen use GPT-3.5 Turbo to compare the predictions with the ground truth and compute the accuracy and scores. \u003Cbr>\nSet these variables in evaluate_zeroshot.sh: \u003Cbr>\n```bash\nPRED=\"path_to_predictions\"\nOUTPUT_DIR=\"path_to_output_dir\"\nAPI_KEY=\"openAI_key\"\nNUM_TASKS=128\n```\nTo evaluate open-ended questions, run the following script: \u003Cbr>\n```bash\nbash GPT_evaluation\u002Fevaluate_zeroshot.sh\n```\n\n## Citation\nIf you're using MiniGPT4-Video or Goldfish in your research or applications, please cite using this BibTeX:\n```\n@misc{ataallah2024goldfishvisionlanguageunderstandingarbitrarily,\n      title={Goldfish: Vision-Language Understanding of Arbitrarily Long Videos}, \n      author={Kirolos Ataallah and Xiaoqian Shen and Eslam Abdelrahman and Essam Sleiman and Mingchen Zhuge and Jian Ding and Deyao Zhu and Jürgen Schmidhuber and Mohamed Elhoseiny},\n      year={2024},\n      eprint={2407.12679},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.12679}, \n}\n@article{ataallah2024minigpt4,\n  title={MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens},\n  author={Ataallah, Kirolos and Shen, Xiaoqian and Abdelrahman, Eslam and Sleiman, Essam and Zhu, Deyao and Ding, Jian and Elhoseiny, Mohamed},\n  journal={arXiv preprint arXiv:2404.03413},\n  year={2024}\n}\n\n```\n\n## Acknowledgements\n[MiniGPT4](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4) \u003Cbr>\n[Video-ChatGPT](https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-ChatGPT)\n\n## License\nThis 
repository is under [BSD 3-Clause License](LICENSE.md).\nMuch of the code is based on [MiniGPT4](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4).\n","# [ECCV 2024 接受]金鱼：任意长度视频的视觉-语言理解\n# [CVPR2024W]MiniGPT4-Video：通过交错的视觉-文本标记推进用于视频理解的多模态大模型\n**本仓库包含用于短视频理解的MiniGPT4-video和用于长视频理解的金鱼的代码。**\n\u003Ch3 style=\"text-align: center;\">在线演示\u003C\u002Fh3>\n\u003Cdiv style=\"display: flex; justify-content: center; gap: 40px;\">\n    \u003Cdiv style=\"text-align: center;\">\n        \u003Ca href='https:\u002F\u002Fgoldfishdemo.loophole.site'>\n            \u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_bc9b6f94c8d9.png' width=200 height=200>\n        \u003C\u002Fa>\n        \u003Cdiv>\n            \u003Cfont size=3>\n                \u003Cdiv>\n                    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_bc9b6f94c8d9.png\" width=18>\n                    \u003Ca href=\"https:\u002F\u002Fvision-cair.github.io\u002FGoldfish_website\u002F\">项目页面\u003C\u002Fa>\n                    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.12679\">📝 arXiv论文\u003C\u002Fa>\n                    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FVision-CAIR\u002FTVQA-Long\u002Ftree\u002Fmain\">🤗 TVQA-Long数据集\u003C\u002Fa>\n                \u003C\u002Fdiv>\n            \u003C\u002Ffont>\n        \u003C\u002Fdiv>\n    \u003C\u002Fdiv>\n    \u003Cdiv style=\"text-align: center;\">\n        \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVision-CAIR\u002FMiniGPT4-video'>\n            \u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_b1185abc61d3.png' width=200 height=200>\n        \u003C\u002Fa>\n        \u003Cdiv>\n            \u003Cfont size=3>\n                \u003Cdiv>\n                    \u003Ca href=\"https:\u002F\u002Fvision-cair.github.io\u002FMiniGPT4-video\u002F\">🎞️ 
项目页面\u003C\u002Fa>\n                    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03413\">📝 arXiv论文\u003C\u002Fa>\n                \u003C\u002Fdiv>\n            \u003C\u002Ffont>\n        \u003C\u002Fdiv>\n    \u003C\u002Fdiv>\n\u003C\u002Fdiv>\n\n\n![Goldfish_teaser_fig](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_ddb77f93540e.jpg)\n## 概述\n当前大多数基于大模型的视频理解模型能够处理几分钟以内的视频，但在处理较长视频时却面临“噪声与冗余挑战”以及“内存与计算挑战”。在本文中，我们提出了金鱼方法，该方法专为理解任意长度的视频而设计。同时，我们还引入了TVQA-long基准，专门用于评估模型在理解和回答涉及视觉与文本内容问题的长视频方面的能力。金鱼通过一种高效的检索机制来应对这些挑战：首先根据指令筛选出最相关的前k个视频片段，然后再生成所需的响应。这种检索机制的设计使得金鱼能够高效地处理任意长度的视频序列，从而使其适用于电影或电视剧等场景。为了便于检索过程，我们开发了MiniGPT4-Video，它可以为视频片段生成详细的描述。针对长视频评估基准稀缺的问题，我们将TVQA短视频基准扩展应用于更长内容的分析，通过汇总整集节目中的问题，将评估重点从部分片段转向整集的理解。我们在TVQA-long基准上取得了41.78%的准确率，比现有方法高出14.94%。我们的MiniGPT4-Video在短视频理解方面也表现出色，在MSVD、MSRVTT、TGIF和TVQA短视频基准上分别超越了现有最先进方法3.23%、2.03%、16.5%和23.59%。这些结果表明，我们的模型在长视频和短视频理解方面都取得了显著提升。\n### 金鱼框架（长视频）\n![methodology](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_931b1e5a17f8.jpg)\u003Cbr>\n![Gold ish demo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_261d68770b64.jpg)\n### MiniGPT4-Video 
（短视频）\n![methodology](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_e8017bafc20b.jpg)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fzeroshot-video-question-answer-on-tgif-qa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-tgif-qa?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fzero-shot-video-question-answer-on-tvqa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzero-shot-video-question-answer-on-tvqa?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fvideo-based-generative-performance-1)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fvideo-based-generative-performance-1?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fvideo-based-generative-performance-3)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fvideo-based-generative-performance-3?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fvideo-based-generative-performance-4)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fvideo-based-generative-performance-4?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-vid
eo-advancing-multimodal-llms-for\u002Fvideo-based-generative-performance-5)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fvideo-based-generative-performance-5?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fvideo-based-generative-performance-2)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fvideo-based-generative-performance-2?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fzeroshot-video-question-answer-on-msvd-qa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-msvd-qa?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fzeroshot-video-question-answer-on-msrvtt-qa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-msrvtt-qa?p=minigpt4-video-advancing-multimodal-llms-for)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fminigpt4-video-advancing-multimodal-llms-for\u002Fzeroshot-video-question-answer-on-activitynet)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-activitynet?p=minigpt4-video-advancing-multimodal-llms-for)\n\n![demo_1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_146617407452.gif)\n![demo_2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_a9321ca8d399.gif)\n![demo_3](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_e9a462c0ae5b.gif)\n\n## 
:rocket: 演示\n**1. 克隆仓库** \u003Cbr>\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT4-video.git\ncd MiniGPT4-video\n```\n\n**2. 配置环境** \u003Cbr>\n```bash\nconda env create -f environment.yml\n```\n**3. 下载检查点**\n\n| MiniGPT4-Video (Llama2 Chat 7B) | MiniGPT4-Video (Mistral 7B) |\n:------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:\n| [下载](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_llama_checkpoint_last.pth) | [下载](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_mistral_checkpoint_last.pth) |\n\n**4. 运行演示** \u003Cbr>\nGoldfish 演示 \n```bash\n# 为获得推荐性能，请在以下命令中添加参数 --use_openai_embedding True，并将 API 密钥设置到环境变量 OPENAI_API_KEY 中；否则模型将使用默认嵌入。\nexport OPENAI_API_KEY=\"your_openai_key\" \n# Llama2\npython goldfish_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs\u002Fllama2_test_config.yaml \n# Mistral\npython goldfish_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs\u002Fmistral_test_config.yaml\n```\nMiniGPT4-Video 演示\n```bash\n# Llama2\npython minigpt4_video_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs\u002Fllama2_test_config.yaml\n# Mistral\npython minigpt4_video_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs\u002Fmistral_test_config.yaml\n```\n### 推理\n按照上述步骤操作，然后用此步骤替换第 4 步 \u003Cbr>\nGoldfish 推理\n```bash\n# 为获得推荐性能，请在以下命令中添加参数 --use_openai_embedding True，并将 API 密钥设置到环境变量 OPENAI_API_KEY 中；否则模型将使用默认嵌入。\nexport OPENAI_API_KEY=\"your_openai_key\" \n# Llama2\npython goldfish_inference.py --ckpt path_to_llama2_checkpoint --cfg-path test_configs\u002Fllama2_test_config.yaml --video_path path_to_video --question \"Your question here\" \n# Mistral\npython goldfish_inference.py --ckpt 
path_to_mistral_checkpoint --cfg-path test_configs\u002Fmistral_test_config.yaml --video_path path_to_video --question \"Your question here\" \n```\nMiniGPT4-Video 推理\n```bash\n# Llama2\npython minigpt4_video_inference.py --ckpt path_to_llama2_checkpoint --cfg-path test_configs\u002Fllama2_test_config.yaml --video_path path_to_video --question \"Your question here\" \n# Mistral\npython minigpt4_video_inference.py --ckpt path_to_mistral_checkpoint --cfg-path test_configs\u002Fmistral_test_config.yaml --video_path path_to_video --question \"Your question here\" \n```\n## :fire: 训练\n对于 Goldfish 和 MiniGPT4-Video，唯一需要训练的部分是 MiniGPT4-Video 模型。 \u003Cbr>\n### 为您的专属视频-文本数据集定制 MiniGPT4-Video \n\u003C!-- point to file here Custom_training.md -->\n您可以在 [Custom_training.md](Custom_training.md) 中找到为您的视频-文本数据集定制 MiniGPT4-Video 的步骤。\n### 训练数据集\n下载以下数据集后，**请前往数据集配置文件夹 minigpt4\u002Fconfigs\u002Fdatasets，在那里设置每个数据集的路径。**\u003Cbr>\n图像-文本训练\u003Cbr>\n您可以在 [MiniGPT4](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4\u002Ftree\u002Fmain\u002Fdataset) 中找到下载这些数据集的步骤。\u003Cbr>\n+ LAION \u003Cbr>\n+ Conceptual Captions \u003Cbr>\n+ SBU \u003Cbr>\n\n视频-文本训练：\u003Cbr>\n\n+ [CMD](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fdata\u002Fcondensed-movies\u002F) \u003Cbr>\n+ [Webvid](https:\u002F\u002Fgithub.com\u002Fm-bain\u002Fwebvid\u002F) \u003Cbr> \u003C!-- + [Webvid](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FTempoFunk\u002Fwebvid-10M?row=2) \u003Cbr> -->\n+ [Video Instructional Dataset 100K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FVideoInstruct-100K) \u003Cbr>\n\n您可以在这里找到视频-文本数据集的标注文件 [下载](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Ftree\u002Fmain\u002Fdatasets\u002Ftraining_datasets) \u003Cbr>\n\n\n### 模型训练： \n您可以在下面的每个 script.sh 文件中调整 GPU 数量。\u003Cbr>\n#### 第一阶段（图像-文本预训练）\n\n您可以直接下载与 Llama2 对齐的预训练 MiniGPT4 
[检查点](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F11nAPjEok8eAGGEG1N2vXo3kBLCg0WgUk\u002Fview?usp=sharing)。 \u003Cbr>\n\n或者自行训练：\n\n```bash\n# 预训练\n# Llama2\ntorchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs\u002F224_minigpt4_llama2_image.yaml\n# Mistral\ntorchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs\u002F224_minigpt4_mistral_image.yaml\n\n# 对齐\n# 要启动第二阶段对齐，首先指定在预训练阶段训练好的检查点文件路径。\n# Llama2\ntorchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs\u002F224_minigpt4_llama2_image_align.yaml\n# Mistral\ntorchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs\u002F224_minigpt4_mistral_image_align.yaml\n```\n您可以从这里下载我们在此阶段训练好的权重 [Llama2](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fimage_llama2_checkpoint.pth) [Mistral](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fimage_mistral_checkpoint.pth)\u003Cbr>\n#### 第二阶段（视频字幕预训练）\n\n对于 **Llama2** \u003Cbr>\n将脚本中的 cfg-path 设置为 `train_configs\u002F224_v2_llama2_video_stage_2.yaml` \u003Cbr>\n并将模型名称在此处设置为 `minigpt4\u002Fconfigs\u002Fdatasets\u002Fcmd_video\u002Fdefault.yaml` 和 `minigpt4\u002Fconfigs\u002Fdatasets\u002Fwebvid\u002Fdefault.yaml` 为 llama2\u003Cbr>\n对于 **Mistral**\u003Cbr> \n将脚本中的 cfg-path 设置为 `train_configs\u002F224_v2_mistral_video_stage_2.yaml` \u003Cbr>\n并将模型名称在此处设置为 `minigpt4\u002Fconfigs\u002Fdatasets\u002Fcmd_video\u002Fdefault.yaml` 和 `minigpt4\u002Fconfigs\u002Fdatasets\u002Fwebvid\u002Fdefault.yaml` 为 mistral\u003Cbr>\n\n```bash\nbash training_scripts\u002Fstage_2.sh\n```\n您可以从这里下载我们在此阶段训练好的权重 [Llama2](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_captioning_llama_checkpoint_last.pth) 
[Mistral](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_captioning_mistral_checkpoint_last.pth)\u003Cbr>\n\n#### 第三阶段（视频指令微调）\n\n对于 **Llama2** \u003Cbr>\n将脚本中的 cfg-path 设置为 `train_configs\u002F224_v2_llama2_video_stage_3.yaml` \u003Cbr>\n并将模型名称在此处设置为 `minigpt4\u002Fconfigs\u002Fdatasets\u002Fvideo_chatgpt\u002Fdefault.yaml` 为 llama2\u003Cbr>\n\n对于 **Mistral**\u003Cbr> \n将脚本中的 cfg-path 设置为 `train_configs\u002F224_v2_mistral_video_stage_3.yaml` \u003Cbr>\n并将模型名称在此处设置为 `minigpt4\u002Fconfigs\u002Fdatasets\u002Fvideo_chatgpt\u002Fdefault.yaml` 为 mistral\u003Cbr>\n\n```bash\nbash training_scripts\u002Fstage_3.sh\n```\n您可以从这里下载我们在此阶段训练好的权重 [Llama2](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_llama_checkpoint_last.pth) [Mistral](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_mistral_checkpoint_last.pth)\u003Cbr>\n\n## :zap: MiniGPT4-Video 评估\n要复现结果，请使用每个模型的最佳检查点：\u003Cbr>\n[Llama2](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_llama_checkpoint_best.pth) [Mistral](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_mistral_checkpoint_best.pth)\u003Cbr>\n我们采用了与[Video-ChatGPT](https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-ChatGPT\u002F)相同的评估方法。\u003Cbr>\n\n|方法| 使用字幕 | 信息正确性 | 细节导向性 | 上下文理解 | 时间顺序理解 | 一致性 |\n|:--------------------:|:----:|:------------------------:|:---------------------:|:-------------------------:|:-----------------------:|:------------:|\n| LLaMA Adapter | :x:| 2.03 | 2.32| 2.30| 1.98| 2.15 |\n| Video LLaMA| :x:| 1.96 | 2.18| 2.16| 1.82| 1.79 |\n| Video Chat| :x:| 2.23 | 2.50| 2.53| 1.94| 2.24 |\n| Video-ChatGPT | :x:| 2.40 | 2.52| 2.62| 1.98| 2.37 |\n| BT-Adapter-7B | :x:| 2.68 | 2.69| 3.27| 2.34| 2.46 
|\n| LLaMA-VID-7B| :x:| 2.96 | 3.00| 3.53| 2.46| 2.51 |\n| **我们的7B Llama2**| :x:| 2.93 | 2.97| 3.45| **2.47**| **2.60**|\n| **我们的7B Llama2**| :white_check_mark:| **3.08** | **3.02**| **3.57**| **2.65**| **2.67**|\n| **我们的7B Mistral** | :x:| 2.83|2.52 |3.01 |2.32 |2.40 |\n| **我们的7B Mistral**| :white_check_mark:| 2.91 | 2.57| 3.11|2.33 | 2.39|\n\n\n\n|方法| 使用字幕 | MSVD 准确率↑ | MSVD 分数↑ | MSRVTT 准确率↑ | MSRVTT 分数↑ | TGIF 准确率↑ | TGIF 分数↑ | ActivityNet 准确率↑ | ActivityNet 分数↑ | TVQA 准确率↑ |\n|:---------------------------------------:|:----------------:|:-----------:|:------------:|:--------------:|:---------------:|:-----------:|:------------:|:-------------------:|:--------------------:|:------------:|\n| FrozenBiLM|:x:|32.2| --|16.8 |--| 41 |-- |24.7|--|29.7 |\n| LLaMA Adapter|:x:|54.9| 3.1 |43.8 |2.7| -- |-- |34.2| 2.7| --|\n| Video LLaMA|:x:|51.6| 2.5 |29|1.8| -- |-- |12.4| 1.1| --|\n| Video Chat|:x:|56.3| 2.8 |45|2.5|34.4| 2.3 |26.5| 2.2|--|\n| Video-ChatGPT|:x:|64.9| 3.3 |49.3 |2.8|51.4| 3.0 |35.2| 2.7|23.35|\n| BT-Adapter-7B|:x:|67.7| 3.7 |57|3.2| -- |-- |45.7| 3.2| --|\n| LLaMA-VID-7B |:x:|69.7| 3.7 |57.7 |3.2| -- |-- |**47.4**| **3.3**| --|\n| **我们的7B LLama2**|:x:|72.93|3.84|58.83|3.29|67.9|3.71| 45.85 |3.23|36.45|\n| **我们的7B Llama2**|:white_check_mark:|72.93|3.84|**59.73**|**3.3** |67.9|3.71| 46.3|3.4 |46.94|\n| **我们的7B Mistral**|:x:|**73.92**|**4.06**|58.26|3.52|**72.22**|**4.08**|44.25 |3.35|33.90|\n| **我们的7B Mistral**|:white_check_mark:|**73.92**|**4.06**|58.68|3.53 |**72.22**|**4.08**| 44.38|3.36 |**54.21** |\n\n### 下载用于评估的数据集\n+ [MSVD](https:\u002F\u002Fwww.cs.utexas.edu\u002Fusers\u002Fml\u002Fclamp\u002FvideoDescription\u002F) \u003Cbr>\n+ [MSRVTT](https:\u002F\u002Fcove.thecvf.com\u002Fdatasets\u002F839) \u003Cbr>\n+ [TGIF](https:\u002F\u002Fgithub.com\u002FYunseokJANG\u002Ftgif-qa\u002Fblob\u002Fmaster\u002Fdataset\u002FREADME.md) \u003Cbr>\n+ 
[ActivityNet](https:\u002F\u002Fmbzuaiac-my.sharepoint.com\u002F:u:\u002Fg\u002Fpersonal\u002Fhanoona_bangalath_mbzuai_ac_ae\u002FESa302OCJMNHsMk7wuBbQc8BZH5CqlcdCWiSpXynQZDfAQ?e=CrOPbm) \u003Cbr>\n+ [TVQA](https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fjielei\u002Ftvqa\u002Ftvqa_public_html\u002Fdownload_tvqa.html) \u003Cbr>\n+ [Video-ChatGPT 基准测试](https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-ChatGPT\u002F) \u003Cbr>\n\n您可以在[Hugging Face](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Ftree\u002Fmain\u002Fdatasets\u002Fevaluation_datasets)上找到评估数据集的标注文件。\u003Cbr>\n\nMSR-VTT和ActivityNet的字幕可在此下载：[download](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fresolve\u002Fmain\u002Fdatasets\u002Fevaluation_subtitles.zip)\n请注意，这些字幕是使用\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper\">Whisper模型\u003C\u002Fa>生成的。\u003Cbr>\nTVQA的字幕可以从[这里](https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fjielei\u002Ftvqa\u002Ftvqa_public_html\u002Fdownload_tvqa.html)下载。\n\n### 运行评估脚本\n在脚本中设置每个评估脚本的参数：\u003Cbr>\n```\nNAME=\"\" # 实验名称\nBATCH_SIZE=8 # 批量大小 \nCKPT_PATH=\"\" # 检查点路径\nDATASET=\"msvd\" # 数据集名称，可用数据集：tvqa、msrvtt、msvd、activitynet、tgif、video_chatgpt_generic、video_chatgpt_temporal、video_chatgpt_consistency\n# 设置数据集文件的路径\nvideos_path=\"\" # 视频文件路径\nsubtitles_path=\"\" # 字幕文件路径，如果是msrvtt、activitynet或tvqa，则填写；否则留空\nann_path=\"\" # 注释文件路径\ncfg_path=\"\" # 配置文件路径\n```\n\u003Cbr> \n\n```bash\nbash evaluation\u002Fminigpt4_video_eval\u002Fminigpt4_video_evalualtion.sh\n```\n然后使用GPT3.5 turbo将预测结果与真实答案进行比较，并生成准确率和分数。\u003Cbr>\n在evaluate_benchmark.sh和evaluate_zeroshot.sh中设置以下变量：\u003Cbr>\n```bash\nPRED=\"预测结果路径\"\nOUTPUT_DIR=\"输出目录路径\"\nAPI_KEY=\"openAI密钥\"\nNUM_TASKS=128\n```\n然后，要评估[Video-ChatGPT基准]，运行以下脚本：\u003Cbr>\n```bash\nbash GPT_evaluation\u002Fevaluate_benchmark.sh\n```\n要评估开放式问题，运行以下脚本：\u003Cbr>\n```bash\nbash GPT_evaluation\u002Fevaluate_zeroshot.sh\n```\n\n## :zap: Goldfish 
评估\n**针对四个基准的长视频评测结果：LLama-Vid、MovieChat、Movie QA以及我们提出的TVQA-Long。其中“V”模态表示仅使用视频帧，“V+T”表示同时使用视频帧和字幕**\n\n\u003C!-- ![Goldfish结果](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_readme_1a6d09e1a2c2.jpg) -->\n| 方法      | 模态 | LLama-Vid 准确率↑ | LLama-Vid 分数↑ | MovieChat 准确率↑ | MovieChat 分数↑ | Movie QA 准确率↑ | Movie QA 分数↑ | TVQA-Long 准确率↑ | TVQA-Long 分数↑ |\n|-------------|------------|-----------------|------------------|-----------------|------------------|----------------|-----------------|------------|-------------|\n| LLAMA-VID   | V          | 20.68           | 2.41             | 53.2            | 3.81             | 24.42          | 2.19            | 24.63      | 2.16        |\n| MovieChat   | V          | 11.71           | 1.45             | NA              | NA               | 16.18          | 1.68            | 5.0        | 0.86        |\n| 我们        | V          | **23.09**       | 2.19             | **67.6**        | **4.23**         | **28.49**      | **2.8**         | **28.61**  | **2.78**    |\n| LLAMA-VID   | V+T        | 41.4†           | 3.07†            | NA              | NA               | 37.65†         | 3.03†           | 26.86      | 2.21        |\n| 我们        | V+T        | 31.49           | 2.48             | NA              | NA               | 35.24          | **3.1**             | **41.78**  | **3.21**    |\n\n**注：符号†表示该方法在训练时已使用过该基准，这意味着比较并不公平。**\n\n要复现结果，请使用`checkpoints\u002Fvideo_llama_checkpoint_last.pth`，并启用OpenAI嵌入`--use_openai_embedding=True`。\u003Cbr>\n\n### 下载用于评估的数据集\n对于 **Llama-vid** 和 **MovieQA**：\u003Cbr>\n请从[这里](https:\u002F\u002Fopendatalab.com\u002FOpenDataLab\u002FMovieNet\u002Ftree\u002Fmain\u002Fraw)下载包含电影和标注的原始 MovieNet 数据。\u003Cbr>\n这些将是 Llama-vid 和 MovieQA 的源视频。\u003Cbr>\n#### 
如论文中所示并用于评估的过滤后标注：\n[Llama-vid](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Ftree\u002Fmain\u002Fdatasets\u002Fgoldfish_eval_datasets\u002Fllama_vid)\u003Cbr>\n[MovieQA](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Ftree\u002Fmain\u002Fdatasets\u002Fgoldfish_eval_datasets\u002Fmovie_qa)\u003Cbr>\n对于 **Moviechat**，在本工作中可用的视频仅为训练数据的 10%，这也是我们用于评估的数据，可在此处找到：[这里](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fdatasets\u002Fgoldfish_eval_datasets\u002Fmovie_chat\u002Favailable_movies_list.txt)。\u003Cbr>\n完整数据集可在此处找到：[这里](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEnxin\u002FMovieChat-1K_train\u002Ftree\u002Fmain)。\u003Cbr>\n对于 **TVQA-Long**：\u003Cbr>\n如果您想将 TVQA-Long 用于其他模型（如 llama-vid），则视频和标注均可在此处找到：[TVQA-Long](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FVision-CAIR\u002FTVQA-Long\u002Ftree\u002Fmain)。\u003Cbr>\n在 Goldfish 评估中，我们将使用来自原始 TVQA 数据集的分离片段。\u003Cbr>\n### 运行评估脚本\n``` bash \n# Llama-vid 评估\n# 在脚本中设置以下参数：\nvideos_path=\"视频路径\"\nsubtitle_path=\"字幕路径\"\nvideo_clips_saving_path=\"保存视频片段的路径\"\nannotation_file=\"标注文件路径\"\nmovienet_annotations_dir=\"MovieNet 标注目录路径\"\nNEIGHBOURS=3\nuse_openai_embedding=\"是否使用 OpenAI 嵌入\"\n# 然后运行脚本\nbash evaluation\u002FGoldfish_eval\u002Fmovies\u002Feval_model_summary_llama_vid.sh\n\n# MovieQA 评估\n# 同上，但将脚本中的参数设置为 MovieQA 的路径\nbash evaluation\u002FGoldfish_eval\u002Fmovies\u002Feval_model_summary_movie_qa.sh\n\n# MovieChat 评估\n# 在脚本中设置以下参数：\ndataset_path=\"电影文件夹路径\"\nannotation_json_folder=\"JSON 文件夹路径\"\n# 然后运行脚本\nbash evaluation\u002FGoldfish_eval\u002Fmovies\u002Feval_model_summary_movie_chat.sh\n```\n### TVQA-Long\n在 Goldfish 评估中，我们可以使用原始 TVQA 数据集中分离出的片段。\u003Cbr>\n请从[这里](https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fjielei\u002Ftvqa\u002Ftvqa_public_html\u002Fdownload_tvqa.html)下载原始 TVQA 视频及短片字幕。\u003Cbr>\nTVQA-Long 
的标注文件可在[这里](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Ftree\u002Fmain\u002Fdatasets\u002Fgoldfish_eval_datasets\u002Ftvqa\u002Ftvqa_val_edited.json)获取。\u003Cbr>\nTVQA 的 JSON 字幕文件可在[这里](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Ftree\u002Fmain\u002Fdatasets\u002Fgoldfish_eval_datasets\u002Ftvqa\u002Ftvqa_preprocessed_subtitles.json)获取。\u003Cbr>\n\n```bash \n# 在脚本中设置以下参数：\ntvqa_json_subtitles=\"TVQA JSON 字幕文件路径\"\ntvqa_clips_subtitles=\"TVQA 片段字幕路径\"\nvideos_frames=\"视频帧路径\"\ntvqa_long_annotation=\"TVQA-Long 标注文件路径\"\nNEIGHBOURS=3\nuse_openai_embedding=\"是否使用 OpenAI 嵌入\"\n# 然后运行脚本\nbash evaluation\u002FGoldfish_eval\u002Ftvqa_eval\u002Feval_model_summary.sh\n```\n\n随后使用 GPT-3.5 turbo 将预测结果与真实答案进行对比，并生成准确率和得分。\u003Cbr>\n在 evaluate_zeroshot.sh 中设置以下变量：\u003Cbr>\n```bash\nPRED=\"预测结果路径\"\nOUTPUT_DIR=\"输出目录路径\"\nAPI_KEY=\"OpenAI API 密钥\"\nNUM_TASKS=128\n```\n要评估开放式问题，请运行以下脚本：\u003Cbr>\n```bash\nbash GPT_evaluation\u002Fevaluate_zeroshot.sh\n```\n\n## 引用\n如果您在研究或应用中使用 MiniGPT4-Video 或 Goldfish，请使用以下 BibTeX 格式引用：\n```\n@misc{ataallah2024goldfishvisionlanguageunderstandingarbitrarily,\n      title={Goldfish: Vision-Language Understanding of Arbitrarily Long Videos}, \n      author={Kirolos Ataallah and Xiaoqian Shen and Eslam Abdelrahman and Essam Sleiman and Mingchen Zhuge and Jian Ding and Deyao Zhu and Jürgen Schmidhuber and Mohamed Elhoseiny},\n      year={2024},\n      eprint={2407.12679},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.12679}, \n}\n@article{ataallah2024minigpt4,\n  title={MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens},\n  author={Ataallah, Kirolos and Shen, Xiaoqian and Abdelrahman, Eslam and Sleiman, Essam and Zhu, Deyao and Ding, Jian and Elhoseiny, Mohamed},\n  journal={arXiv preprint arXiv:2404.03413},\n  year={2024}\n}\n\n```\n\n## 
致谢\n[MiniGPT4](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4)\u003Cbr>\n[Video-ChatGPT](https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-ChatGPT)\n\n## 许可证\n本仓库采用 [BSD 3-Clause 许可证](LICENSE.md)。许多代码基于 [MiniGPT4](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4)。","# MiniGPT4-video 快速上手指南\n\nMiniGPT4-video 是一个先进的多模态大语言模型，专为视频理解设计。本仓库包含两个核心模型：\n*   **MiniGPT4-Video**：擅长处理短视频理解，通过交错视觉 - 文本令牌提升性能。\n*   **Goldfish**：专为任意长度长视频设计，通过高效检索机制解决长视频中的噪声冗余及显存计算挑战。\n\n## 1. 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n*   **操作系统**：Linux (推荐 Ubuntu 18.04+)\n*   **GPU**：NVIDIA GPU (建议显存 16GB 以上，具体取决于模型大小)\n*   **软件依赖**：\n    *   Python 3.8+\n    *   Conda (用于环境管理)\n    *   CUDA Toolkit (与您的显卡驱动匹配)\n    *   Git\n\n## 2. 安装步骤\n\n### 2.1 克隆代码库\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT4-video.git\ncd MiniGPT4-video\n```\n\n### 2.2 创建并激活虚拟环境\n使用提供的 `environment.yml` 文件创建 Conda 环境：\n```bash\nconda env create -f environment.yml\nconda activate minigpt4-video\n```\n*(注：如果下载依赖较慢，可尝试配置国内镜像源，如清华源或阿里源)*\n\n### 2.3 下载模型检查点\n根据需求选择下载 Llama2 或 Mistral 版本的预训练权重：\n\n| 模型版本 | 下载链接 |\n| :--- | :--- |\n| **MiniGPT4-Video (Llama2 Chat 7B)** | [Download](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_llama_checkpoint_last.pth) |\n| **MiniGPT4-Video (Mistral 7B)** | [Download](https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_mistral_checkpoint_last.pth) |\n\n下载后，请记下检查点文件的路径（例如：`.\u002Fcheckpoints\u002Fvideo_llama_checkpoint_last.pth`）。\n\n## 3. 
基本使用\n\n本部分展示如何运行推理脚本对单个视频进行问答。\n\n### 3.1 设置环境变量 (仅 Goldfish\u002F长视频推荐)\n如果您使用 **Goldfish** 处理长视频，为了获得最佳检索性能，建议配置 OpenAI API Key 以使用高质量 Embedding。如果不配置，模型将使用默认 Embedding。\n```bash\nexport OPENAI_API_KEY=\"your_openai_key\"\n```\n\n### 3.2 运行推理示例\n\n请根据您的模型类型（Llama2 或 Mistral）和应用场景（短视频 MiniGPT4-Video 或 长视频 Goldfish）选择以下命令之一。\n\n#### 场景 A：短视频理解 (MiniGPT4-Video)\n适用于几分钟内的短视频分析。\n\n**使用 Llama2 版本：**\n```bash\npython minigpt4_video_inference.py --ckpt path_to_llama2_checkpoint --cfg-path test_configs\u002Fllama2_test_config.yaml --video_path path_to_video --question \"Your question here\"\n```\n\n**使用 Mistral 版本：**\n```bash\npython minigpt4_video_inference.py --ckpt path_to_mistral_checkpoint --cfg-path test_configs\u002Fmistral_test_config.yaml --video_path path_to_video --question \"Your question here\"\n```\n\n#### 场景 B：长视频理解 (Goldfish)\n适用于电影、电视剧集等长视频内容分析。\n\n**使用 Llama2 版本：**\n```bash\npython goldfish_inference.py --ckpt path_to_llama2_checkpoint --cfg-path test_configs\u002Fllama2_test_config.yaml --video_path path_to_video --question \"Your question here\"\n```\n\n**使用 Mistral 版本：**\n```bash\npython goldfish_inference.py --ckpt path_to_mistral_checkpoint --cfg-path test_configs\u002Fmistral_test_config.yaml --video_path path_to_video --question \"Your question here\"\n```\n\n> **参数说明：**\n> *   `--ckpt`: 替换为您下载的 `.pth` 检查点文件的实际路径。\n> *   `--video_path`: 替换为您本地视频文件的实际路径。\n> *   `--question`: 替换为您想要询问关于视频的具体问题。\n> *   `--cfg-path`: 配置文件路径，通常无需修改，保持默认即可。\n\n### 3.3 启动交互式 Demo (可选)\n如果您希望启动一个本地的交互式界面进行测试，可以使用以下命令：\n\n**MiniGPT4-Video Demo:**\n```bash\n# Llama2\npython minigpt4_video_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs\u002Fllama2_test_config.yaml\n```\n\n**Goldfish Demo:**\n```bash\n# Llama2 (如需开启 OpenAI Embedding 优化，请确保已 export OPENAI_API_KEY)\npython goldfish_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs\u002Fllama2_test_config.yaml\n```","某视频内容审核团队需要每日处理数千条用户上传的短视频，快速识别其中是否包含违规动作或特定危险行为。\n\n### 没有 MiniGPT4-video 时\n- 
**理解碎片化**：传统模型只能识别单帧画面，无法连贯理解“先拿起瓶子再泼洒”这类跨帧的动态因果关系，导致大量误判。\n- **描述能力弱**：模型仅能输出简单的标签（如“人”、“车”），无法生成详细的自然语言描述，审核员仍需人工逐帧回看确认细节。\n- **交互成本高**：无法通过自然语言提问（例如“视频中有人摔倒吗？”），必须依赖预先训练好的固定分类器，灵活性极差。\n- **长上下文丢失**：面对稍长的视频片段，模型容易遗忘开头的关键信息，难以回答涉及视频整体逻辑的复杂问题。\n\n### 使用 MiniGPT4-video 后\n- **动态逻辑精准捕捉**：MiniGPT4-video 利用交错视觉 - 文本令牌技术，能完整理解视频中的时间序列动作，准确判断复杂的违规行为链条。\n- **生成式详细报告**：工具可直接生成流畅的视频内容摘要，详细描述人物动作与环境变化，大幅减少人工复核的时间成本。\n- **自然语言自由问答**：审核员可以直接输入“是否有人员未佩戴安全帽？”等具体问题，MiniGPT4-video 能即时定位并回答，无需重新训练模型。\n- **全片记忆无遗漏**：即使在较短的视频理解任务中，该工具也能保持对全程内容的敏锐感知，确保关键线索不被遗漏，准确率显著提升。\n\nMiniGPT4-video 通过将视频转化为可对话的多模态数据，让机器真正具备了“看懂”视频逻辑并像人类一样交流的能力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVision-CAIR_MiniGPT4-video_ddb77f93.jpg","Vision-CAIR","Vision CAIR Research Group, KAUST","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FVision-CAIR_5d0a2f15.png","Vision CAIR Group, KAUST, supported by Mohamed Elhoseiny",null,"https:\u002F\u002Fcemse.kaust.edu.sa\u002Fvision-cair","https:\u002F\u002Fgithub.com\u002FVision-CAIR",[84,88,92],{"name":85,"color":86,"percentage":87},"Python","#3572A5",97,{"name":89,"color":90,"percentage":91},"Shell","#89e051",2.9,{"name":93,"color":94,"percentage":95},"Dockerfile","#384d54",0.1,639,71,"2026-04-08T08:59:35","BSD-3-Clause",4,"Linux","必需 NVIDIA GPU。训练脚本使用 torchrun 并支持多卡 (--nproc-per-node)，具体显存需求未说明，但运行 Llama2 7B\u002FMistral 7B 多模态模型通常建议 16GB-24GB+ 显存。CUDA 版本未明确说明，需匹配 PyTorch 版本。","未说明 (建议 32GB+ 以处理视频数据)",{"notes":105,"python":106,"dependencies":107},"1. 必须通过 'conda env create -f environment.yml' 创建环境，具体依赖版本需查看该文件。2. 项目包含两个模型：MiniGPT4-Video（短视频）和 Goldfish（长视频）。3. Goldfish 长视频理解功能推荐配置 OPENAI_API_KEY 环境变量以使用 OpenAI Embedding 获得最佳性能，否则使用默认嵌入。4. 需手动下载预训练检查点（Llama2 或 Mistral 版本）并配置路径。5. 
训练分为图像文本预训练和视频字幕预训练等阶段，需自行准备 LAION、Webvid 等数据集并配置路径。","未说明 (通过 environment.yml 创建 conda 环境，通常对应 Python 3.8 或 3.9)",[108,109,110,111,112,113,114,115],"torch","transformers","accelerate","opencv-python","decord","timm","sentencepiece","gradio",[35,54,15],[118,119,120,121],"video-question-answering","video-understanding","long-video-understanding","video-retrieval","2026-03-27T02:49:30.150509","2026-04-13T23:54:43.533586",[125,130,135,140,145],{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},32114,"为什么我复现的 Mistral-7b 模型评估结果低于论文中报告的结果？","为了获得与论文一致或更好的结果，请确保使用 README 文件中提到的最佳检查点（best checkpoint）：\n1. Llama2 检查点：https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_llama_checkpoint_best.pth\n2. Mistral 检查点：https:\u002F\u002Fhuggingface.co\u002FVision-CAIR\u002FMiniGPT4-Video\u002Fblob\u002Fmain\u002Fcheckpoints\u002Fvideo_mistral_checkpoint_best.pth\n\n维护者已修复了之前的 URL 错误。此外，评估结果可能因 ChatGPT 的随机性有约 1% 的波动，这属于正常范围。","https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT4-video\u002Fissues\u002F18",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},32115,"是否尝试过使用 TinyLlama 或 Phi-2 等小型语言模型？","不建议对视频任务使用 TinyLlama 或 Phi-2，因为它们的上下文窗口仅为 2048，这会限制可处理的视频帧数（仅能采样约 22 帧），导致丢失大量视频信息。\n\n项目测试过的配置如下：\n- Llama 2（上下文窗口 4096）：可接受 45 帧。\n- Mistral（上下文窗口 8192）：可接受 90 帧。\n\n如果您想更换 LLM，建议选择具有更大上下文窗口的模型以容纳更多帧，但需注意训练时的显存限制。","https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT4-video\u002Fissues\u002F4",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},32116,"运行 minigpt4_video_inference.py 时只下载了 YouTube 视频，没有进行其他操作怎么办？","这是一个已知问题，该文件的内容之前被错误地替换成了其他内容。维护者已经修复了此问题。请拉取最新的代码仓库更新，然后重新运行该脚本即可正常工作。","https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT4-video\u002Fissues\u002F1",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},32117,"如何为 eval_goldfish_movie_chat.py 生成所需的 .h5 文件？","代码中的 load_frames 函数期望每个视频文件对应一个 .h5 文件。关于这些 .h5 文件的具体生成方法，请参考 MovieChat 
论文中的相关描述和处理流程。","https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT4-video\u002Fissues\u002F43",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},32118,"运行时遇到 'TypeError: forward() got an unexpected keyword argument cache_position' 错误如何解决？","该错误通常是由于 transformers 库版本不兼容导致的。`cache_position` 参数是在较新版本的 transformers 中引入的，而当前的 PEFT 或模型实现可能尚未完全适配，或者您的环境混合了不兼容的版本。\n\n建议尝试以下解决方案：\n1. 升级 transformers 库到最新版本：`pip install --upgrade transformers`\n2. 确保 peft 和 accelerate 库也是最新版本。\n3. 如果问题依旧，检查项目 requirements.txt 中指定的特定版本并严格匹配安装。","https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT4-video\u002Fissues\u002F8",[]]