[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-rese1f--MovieChat":3,"tool-rese1f--MovieChat":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":81,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":10,"env_os":98,"env_gpu":99,"env_ram":98,"env_deps":100,"category_tags":103,"github_topics":104,"view_count":111,"oss_zip_url":112,"oss_zip_packed_at":112,"status":16,"created_at":113,"updated_at":114,"faqs":115,"releases":146},634,"rese1f\u002FMovieChat","MovieChat","[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding","MovieChat 是一款面向长视频理解的开源 AI 模型，旨在突破传统视频分析技术在处理超长内容时的性能瓶颈。以往的方法往往因显存消耗过大，难以在普通硬件上分析数千帧以上的视频，而 MovieChat 通过创新的“从密集令牌到稀疏记忆”架构，成功实现了在 24GB 显存显卡上流畅处理超过一万帧的视频内容。\n\n这种高效的设计大幅降低了每帧处理的显存开销，使其在 ActivityNet-QA、EgoSchema 等多个权威基准测试中表现卓越。MovieChat 不仅支持视频问答和事件定位，还具备优秀的上下文理解能力。它特别适合计算机视觉领域的研究人员、算法工程师以及希望集成长视频智能分析功能的开发者。借助 MovieChat，团队可以更经济、高效地构建下一代视频理解应用，深入挖掘长视频中的丰富信息。","\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_f120e0b01624.png\" height=\"120px\" align=\"left\">\n\n# 
MovieChat\n\n[![](http:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcs.CV-arXiv%3A2307.16449-B31B1B.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16449v4)\n[![](http:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcs.CV-arXiv%3A2404.17176-B31B1B.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.17176)\n\n> **MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**  \n> Enxin Song*, Wenhao Chai*, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang✉️   \n> _CVPR 2024._\n\n\n\u003Cimg width=\"1155\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_ce50cd586ff6.png\">\n\nMovieChat can handle videos with >10K frames on a 24GB graphics card. In terms of the average increase in GPU memory cost per frame, MovieChat holds a roughly 10000× advantage over other methods (21.3 KB\u002Fframe vs. ~200 MB\u002Fframe).\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_222ab8107fac.gif\" alt=\"MovieChat\" style=\"width: 80%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Ch5 align=\"center\"> If you like our project, please give us a star ⭐ on GitHub for the latest update.\u003C\u002Fh5>\n\n## 🔢 MovieChat-1K leaderboard\n\nFeel free to PR your new results!\n\n| Model with Link | Comment | Breakpoint Acc | Global Acc |\n|-----------------------------------------------|------------------------------|------------|----------------|\n| [Video-LLaMA](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.02858)            | End-to-end                  | 39.1 | 51.7 |\n| [VideoChat](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06355)              | End-to-end                  | 46.1 | 57.8 |\n| [TimeChat](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.11333)               | CoT, ICL, train on MovieChat| 46.1 | 73.8 |\n| 
[VideoChatGPT](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05424)           | End-to-end                  | 48.0 | 47.6 |\n| [MovieChat](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16449v4) (baseline) | End-to-end                  | 48.3 | 62.3 |\n| [MovieChat+](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.17176) (baseline)  | End-to-end                  | 49.6 | 71.2 |\n| [Long-LLaVA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.13093)             | End-to-end                  | 54.0 | 69.6 |\n| [Long-LLaVA + Video-RAG](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.13093) | End-to-end                  | 54.5 | 72.9 |\n| [Streaming Long Video](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16009)   | Train on MovieChat          | 54.9 | 90.4 |\n| [DrVideo](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.12846)                | RAG                         | 56.7 | 93.1 |\n| [ReWind](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.15556)                 | End-to-end                  | 57.2 | 87.6 |\n| [HERMES](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.17443)                 | Train on MovieChat          | 57.3 | 78.6 |\n| [Flash-VStream](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.08085)          | Train on MovieChat          | 59.6 | 96.0 |\n| [MM-Screenplayer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.17309)        | RAG                         | 68.8 | 87.5 |\n| [VILA1.5-8B](https:\u002F\u002Fopenreview.net\u002Fpdf?id=oS79Tw3G0c)     | End-to-end                  |  -   | 40.0 |\n| [FocusChat](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.12833)              | End-to-end                  |  -   | 60.0 |\n| [llavaonevision-MovieChat](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat) | End-to-end             | -    | 79.0 |\n| [Sullam Jeoung, _et al._](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.20252) | Agent                       | -    | 84.8 |\n| [SEAL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.01798)                   | 
Train on MovieChat          | -    | 86.8 |\n| [HEM-LLM](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.06299)                | Unknown training dataset    | -    | 90.6 |\n\n\n## 🔢 Evaluation of MovieChat on Existing Benchmarks\n\nSorted in alphabetical order.\n\n| Benchmark | Results |\n|-----------|---------|\n| ActivityNet-QA | Acc. \u002F Score: 45.7 \u002F 3.4 |\n| Charades-STA | R@1(IoU=0.3): 8.8 • R@1(IoU=0.5): 2.9 • R@1(IoU=0.7): 1.3 |\n| CineClipQA | Overall: 20.86\u002F2.11 • Description: 23.67\u002F2.41 • Intention: 30.19\u002F2.41 • Perception: 21.80\u002F1.97 • Temporality: 16.32\u002F1.97 • Spaciality: 16.40\u002F1.98 |\n| CVRR-ES | Average: 16.41 |\n| EgoSchema | Top-1 Acc: 53.5 |\n| EventBench | Acc: 20.33 |\n| InfiniBench | Global Appearance: 6.59 • Scene transition: 6.41 • Character actions: 4.51 • Temporal order: 36.99 • Local visual: 17.76 • Summarization: 0.14 • Deep context: 0.55 • Spoiler questions: 0.34 • Multiple events: 0.85 • Avg: 14.45\u002F0.47 |\n| InfiniBench-Vision | Acc: 14.2 • Score: 1.2 |\n| LvBench | ER: 21.3 • EU: 23.1 • KIR: 25.9 • TG: 22.3 • Rea: 24.0 • Sum: 17.2 • Overall: 22.5 |\n| LvM-QA | Acc. \u002F Score: 48.3 \u002F 2.57 |\n| MLVU | Holistic TR: 29.5 • AR: 25.0 • VS: 2.33 • Single Detail NQA: 24.2 • ER: 24.7 • PQA: 25.8 • SSC: 3.23 • Multi Detail AO: 28.6 • AC: 22.8 • M-Avg: 25.8 • G-Avg: 2.78 |\n| MovieChat-1K | Global Acc. \u002F Score: 62.3 \u002F 3.23 • Breakpoint Acc. \u002F Score: 48.3 \u002F 2.57 |\n| MovieCORE | Acc: 20.33 • Comp: 2.90 • Depth: 2.29 • Evid: 2.14 • Coh: 2.30 • Avg: 2.23 |\n| MSVD-QA | Acc. \u002F Score: 75.2 \u002F 3.8 |\n| MSRVTT-QA | Acc. \u002F Score: 52.7 \u002F 2.6 |\n| NExT-QA | Acc. \u002F Score: 49.9 \u002F 2.7 |\n| QVHighlight | mAP: 11.7 • HIT@1: 16.1 |\n| RVS-Ego | Acc. \u002F Score: 50.7 \u002F 3.4 |\n| RVS-Movie | Acc. 
\u002F Score: 36.0 \u002F 2.3 |\n| Seed-Bench | Procedure Understanding: 29.82 • Action Recognition: 40.11 |\n| SFD | Multiple-Choice V: 8.4 • L: 16.4 • VL: 8.0 • Open-Ended V: 14.0 • L: 15.7 • VL: 11.8 |\n| SVBench | Dialogue SA: 20.46 • Dialogue CC: 20.05 • Dialogue LC: 27.76 • Dialogue TU: 21.81 • Dialogue IC: 22.21 • Dialogue OS: 21.89 • Streaming SA: 17.99 • Streaming CC: 16.42 • Streaming LC: 20.37 • Streaming TU: 15.77 • Streaming IC: 19.08 • Streaming OS: 17.43 |\n| TV-Caption | BertScore: 38.11 • CIDER: 8.43 • ROUGE-L: 12.09 • SPICE: 9.21 |\n| VCG Bench | CI: 2.76 • DO: 2.93 • CU: 3.01 • TU: 2.24 • CO: 2.42 • Avg: 2.67 |\n| VDC | Camera: 37.25\u002F1.98 • Short: 32.55\u002F1.59 • Background: 28.99\u002F1.54 • Main: 31.97\u002F1.64 • Object: 28.82\u002F1.46 • Avg: 31.92\u002F1.64 |\n| VideoMME | w\u002Fo subs: 38.2 • w\u002Fo subs (Long): 33.4 |\n| Video-ChatGPT | Avg: 2.67 • CI: 2.76 • DO: 2.93 • CU: 3.01 • TU: 2.24 • CO: 2.42 |\n| VS-Ego | Acc. \u002F Score: 52.2 \u002F 3.4 |\n| VS-Movie | Acc. \u002F Score: 39.1 \u002F 2.3 |\n| YouCook2 | C: 38.5 • M: 18.8 |\n\n\n## :fire: News\n* **[2024.10.26]** :keyboard: We upload MovieChat, MovieChat_OneVision, and MovieChat-1K to [lmms-eval](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval).\n* **[2024.10.26]** :keyboard: We release a new version of MovieChat, which uses LLaVA-OneVision as the base model instead of the original VideoLLaMA. The new version is available at [MovieChat_Onevision](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat\u002Ftree\u002Fmain\u002FMovieChat_Onevision).\n* **[2024.6.13]** :film_projector: We release the ground truth of MovieChat's test set on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEnxin\u002FMovieChat-1K-test). \n* **[2024.5.10]** :film_projector: We release the raw videos of MovieChat's training set on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEnxin\u002FMovieChat-1K_train). 
\n* **[2024.4.29]** :page_with_curl: We update the MovieChat+ [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.17176) with implementation details, technical evaluations, and dataset information.\n* **[2024.4.25]** :keyboard: We release a new version, MovieChat+. We release the [MovieChat+ code](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat\u002Fblob\u002Fmain\u002FMovieChat\u002Fmodels\u002Fmoviechat%2B.py) and the corresponding [evaluation code](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat\u002Fblob\u002Fmain\u002Feval_code\u002Fresult_prepare\u002Frun_inference_qa_moviechat%2B.py). Our paper is coming soon!\n* **[2024.4.19]** :keyboard: We publish the latest source code of MovieChat to [PyPI](https:\u002F\u002Fpypi.org\u002F). Now you can install MovieChat via `pip install MovieChat` directly!\n* **[2024.3.25]** :bar_chart: We host challenge track 1 of [the 4th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot](https:\u002F\u002Fcvpr.thecvf.com\u002FConferences\u002F2024\u002Fworkshop-list) at CVPR 2024. You can participate in the challenge and submit your results via [Codalab](https:\u002F\u002Fcodalab.lisn.upsaclay.fr\u002Fcompetitions\u002F18284?secret_key=bd5e312c-4775-43cf-933b-70726d00bcbe). We will display the results on the [leaderboard](https:\u002F\u002Fespere-1119-song.github.io\u002FLOVEU-CVPR-24-Track-1-Leaderboard\u002F). For each participant, please submit your results in JSON format and report both the average running time and VRAM usage. We will use these metrics to select the most efficient method. For detailed information about the challenge, please refer to this [link](https:\u002F\u002Fsites.google.com\u002Fview\u002Floveucvpr24\u002Ftrack1).\n* **[2024.3.11]** :film_projector: We release the test set of MovieChat-1K on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEnxin\u002FMovieChat-1K-test). 
Each video contains 3 global questions and 10 breakpoint questions.\n* **[2024.2.27]** :tada: Our paper was accepted by CVPR 2024!\n* **[2024.2.14]** :film_projector: We release the training set of MovieChat-1K on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEnxin\u002FMovieChat-1K_train). Due to copyright restrictions, we share the clip features extracted by [eva_vit_g](https:\u002F\u002Fstorage.googleapis.com\u002Fsfr-vision-language-research\u002FLAVIS\u002Fmodels\u002FBLIP2\u002Feva_vit_g.pth), covering 8192 frames of each video.\n* **[2023.11.27]** :page_with_curl: We update the [paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.16449v2.pdf) with implementation details, technical evaluations, and dataset information.\n* **[2023.11.23]** :keyboard: We update the latest source code of MovieChat.\n* **[2023.8.1]** :page_with_curl: We release the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16449).\n* **[2023.7.31]** :keyboard: We release evaluation [code and instructions](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat\u002Ftree\u002Fmain\u002Feval_code) for short video QA on **MSVD-QA**, **MSRVTT-QA** and **ActivityNet-QA**.\n* **[2023.7.29]** :joystick: We release the [Gradio demo](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat\u002Ftree\u002Fmain\u002FGradio_demo) of MovieChat.\n* **[2023.7.22]** :keyboard: We release the source code of MovieChat.\n  
\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fmoviechat-from-dense-token-to-sparse-memory\u002Fzeroshot-video-question-answer-on-activitynet)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-activitynet?p=moviechat-from-dense-token-to-sparse-memory)\\\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fmoviechat-from-dense-token-to-sparse-memory\u002Fzeroshot-video-question-answer-on-msrvtt-qa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-msrvtt-qa?p=moviechat-from-dense-token-to-sparse-memory)\\\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fmoviechat-from-dense-token-to-sparse-memory\u002Fzeroshot-video-question-answer-on-msvd-qa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-msvd-qa?p=moviechat-from-dense-token-to-sparse-memory)\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fmoviechat-from-dense-token-to-sparse-memory\u002Fzero-shot-long-video-global-mode-question)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzero-shot-long-video-global-mode-question?p=moviechat-from-dense-token-to-sparse-memory)\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fmoviechat-from-dense-token-to-sparse-memory\u002Fzero-shot-long-video-breakpoint-mode-question)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzero-shot-long-video-breakpoint-mode-question?p=moviechat-from-dense-token-to-sparse-memory)\n\n## 📊Performance Comparison on MovieChat-1K\n| **Method**         | **Text Decoder**   | **# Frames** | **Global Mode Acc.** | **Global Mode Sco.** 
|\n|--------------------|--------------------|--------------|----------------------|----------------------|\n| GIT                | non-LLM based      | 6            | 28.8                 | 1.83                 |\n| mPLUG-2            | non-LLM based      | 8            | 31.7                 | 2.13                 |\n| **Video Chat**     | LLM based          | 32           | 57.8                 | 3.00                 |\n| **Video LLaMA**    | LLM based          | 32           | 51.7                 | 2.67                 |\n| **Video-ChatGPT**  | LLM based          | 100          | 47.6                 | 2.55                 |\n| **MovieChat**      | LLM based          | 2048         | 62.3                 | 3.23                 |\n| **MovieChat+**     | LLM based          | 2048         | 71.2                 | 3.51                 |\n| **MovieChat-Onevision**  | LLM based    | 2048         | **79.0**             | **4.20**             |\n\n## ✨How to run MovieChat quickly?\n\nWe have packaged MovieChat and uploaded it to PyPI. To run MovieChat quickly, install it first:\n```\npip install MovieChat\n```\nWe recommend version `0.6.3` for now. Since `MovieChat` automatically downloads checkpoints from Hugging Face, if your server cannot `git clone` from a Hugging Face URL, we recommend downloading the checkpoints manually and updating the corresponding paths in the package, including [q_former_model](https:\u002F\u002Fstorage.googleapis.com\u002Fsfr-vision-language-research\u002FLAVIS\u002Fmodels\u002FBLIP2\u002Fblip2_pretrained_flant5xxl.pth), [ckpt_path](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVideo-LLaMA-Series\u002Fresolve\u002Fmain\u002Ffinetune-vicuna7b-v2.pth?download=true), and [llama_model](https:\u002F\u002Fhuggingface.co\u002FEnxin\u002FMovieChat-vicuna). \n\nBefore running the following inference code, verify that `ffprobe` is installed via `ffprobe -version`. 
This command should return the version of ffprobe if it is correctly installed. Otherwise, you should install it via `sudo apt-get install ffmpeg` (Ubuntu).\n\n```\nfrom PIL import Image\nimport cv2\n\nfrom MovieChat.processors.video_processor import AlproVideoEvalProcessor\nfrom MovieChat.models.chat_model import Chat\nfrom MovieChat.models.moviechat import MovieChat\n\ndevice = 'cuda:0'\nprint('Initializing Chat')\nmoviechat_model = MovieChat.from_config(device=device).to(device)\nvis_processor_cfg = {'name': 'alpro_video_eval', 'n_frms': 8, 'image_size': 224}\nframe_processor = AlproVideoEvalProcessor.from_config(vis_processor_cfg)\nchat = Chat(moviechat_model, frame_processor, device=device)\nprint('Initialization Finished')\n\nvideo_path = \"Your video path, end with mp4\"\nfragment_video_path = \"The path to store tmp video clips\"\nmiddle_video = False # True->Breakpoint mode, False->Global mode\nquestion = \"Your Question\"\ncur_min = 0 # Change it when Breakpoint mode\ncur_sec = 0 # Change it when Breakpoint mode\n\ncap = cv2.VideoCapture(video_path)\ncur_fps = cap.get(cv2.CAP_PROP_FPS)\ncap.set(cv2.CAP_PROP_POS_FRAMES, cur_fps)\nret, frame = cap.read()\nrgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)\npil_image = Image.fromarray(rgb_frame)\nimage = chat.image_vis_processor(pil_image).unsqueeze(0).unsqueeze(2).half().to(device)\ncur_image = chat.model.encode_image(image)\n\nimg_list = []\nmsg = chat.upload_video_without_audio(\n    video_path=video_path, \n    fragment_video_path=fragment_video_path,\n    cur_min=cur_min, \n    cur_sec=cur_sec, \n    cur_image=cur_image, \n    img_list=img_list, \n    middle_video=middle_video,\n    question=question\n)\nanswer = chat.answer(\n    img_list=img_list,\n    input_text=question,\n    msg = msg,\n    num_beams=1,\n    temperature=1.0,\n    max_new_tokens=300,\n    max_length=2000)[0]\n\nprint(answer)\n```\n\nNote that if you receive a RuntimeError like `\"Error reading \u003Cfilename.mp4>\"`, one solution is 
to initialize `\u003Cfilename.mp4>` with any other video file.\n\n## 💡 Overview\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_55631525c7af.png)\n\n## 📣 Demo Video\n\n[![Alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_dfe02cf159e4.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fembed\u002FDx5BQmgK4n8?si=FN9pLyQBN--vJBZA)\n\n## ⚡ Comparison Case\n\n\u003Cdiv style=\"color:orange; border-bottom: 1px solid #d9d9d9;\n    display: inline-block;\n    color: #999;\n    padding: 2px;\"> Question and answer about a clip from YouTube, which is a tutorial on how to cook steak. The entire instructional process begins with marinating the steak, followed by pan-searing it, preparing side dishes, and ultimately plating the meal. Green ( Red ) highlights the correct (wrong) answer and yellow indicates that the model is hallucinating.\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_800e86bc801e.png\"  style=\"width: 80%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n## 😍 Examples\n\n\u003Cdiv style=\"color:orange; border-bottom: 1px solid #d9d9d9;\n    display: inline-block;\n    color: #999;\n    padding: 2px;\"> Question and answer about clips from Zootopia, a cartoon, which tells the story of a determined police officer rabbit named Judy\nwho pairs up with a cunning fox to uncover a conspiracy about missing animals and develop an unexpected friendship.\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_9c77b07c7a07.png\"  style=\"width: 80%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\n\u003Cdiv style=\"color:orange; border-bottom: 1px solid #d9d9d9;\n    display: 
inline-block;\n    color: #999;\n    padding: 2px;\"> Question and answer about clips from Goblin, which tells the story of Kim Shin, an immortal \"goblin\" who needs to find a human bride to end his endless life but instead meets Ji Eun-tak, a girl fated to die who claims to be the \"goblin's bride,\" leading to a romantic tale unfolding between them.\n\u003C\u002Fdiv>\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_4879e63ea188.png\" style=\"width: 80%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cdiv style=\"color:orange; border-bottom: 1px solid #d9d9d9;\n    display: inline-block;\n    color: #999;\n    padding: 2px;\">  Question and answer about clips from Game of Thrones, which tells the epic fantasy tale of power struggles and political intrigue among the Seven Kingdoms, entwined with intricate family relationships, all set against the backdrop of an ancient, mystical threat.\n\u003C\u002Fdiv>\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_68193a4b18ad.png\" style=\"width: 80%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cdiv style=\"color:orange; border-bottom: 1px solid #d9d9d9;\n    display: inline-block;\n    color: #999;\n    padding: 2px;\"> Question and answer about clips from YouTube, which contains a compilation of some inspirational movie scenes. 
This video clip comprises several segments from The Death Crawl, Coach Carter, Rocky Balboa, and We Are Marshall, which vary in duration.\n\u003C\u002Fdiv>\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_b5fba0307272.png\" style=\"width: 80%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n## 🚀 Benchmark: MovieChat-1K \n\nTo better evaluate the performance of MovieChat, we collect a new benchmark for long video understanding tasks, MovieChat-1K, which contains 1K high-quality video clips sourced from various movies and TV series with 14K manual annotations.\n\nTo the best of our knowledge, a long video understanding dataset of this kind had not yet been established; our work represents the initial step in creating one and making it publicly available. We create MovieChat-1K, containing 1K long videos, 1K corresponding dense captions, and 13K visual question-answer pairs. For each video, we manually provide 1 dense caption for the whole video, 3 question-answer pairs for global mode, and 10 question-answer pairs with timestamps for breakpoint mode. \n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_b69b077530d0.png\" style=\"width: 100%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\nWe collect videos from 15 popular categories with varying distribution, including documentary film, detective film, animation film, and so on. Each video comprises multiple alternating scenes, contributing to a diverse and dynamic visual narrative within the collection. Over 90% of the videos exhibit a duration ranging from 10K to 12K frames, while 14.6% of videos extend beyond 12K frames. 
Only 8.6% of videos are shorter than 10K frames.\n\n\n### Question-answering Pairs\n\n#### Word Distribution\nSince MovieChat-1K is specifically designed for long video comprehension tasks, the majority of questions are open-ended, with only a quarter classified as multiple-choice questions, marked by initiators such as ‘Do,’ ‘Does,’ ‘Is,’ or ‘Are.’ We also compute the word distributions of our provided question-answer pairs, which include common objects (people, clothes, etc.), time (day, night, etc.), scenes (indoor, outdoor, etc.), and so on.\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_2d0d0e7e46c0.png\" style=\"width: 40%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\n#### Sentence length distribution\nMovieChat-1K exhibits diverse lengths of question-answer pairs at the segmented-clip level. Although the distribution of question-answer pairs varies between the global mode and the breakpoint mode, most questions are 5-15 words long, while answers generally have fewer than 10 words.\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_654fe19a010b.png\" style=\"width: 70%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\n### Dense Captions\n\nTo facilitate a more detailed understanding of long videos, we provide a dense caption for each video. MovieChat-1K exhibits diverse caption lengths at the segmented-clip level. Approximately two-thirds of the clips have captions with 100-149 words, while one-fifth of the clip captions have fewer than 100 words. 
About 11% of clips have long captions with more than 150 words.\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_c17255d82f66.png\" style=\"width: 40%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\nWe also compute the word distribution of our generated captions. The resulting distribution is presented in Fig. B6, which includes common objects (man, woman, people, girl, etc.), attributes (detective, various, small, white, etc.), locations (inside, behind, south, next, etc.), scenes (room, house, building, office, etc.), actions\u002Fevents (talk, enter, leave, take, etc.), and more.\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_77ab480bd8c4.png\" style=\"width: 45%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\nIn terms of actionness, the MovieChat-1K captions contain nearly the same number of verbs as the WebVid10M dataset. To evaluate this, we use the NLTK toolkit to analyze the number of verbs in captions, focusing on extracting and tagging all unique verbs. We find a total of 109,485 verbs in the WebVid10M caption dataset, while the MovieChat-1K captions contain 102,988 unique instances of verbs. While these counts may not be entirely accurate due to our simple counting method, we believe they provide a rough indication of the actionness of the two datasets.\n\n\u003C!-- ## Comparison between MovieChat-1K and other benchmarks\n\nMovieChat-1K provides a large-scale benchmark for long video understanding, which contains 1K movies, 1K dense captions and 13K question-answer pairs. The comparison between different datasets is shown in Tab. 
8. It is evident that MovieChat-1K provides the longest average duration for movie clips. MovieQA exclusively offers question-answer pairs related to movies, while MovieGraphs supplies captions associated with movies. Unlike the other datasets, MovieNet encompasses three main types of text: subtitle, synopsis, and script, but no question-answer pairs. Additionally, its synopses are written for entire movies rather than video clips. Consequently, MovieChat-1K is more suitable for studying long video comprehension than the other datasets.

<div align="center">
<table border="1" width="100%">
    <tr align="center">
        <th>Dataset</th><th>Avg. Duration (min)</th><th>Number of Captions</th><th>Avg. Caption Length</th><th>Number of Question-Answer Pairs</th><th>Avg. Question Length</th><th>Avg. Answer Length</th>
    </tr>
    <tr align="center">
        <td><a href="https://arxiv.org/abs/1512.02902">MovieQA</a></td><td>3.5</td><td>-</td><td>-</td><td>14.9K</td><td>9.3</td><td>5.1</td>
    </tr>
    <tr align="center">
        <td><a href="https://arxiv.org/abs/1712.06761">MovieGraphs</a></td><td>0.73</td><td>15K</td><td>35</td><td>-</td><td>-</td><td>-</td>
    </tr>
    <tr align="center">
        <td><a href="https://arxiv.org/abs/2007.10937">MovieNet</a></td><td>2.1</td><td>2.5K</td><td>-</td><td>-</td><td>-</td><td>-</td>
    </tr>
    <tr align="center">
        <td>MovieChat-1K</td><td>9.4</td><td>1K</td><td>121</td><td>13K</td><td>7.8</td><td>2.3</td>
    </tr>
</table>
</div> -->

🔐 &#x00A9; **Due to copyright concerns and the size limitations of the movies, we plan to release the features of the dataset. Please allow a few weeks.**

## 🛠️ Install

### Environment Preparation

First, create a conda environment:
```
conda env create -f environment.yml
conda activate moviechat
```

### Prerequisites

Before using the repository, make sure you have obtained the following checkpoints:

#### Pre-trained Language Decoder

- Get the original LLaMA weights in the Hugging Face format by following the instructions [here](https://huggingface.co/docs/transformers/main/model_doc/llama).
- Download the Vicuna delta weights :point_right: [[7B](https://huggingface.co/lmsys/vicuna-7b-delta-v0)] (Note: we use **v0 weights** instead of v1.1 weights).
- Use the following command to add the delta weights to the original LLaMA weights and obtain the Vicuna weights:

```
python apply_delta.py \
    --base ckpt/LLaMA/7B_hf \
    --target ckpt/Vicuna/7B \
    --delta ckpt/Vicuna/vicuna-7b-delta-v0
```

#### Pre-trained Visual Encoder for MovieChat
- Download the MiniGPT-4 model (trained linear layer) from this [link](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view).

#### Download Pretrained Weights

- Download the pretrained weights for running MovieChat locally with Vicuna-7B as the language decoder from this [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth).

## 🤖 How to Run Demo Locally

First, set `llama_model`, `llama_proj_model`, and `ckpt` in [eval_configs/MovieChat.yaml](./eval_configs/MovieChat.yaml).
Then run the script:
```
python inference.py \
    --cfg-path eval_configs/MovieChat.yaml \
    --gpu-id 0 \
    --num-beams 1 \
    --temperature 1.0 \
    --text-query "What is he doing?" \
    --video-path src/examples/Cooking_cake.mp4 \
    --fragment-video-path src/video_fragment/output.mp4 \
    --cur-min 1 \
    --cur-sec 1 \
    --middle-video 1
```
Note that if you want to use the global mode (understanding and question answering over the **whole** video), set `--middle-video` to 0.

<!-- ## 👍 Main Results
### Short video question-answering
We use several widely used open-ended datasets for short video question-answering tasks: MSVD-QA, MSRVTT-QA, and ActivityNet-QA. The evaluation is carried out with the assistance of an LLM using the default hyper-parameter settings. We report accuracy and relative scores on a scale of 0 to 5. 
Compared to previous methods, MovieChat achieves comparable performance even though it is not specifically designed for short video question-answering tasks.

<div align="center">
<table border="1" width="100%">
    <tr align="center">
        <th>Methods</th><th>LLM</th><th>Conversation</th><th>Detail Description</th><th>Complex Reasoning</th><th>All</th>
    </tr>
    <tr align="center">
        <td><a href="https://huggingface.co/Chat-UniVi/Chat-UniVi">Chat-UniVi-7B</a></td><td><a href="https://huggingface.co/lmsys/vicuna-7b-v1.5">Vicuna-7B</a></td><td><b>84.1</b></td><td>74.2</td><td>93.7</td><td>84.2</td>
    </tr>
    <tr align="center">
        <td><a href="https://huggingface.co/Chat-UniVi/Chat-UniVi-13B">Chat-UniVi-13B</a></td><td><a href="https://huggingface.co/lmsys/vicuna-13b-v1.5">Vicuna-13B</a></td><td><b>84.1</b></td><td><b>79.4</b></td><td><b>94.7</b></td><td><b>86.1</b></td>
    </tr>
</table>
</div> -->

## 🤝 Acknowledgement
We are grateful to the following awesome projects that MovieChat builds upon:
* [Video-LLaMA](https://github.com/DAMO-NLP-SG/Video-LLaMA): An Instruction-tuned Audio-Visual Language Model for Video Understanding
* [Token Merging](https://github.com/facebookresearch/ToMe): Your ViT but Faster
* [XMem](https://github.com/hkchengrex/XMem): Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
* [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4): Enhancing Vision-language Understanding with Advanced Large Language Models
* [FastChat](https://github.com/lm-sys/FastChat): An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
* [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
* [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale
* [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
* [VideoChat](https://github.com/OpenGVLab/Ask-Anything): Chat-Centric Video Understanding
* [LLaVA](https://github.com/haotian-liu/LLaVA): Large Language and Vision Assistant

## 🔒 Term of Use
Our MovieChat is a research preview intended for non-commercial use only. You must **NOT** use MovieChat for any illegal, harmful, violent, racist, or sexual purposes. You are strictly prohibited from engaging in any activity that may violate these guidelines.
## ✏️ Citation

If you find MovieChat useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{song2023moviechat,
  title={MovieChat: From Dense Token to Sparse Memory for Long Video Understanding},
  author={Song, Enxin and Chai, Wenhao and Wang, Guanhong and Zhang, Yucheng and Zhou, Haoyang and Wu, Feiyang and Guo, Xun and Ye, Tian and Lu, Yan and Hwang, Jenq-Neng and others},
  journal={arXiv preprint arXiv:2307.16449},
  year={2023}
}

@article{song2024moviechat+,
  title={MovieChat+: Question-aware Sparse Memory for Long Video Question Answering},
  author={Song, Enxin and Chai, Wenhao and Ye, Tian and Hwang, Jenq-Neng and Li, Xi and Wang, Gaoang},
  journal={arXiv preprint arXiv:2404.17176},
  year={2024}
}
```

<img src="https://oss.gittoolsai.com/images/rese1f_MovieChat_readme_f120e0b01624.png" height="120px" align="left">

# MovieChat

[![](http://img.shields.io/badge/cs.CV-arXiv%3A2307.16449-B31B1B.svg)](https://arxiv.org/abs/2307.16449v4)
[![](http://img.shields.io/badge/cs.CV-arXiv%3A2404.17176-B31B1B.svg)](https://arxiv.org/abs/2404.17176)

> **MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**  
> Enxin Song*, Wenhao Chai*, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang✉️  
> _CVPR 2024._

<img width="1155" alt="image" src="https://oss.gittoolsai.com/images/rese1f_MovieChat_readme_ce50cd586ff6.png">

MovieChat can handle videos with more than 10K frames on a 24GB graphics card. In terms of the average increase in GPU memory cost per frame (21.3KB vs. ~200MB per frame), MovieChat achieves a 10000× advantage over other methods.
<p align="center" width="100%">
<a target="_blank"><img src="https://oss.gittoolsai.com/images/rese1f_MovieChat_readme_222ab8107fac.gif" alt="MovieChat" style="width: 80%; min-width: 200px; display: block; margin: 
">
auto;"></a>
</p>

<h5 align="center">If you like our project, please give us a star ⭐ on GitHub for the latest updates.</h5>

## 🔢 MovieChat-1K Leaderboard

Welcome to submit your new results!

| Model & Links | Note | Breakpoint Accuracy | Global Accuracy |
|-----------------------------------------------|------------------------------|------------|----------------|
| [Video-LLaMA](https://arxiv.org/pdf/2306.02858)            | End-to-end                  | 39.1 | 51.7 |
| [VideoChat](https://arxiv.org/abs/2305.06355)              | End-to-end                  | 46.1 | 57.8 |
| [TimeChat](https://arxiv.org/pdf/2406.11333)               | CoT, ICL, trained on MovieChat | 46.1 | 73.8 |
| [VideoChatGPT](https://arxiv.org/pdf/2306.05424)           | End-to-end                  | 48.0 | 47.6 |
| [MovieChat](https://arxiv.org/abs/2307.16449v4) (baseline) | End-to-end                  | 48.3 | 62.3 |
| [MovieChat+](https://arxiv.org/abs/2404.17176) (baseline)  | End-to-end                  | 49.6 | 71.2 |
| [Long-LLaVA](https://arxiv.org/abs/2411.13093)             | End-to-end                  | 54.0 | 69.6 |
| [Long-LLaVA + Video-RAG](https://arxiv.org/abs/2411.13093) | End-to-end                  | 54.5 | 72.9 |
| [Streaming Long Video](https://arxiv.org/abs/2405.16009)   | Trained on MovieChat          | 54.9 | 90.4 |
| [DrVideo](https://arxiv.org/pdf/2406.12846)                | RAG                         | 56.7 | 93.1 |
| [ReWind](https://arxiv.org/pdf/2411.15556)                 | End-to-end                  | 57.2 | 87.6 |
| [HERMES](https://arxiv.org/pdf/2408.17443)                 | Trained on MovieChat          | 57.3 | 78.6 |
| [Flash-VStream](https://arxiv.org/abs/2406.08085)          | Trained on MovieChat          | 59.6 | 96.0 |
| 
[MM-Screenplayer](https://arxiv.org/pdf/2406.17309)        | RAG                         | 68.8 | 87.5 |
| [VILA1.5-8B](https://openreview.net/pdf?id=oS79Tw3G0c)     | End-to-end                  |  -   | 40.0 |
| [FocusChat](https://arxiv.org/pdf/2412.12833)              | End-to-end                  |  -   | 60.0 |
| [llavaonevision-MovieChat](https://github.com/rese1f/MovieChat) | End-to-end             | -    | 79.0 |
| [Sullam Jeoung, _et al_](https://arxiv.org/pdf/2410.20252) | Agent                       | -    | 84.8 |
| [SEAL](https://arxiv.org/pdf/2412.01798)                   | Trained on MovieChat          | -    | 86.8 |
| [HEM-LLM](https://arxiv.org/pdf/2409.06299)                | Unknown training dataset    | -    | 90.6 |

## 🔢 Evaluation of MovieChat on Existing Benchmarks

In alphabetical order.

| Benchmark | Results |
|-----------|---------|
| ActivityNet-QA | Accuracy / Score: 45.7 / 3.4 |
| Charades-STA | R@1 (IoU=0.3): 8.8 • R@1 (IoU=0.5): 2.9 • R@1 (IoU=0.7): 1.3 |
| CineClipQA | Overall: 20.86/2.11 • Description: 23.67/2.41 • Intention: 30.19/2.41 • Perception: 21.80/1.97 • Temporality: 16.32/1.97 • Spatiality: 16.40/1.98 |
| CVRR-ES | Average: 16.41 |
| EgoSchema | Top-1 Accuracy: 53.5 |
| EventBench | Accuracy: 20.33 |
| InfiniBench | Global appearance: 6.59 • Scene transitions: 6.41 • Character actions: 4.51 • Chronological order: 36.99 • Local vision: 17.76 • Summarization: 0.14 • Deep context: 0.55 • Spoiler questions: 0.34 • Multiple events: 0.85 • Average: 14.45/0.47 |
| InfiniBench-Vision | Accuracy: 14.2 • Score: 1.2 |
| LvBench | ER: 21.3 • EU: 23.1 • KIR: 25.9 • TG: 22.3 • Rea: 24.0 • Sum: 17.2 • Overall: 22.5 |
| LvM-QA | Accuracy / Score: 48.3 / 2.57 |
| MLVU | Holistic TR: 29.5 • AR: 25.0 • VS: 2.33 • Single-detail NQA: 24.2 • ER: 24.7 • PQA: 25.8 • SSC: 3.23 • Multi-detail AO: 28.6 • AC: 22.8 • M-Avg: 25.8 • G-Avg: 2.78 |
| MovieChat-1K | Global accuracy / score: 62.3 / 3.23 • Breakpoint accuracy / score: 48.3 / 2.57 |
| MovieCORE | Accuracy: 20.33 • Comparison: 2.90 • Depth: 2.29 • Evidence: 2.14 • Coherence: 2.30 • Average: 2.23 |
| MSVD-QA | Accuracy / Score: 75.2 / 3.8 
|
| MSRVTT-QA | Accuracy / Score: 52.7 / 2.6 |
| NExT-QA | Accuracy / Score: 49.9 / 2.7 |
| QVHighlight | mAP: 11.7 • HIT@1: 16.1 |
| RVS-Ego | Accuracy / Score: 50.7 / 3.4 |
| RVS-Movie | Accuracy / Score: 36.0 / 2.3 |
| Seed-Bench | Procedure understanding: 29.82 • Action recognition: 40.11 |
| SFD | Multiple-choice V: 8.4 • L: 16.4 • VL: 8.0 • Open-ended V: 14.0 • L: 15.7 • VL: 11.8 |
| SVBench | Dialogue SA: 20.46 • Dialogue CC: 20.05 • Dialogue LC: 27.76 • Dialogue TU: 21.81 • Dialogue IC: 22.21 • Dialogue OS: 21.89 • Streaming SA: 17.99 • Streaming CC: 16.42 • Streaming LC: 20.37 • Streaming TU: 15.77 • Streaming IC: 19.08 • Streaming OS: 17.43 |
| TV-Caption | BertScore: 38.11 • CIDER: 8.43 • ROUGE-L: 12.09 • SPICE: 9.21 |
| VCG Bench | CI: 2.76 • DO: 2.93 • CU: 3.01 • TU: 2.24 • CO: 2.42 • Average: 2.67 |
| VDC | Camera: 37.25/1.98 • Short: 32.55/1.59 • Background: 28.99/1.54 • Main object: 31.97/1.64 • Object: 28.82/1.46 • Average: 31.92/1.64 |
| VideoMME | Without subtitles: 38.2 • Without subtitles (long videos): 33.4 |
| Video-ChatGPT | Average: 2.67 • CI: 2.76 • DO: 2.93 • CU: 3.01 • TU: 2.24 • CO: 2.42 |
| VS-Ego | Accuracy / Score: 52.2 / 3.4 |
| VS-Movie | Accuracy / Score: 39.1 / 2.3 |
| YouCook2 | C: 38.5 • M: 18.8 |

## :fire: News
* **[2024.10.26] :keyboard:** We upload MovieChat, MovieChat_OneVision, and MovieChat-1K to [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
* **[2024.10.26] :keyboard:** We release a new version of MovieChat that uses LLaVA-OneVision as the base model instead of VideoLLaMA. It is available at [MovieChat_Onevision](https://github.com/rese1f/MovieChat/tree/main/MovieChat_Onevision).
* **[2024.6.13] :film_projector:** We release the ground truth of the MovieChat test set on [Hugging Face](https://huggingface.co/datasets/Enxin/MovieChat-1K-test).
* **[2024.5.10] :film_projector:** We release the raw videos of the MovieChat training set on [Hugging Face](https://huggingface.co/datasets/Enxin/MovieChat-1K_train).
* **[2024.4.29] :page_with_curl:** We update the MovieChat+ [paper](https://arxiv.org/abs/2404.17176) with implementation details, technical evaluation, and dataset information.
* **[2024.4.25] 
:keyboard:** We release a new version, MovieChat+. We release the [MovieChat+ code](https://github.com/rese1f/MovieChat/blob/main/MovieChat/models/moviechat%2B.py) and the corresponding [evaluation code](https://github.com/rese1f/MovieChat/blob/main/eval_code/result_prepare/run_inference_qa_moviechat%2B.py). Our paper is coming soon!
* **[2024.4.19] :keyboard:** We update the latest source code of MovieChat on [PyPI](https://pypi.org/). Now you can install and use MovieChat directly via `pip install MovieChat`!
* **[2024.3.25] :bar_chart:** We host challenge track 1 of the [4th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot](https://cvpr.thecvf.com/Conferences/2024/workshop-list) at CVPR 2024. You can participate in the challenge and submit your results via [Codalab](https://codalab.lisn.upsaclay.fr/competitions/18284?secret_key=bd5e312c-4775-43cf-933b-70726d00bcbe). We show the results on the [leaderboard](https://espere-1119-song.github.io/LOVEU-CVPR-24-Track-1-Leaderboard/). We ask each participant to submit results in JSON format and to report the average running time and VRAM usage, which we use to identify the most efficient methods. For details of the challenge, please refer to this [link](https://sites.google.com/view/loveucvpr24/track1).
* **[2024.3.11] :film_projector:** We release the test set of MovieChat-1K on [Hugging Face](https://huggingface.co/datasets/Enxin/MovieChat-1K-test). Each video contains 3 global questions and 10 breakpoint questions.
* **[2024.2.27] :tada:** Our paper is accepted by CVPR 2024!
* **[2024.2.14] :film_projector:** We release the training set of MovieChat-1K on [Hugging Face](https://huggingface.co/datasets/Enxin/MovieChat-1K_train). Due to copyright restrictions, we share the clip features extracted by [eva_vit_g](https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth), with 8192 frames per video.
* **[2023.11.27] :page_with_curl:** We update the [paper](https://arxiv.org/pdf/2307.16449v2.pdf) with implementation details, technical evaluation, and dataset information.
* **[2023.11.23] :keyboard:** We update the latest source code of MovieChat.
* **[2023.8.1] :page_with_curl:** We release the [paper](https://arxiv.org/abs/2307.16449).
* **[2023.7.31] 
:keyboard:** We release the evaluation [code and instructions](https://github.com/rese1f/MovieChat/tree/main/eval_code) for short video question answering on **MSVD-QA**, **MSRVTT-QA**, and **ActivityNet-QA**.
* **[2023.7.29] :joystick:** We release the [Gradio demo](https://github.com/rese1f/MovieChat/tree/main/Gradio_demo) of MovieChat.
* **[2023.7.22] :keyboard:** We release the source code of MovieChat.

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/moviechat-from-dense-token-to-sparse-memory/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=moviechat-from-dense-token-to-sparse-memory)\
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/moviechat-from-dense-token-to-sparse-memory/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=moviechat-from-dense-token-to-sparse-memory)\
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/moviechat-from-dense-token-to-sparse-memory/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=moviechat-from-dense-token-to-sparse-memory)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/moviechat-from-dense-token-to-sparse-memory/zero-shot-long-video-global-mode-question)](https://paperswithcode.com/sota/zero-shot-long-video-global-mode-question?p=moviechat-from-dense-token-to-sparse-memory)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/moviechat-from-dense-token-to-sparse-memory/zero-shot-long-video
-breakpoint-mode-question)](https://paperswithcode.com/sota/zero-shot-long-video-breakpoint-mode-question?p=moviechat-from-dense-token-to-sparse-memory)

## 📊 Performance Comparison on MovieChat-1K
| **Method**         | **Text Decoder**   | **# Frames** | **Global Mode Accuracy** | **Global Mode Score** |
|--------------------|--------------------|--------------|----------------------|----------------------|
| GIT                | Non-LLM based      | 6            | 28.8                 | 1.83                 |
| mPLUG-2            | Non-LLM based      | 8            | 31.7                 | 2.13                 |
| **Video Chat**     | LLM based          | 32           | 57.8                 | 3.00                 |
| **Video LLaMA**    | LLM based          | 32           | 51.7                 | 2.67                 |
| **Video-ChatGPT**  | LLM based          | 100          | 47.6                 | 2.55                 |
| **MovieChat**      | LLM based          | 2048         | 62.3                 | 3.23                 |
| **MovieChat+**     | LLM based          | 2048         | 71.2                 | 3.51                 |
| **MovieChat-Onevision**  | LLM based    | 2048         | **79.0**             | **4.20**             |

## ✨ How to Run MovieChat Quickly

We package MovieChat and upload it to PyPI. To run MovieChat quickly, install it first:
```
pip install MovieChat
```
We currently recommend installing version `0.6.3`. Since MovieChat automatically downloads checkpoints from Hugging Face, if your server does not support `git clone` from a <HuggingFace url>, we recommend downloading the checkpoints to your server and changing the corresponding paths in the package, including [q_former_model](https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth), [ckpt_path](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth?download=true), and [llama_model](https://huggingface.co/Enxin/MovieChat-vicuna).

Before you run the inference code below, please verify your `ffprobe` installation via `ffprobe -version`. If it is installed correctly, the command should return 
the version of `ffprobe`. Otherwise, install it via `sudo apt-get install ffmpeg` (Ubuntu).

```python
from PIL import Image
import cv2

from MovieChat.processors.video_processor import AlproVideoEvalProcessor
from MovieChat.models.chat_model import Chat
from MovieChat.models.moviechat import MovieChat

device = 'cuda:0'
print('Initializing Chat')
moviechat_model = MovieChat.from_config(device=device).to(device)
vis_processor_cfg = {'name': 'alpro_video_eval', 'n_frms': 8, 'image_size': 224}
frame_processor = AlproVideoEvalProcessor.from_config(vis_processor_cfg)
chat = Chat(moviechat_model, frame_processor, device=device)
print('Initialization Finished')

video_path = "Your video path, end with mp4"
fragment_video_path = "The path to store tmp video clips"
middle_video = False  # True -> Breakpoint mode, False -> Global mode
question = "Your Question"
cur_min = 0  # Change it in Breakpoint mode
cur_sec = 0  # Change it in Breakpoint mode

# Grab the frame at the queried timestamp (frame index = fps * elapsed seconds).
cap = cv2.VideoCapture(video_path)
cur_fps = cap.get(cv2.CAP_PROP_FPS)
cap.set(cv2.CAP_PROP_POS_FRAMES, int(cur_fps * (cur_min * 60 + cur_sec)))
ret, frame = cap.read()
rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
pil_image = Image.fromarray(rgb_frame)
image = chat.image_vis_processor(pil_image).unsqueeze(0).unsqueeze(2).half().to(device)
cur_image = chat.model.encode_image(image)

img_list = []
msg = chat.upload_video_without_audio(
    video_path=video_path,
    fragment_video_path=fragment_video_path,
    cur_min=cur_min,
    cur_sec=cur_sec,
    cur_image=cur_image,
    img_list=img_list,
    middle_video=middle_video,
    question=question
)
answer = chat.answer(
    img_list=img_list,
    input_text=question,
    msg=msg,
    num_beams=1,
    temperature=1.0,
    max_new_tokens=300,
    max_length=2000)[0]

print(answer)
```

Note that if you receive a `RuntimeError` like `"Error reading <filename.mp4>"`, one solution is to initialize `<filename.mp4>` with any other video file.

## 💡 
Overview

![](https://oss.gittoolsai.com/images/rese1f_MovieChat_readme_55631525c7af.png)

## 📣 Demo Video

[![Alt text](https://oss.gittoolsai.com/images/rese1f_MovieChat_readme_dfe02cf159e4.jpg)](https://www.youtube.com/embed/Dx5BQmgK4n8?si=FN9pLyQBN--vJBZA)

## ⚡ Comparison Case

<div style="color:orange; border-bottom: 1px solid #d9d9d9;
    display: inline-block;
    color: #999;
    padding: 2px;"> Question-answering about a clip from YouTube, which is a tutorial on how to cook steak. The whole process starts with marinating the steak, followed by pan-frying it, preparing the side dishes, and finally plating. Green (red) highlights the correct (wrong) answer, and yellow indicates that the model is hallucinating.
</div>

<p align="center" width="100%">
<a target="_blank"><img src="https://oss.gittoolsai.com/images/rese1f_MovieChat_readme_800e86bc801e.png"  style="width: 80%; min-width: 200px; display: block; margin: auto;"></a>
</p>

## 😍 Examples

<div style="color:orange; border-bottom: 1px solid #d9d9d9;
    display: inline-block;
    color: #999;
    padding: 2px;"> Question-answering about a clip from the animated film Zootopia, the story of a determined rabbit police officer, Judy, who teams up with a cunning fox to uncover a conspiracy of missing animals and develops an unexpected friendship.
</div>

<p align="center" width="100%">
<a target="_blank"><img src="https://oss.gittoolsai.com/images/rese1f_MovieChat_readme_9c77b07c7a07.png"  style="width: 80%; min-width: 200px; display: block; margin: auto;"></a>
</p>


<div style="color:orange; border-bottom: 1px solid #d9d9d9;
    display: inline-block;
    color: #999;
    padding: 2px;"> Question-answering about a clip from the TV series Guardian: The Lonely and Great God (Goblin), a romance in which the immortal goblin Kim Shin must find a human bride to end his endless life and meets Ji Eun-tak, a girl fated to die who claims to be the "Goblin's bride".
</div>
<p align="center" width="100%">
<a target="_blank"><img src="https://oss.gittoolsai.com/images/rese1f_MovieChat_readme_4879e63ea188.png" style="width: 80%; min-width: 200px; display: block; margin: auto;"></a>
</p>

<div style="color:orange; border-bottom: 1px 
solid #d9d9d9;\n    display: inline-block;\n    color: #999;\n    padding: 2px;\"> 关于电视剧《权力的游戏》（Game of Thrones）片段的问答，该剧讲述了七大王国内部权力斗争和政治阴谋的史诗奇幻故事，交织着错综复杂的家庭关系，背景设定在一个古老的、神秘的威胁之下。\n\u003C\u002Fdiv>\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_68193a4b18ad.png\" style=\"width: 80%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cdiv style=\"color:orange; border-bottom: 1px solid #d9d9d9;\n    display: inline-block;\n    color: #999;\n    padding: 2px;\"> 关于 YouTube 上包含一些励志电影场景集锦的片段问答。该视频片段由《死亡爬行》（The Death Crawl）、《卡特教练》（Coach Carter）、《洛奇：拳王再临》（Rocky Balboa）和《我们是马歇尔》（We Are Marshall）的几个片段组成，时长各不相同。\n\u003C\u002Fdiv>\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_b5fba0307272.png\" style=\"width: 80%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n## 🚀 基准测试：MovieChat-1K \n\n为了更好地评估 MovieChat 的性能，我们收集了一个用于长视频理解任务的新基准 MovieChat-1K，该基准包含来自各种电影和电视剧的 1K 个高质量视频片段，并附有 14K 条人工标注。\n\n据我们所知，目前尚未建立长视频理解数据集。我们的工作代表了创建并公开此类数据集的第一步。我们创建了 MovieChat-1K，包含 1K 个长视频及对应的 1K 个密集描述（dense captions），以及 13K 个视觉问答对（visual question-answer pairs）。对于每个视频，我们手动设置并提供 1 个覆盖整个视频的密集描述，3 个全局模式（global mode）的问答对，以及 10 个带时间戳的断点模式（breakpoint mode）的问答对。\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_b69b077530d0.png\" style=\"width: 100%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\n我们从 15 个流行类别中收集了视频，分布各不相同，包括纪录片、侦探片、动画片等。其中，每个视频由多个交替的场景组成，在集合背景下贡献了多样且动态的视觉叙事。超过 90% 的视频时长范围为 10K 到 12K 帧，而 14.6% 的视频时长超过 12K 帧。只有 8.6% 的视频时长少于 10K 帧。\n\n### 问答对\n\n#### 词分布\n请注意，MovieChat-1K 是专门为长视频理解任务设计的，大多数问题是开放式的，只有四分之一被归类为选择题，以\"Do,\" \"Does,\" 
\"Is,\"或\"Are\"等引导词标记。我们还计算了所提供问答对的词分布，其中包括常见物体（人、衣服等）、时间（白天、夜晚等）、场景（室内、室外等）等。\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_2d0d0e7e46c0.png\" style=\"width: 40%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\n#### 句子长度分布\nMovieChat-1K 在分段片段级别上展示了多样化的问答对长度。尽管全局模式和断点模式之间的问答对分布有所不同，但大多数问题的长度倾向于集中在 5-15 个单词之间，而答案的长度通常少于 10 个单词。\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_654fe19a010b.png\" style=\"width: 70%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\n### 密集描述\n\n为了促进对长视频的更详细理解，我们为每个视频提供了一个密集描述。MovieChat-1K 在分段片段级别上展示了多样化的描述长度。大约三分之二的片段拥有 100-149 个单词的描述，而五分之一的片段描述少于 100 个单词。约 11% 的片段拥有超过 150 个单词的长描述。\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_c17255d82f66.png\" style=\"width: 40%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\n为了分析生成描述的词分布，我们计算了它们的分布。生成的描述词分布如图 B6 所示，其中包括常见物体（男人、女人、人、女孩等）、属性（侦探、各种、小、白色等）、位置（内部、后面、南边、旁边等）、场景（房间、房子、建筑、办公室等）、动作\u002F事件（交谈、进入、离开、拿取等）等。\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Ca target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_readme_77ab480bd8c4.png\" style=\"width: 45%; min-width: 200px; display: block; margin: auto;\">\u003C\u002Fa>\n\n就动作性而言，MovieChat-1K 描述中的动词数量与 WebVid10M 数据集几乎相同。为了评估这一点，我们使用 NLTK 工具包来分析描述中的动词数量，专注于提取和标记所有唯一的动词。我们发现 WebVid10M 描述数据集中总共有 109,485 个动词，而 MovieChat-1K 描述中包含 102,988 个动词的唯一实例。虽然由于我们简单的计数方法，这些计数可能不完全准确，但我们认为它们提供了两个数据集动作性的粗略指示。\n\n\u003C!-- ## MovieChat-1K 与其他基准测试的比较\n\nMovieChat-1K 为长视频理解提供了一个大规模基准，包含 1K 部电影、1K 个密集描述和 13K 个问答对。不同数据集的比较如表 8 所示。显然，MovieChat-1K 提供了最长的电影片段平均时长。MovieQA 仅提供与电影相关的问答对，而 MovieGraphs 
提供与电影相关的描述。与其他数据集不同，MovieNet 包含三种主要类型的文本：字幕、摘要和剧本，不包括问答对。此外，摘要类别是为整部电影设计的，而不是视频片段。因此，与其他数据集相比，MovieChat-1K 更适合研究长视频理解。\n\n\u003Cdiv align=\"center\">\n\u003Ctable border=\"1\" width=\"100%\">\n    \u003Ctr align=\"center\">\n        \u003Cth>数据集\u003C\u002Fth>\u003Cth>平均时长（分钟）\u003C\u002Fth>\u003Cth>描述数量\u003C\u002Fth>\u003Cth>平均描述长度\u003C\u002Fth>\u003Cth>问答对数量\u003C\u002Fth>\u003Cth>平均问题长度\u003C\u002Fth>\u003Cth>平均回答长度\u003C\u002Fth>\n    \u003C\u002Ftr>\n    \u003Ctr align=\"center\">\n        \u003Ctd>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F1512.02902\">MovieQA\u003C\u002Fa>\u003C\u002Ftd>\u003Ctd>3.5\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\u003Ctd>14.9K\u003C\u002Ftd>\u003Ctd>9.3\u003C\u002Ftd>\u003Ctd>5.1\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003C\u002Ftr>\n    \u003Ctr align=\"center\">\n        \u003Ctd>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.06761\">MovieGraphs\u003C\u002Fa>\u003C\u002Ftd>\u003Ctd>0.73\u003C\u002Ftd>\u003Ctd>15K\u003C\u002Ftd>\u003Ctd>35\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr align=\"center\">\n        \u003Ctd>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.10937\">MovieNet\u003C\u002Fa>\u003C\u002Ftd>\u003Ctd>2.1\u003C\u002Ftd>\u003Ctd>2.5K\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr align=\"center\">\n        \u003Ctd>MovieChat-1K\u003C\u002Ftd>\u003Ctd>9.4\u003C\u002Ftd>\u003Ctd>1K\u003C\u002Ftd>\u003Ctd>121\u003C\u002Ftd>\u003Ctd>13K\u003C\u002Ftd>\u003Ctd>7.8\u003C\u002Ftd>\u003Ctd>2.3\u003C\u002Ftd>\n    \u003C\u002Ftr>\n\u003C\u002Ftable>\n\u003C\u002Fdiv> -->\n\n🔐 &#x00A9; **由于版权顾虑和电影大小限制，我们计划发布数据集的特征。请等待几周。**\n\n## 🛠️ 安装 \n\n### 环境准备\n\n首先，创建一个 conda 环境：\n```\nconda env create -f environment.yml\nconda activate moviechat\n```\n\n### 
Prerequisites\n\nBefore using this repository, make sure you have the following prerequisites:\n\n#### Pre-trained language decoder\n\n- Obtain the original LLaMA weights in Hugging Face format by following the instructions [here](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fmodel_doc\u002Fllama).\n- Download the Vicuna delta weights :point_right: [[7B](https:\u002F\u002Fhuggingface.co\u002Flmsys\u002Fvicuna-7b-delta-v0)] (note: we use the **v0 weights**, not the v1.1 weights). \n- Add the delta weights to the original LLaMA weights to obtain the Vicuna weights, using the following command:\n\n```bash\npython apply_delta.py \\\n    --base ckpt\u002FLLaMA\u002F7B_hf \\\n    --target ckpt\u002FVicuna\u002F7B \\\n    --delta ckpt\u002FVicuna\u002Fvicuna-7b-delta-v0\n```\n\n#### Pre-trained visual encoder for MovieChat\n- Download the MiniGPT-4 model (the trained linear layer) from [this link](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1a4zLvaiDBr-36pasffmgpvH5P7CKmpze\u002Fview).\n\n#### Download the pre-trained checkpoint\n\n- Download the pre-trained checkpoint from [this link](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVideo-LLaMA-Series\u002Fresolve\u002Fmain\u002Ffinetune-vicuna7b-v2.pth) to run MovieChat locally with Vicuna-7B as the language decoder.\n\n## 🤖 How to run the demo locally\n\nFirst, set `llama_model`, `llama_proj_model`, and `ckpt` in [eval_configs\u002FMovieChat.yaml](.\u002Feval_configs\u002FMovieChat.yaml).\nThen run the script:\n```bash\npython inference.py \\\n    --cfg-path eval_configs\u002FMovieChat.yaml \\\n    --gpu-id 0 \\\n    --num-beams 1 \\\n    --temperature 1.0 \\\n    --text-query \"What is he doing?\" \\\n    --video-path src\u002Fexamples\u002FCooking_cake.mp4 \\\n    --fragment-video-path src\u002Fvideo_fragment\u002Foutput.mp4 \\\n    --cur-min 1 \\\n    --cur-sec 1 \\\n    --middle-video 1\n```\nNote: to use the global mode (understanding and answering questions about the **entire** video), remember to set `--middle-video` to 0.\n\n\u003C!-- ## 👍 Main results\n### Short video question answering\nWe use several widely adopted open-ended datasets for short video question answering: MSVD-QA, MSRVTT-QA, and ActivityNet-QA. The evaluation is conducted with the assistance of an LLM under the default hyperparameter settings. Accuracy and a relative score on a scale of 0 to 5 are reported. Compared with previous methods, MovieChat achieves comparable performance even though it is not specifically designed for short video question answering.\n\n\u003Cdiv align=\"center\">\n\u003Ctable border=\"1\" width=\"100%\">\n    \u003Ctr align=\"center\">\n        
\u003Cth>Method\u003C\u002Fth>\u003Cth>LLM\u003C\u002Fth>\u003Cth>Conversation\u003C\u002Fth>\u003Cth>Detail description\u003C\u002Fth>\u003Cth>Complex reasoning\u003C\u002Fth>\u003Cth>All\u003C\u002Fth>\n    \u003C\u002Ftr>\n    \u003Ctr align=\"center\">\n        \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FChat-UniVi\u002FChat-UniVi\">Chat-UniVi-7B\u003C\u002Fa>\u003C\u002Ftd>\u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Flmsys\u002Fvicuna-7b-v1.5\">Vicuna-7B\u003C\u002Fa>\u003C\u002Ftd>\u003Ctd>\u003Cb>84.1\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>74.2\u003C\u002Ftd>\u003Ctd>93.7\u003C\u002Ftd>\u003Ctd>84.2\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr align=\"center\">\n        \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FChat-UniVi\u002FChat-UniVi-13B\">Chat-UniVi-13B\u003C\u002Fa>\u003C\u002Ftd>\u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Flmsys\u002Fvicuna-13b-v1.5\">Vicuna-13B\u003C\u002Fa>\u003C\u002Ftd>\u003Ctd>\u003Cb>84.1\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>79.4\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>94.7\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>86.1\u003C\u002Fb>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n\u003C\u002Ftable>\n\u003C\u002Fdiv> -->\n\n## 🤝 Acknowledgements\nWe thank the following excellent projects that inspired our MovieChat:\n* [Video-LLaMA](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideo-LLaMA): An instruction-tuned audio-visual language model for video understanding\n* [Token Merging](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FToMe): Faster Vision Transformers (ViT)\n* [XMem](https:\u002F\u002Fgithub.com\u002Fhkchengrex\u002FXMem): Long-term video object segmentation with an Atkinson-Shiffrin memory model\n* [MiniGPT-4](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4): Enhancing vision-language understanding with advanced large language models\n* [FastChat](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat): An open platform for training, serving, and evaluating LLM-based chatbots\n* [BLIP-2](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fblip2): Bootstrapping language-image pre-training with frozen image encoders and large language models \n* 
[EVA-CLIP](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEVA\u002Ftree\u002Fmaster\u002FEVA-CLIP): Improved training techniques for CLIP at scale\n* [LLaMA](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fllama): Open and efficient foundation language models\n* [VideoChat](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything): Chat-centric video understanding\n* [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA): Large language and vision assistant\n\n\n## 🔒 Terms of use\nOur MovieChat is only a research preview, intended for non-commercial use only. You are **strictly prohibited** from using MovieChat for any illegal, harmful, violent, racist, or sexual purposes. Any activity that may violate these guidelines is strictly forbidden. \n\n## ✏️ Citation\n\nIf you find MovieChat useful for your research and applications, please cite it using this BibTeX:\n\n```bibtex\n@article{song2023moviechat,\n  title={MovieChat: From Dense Token to Sparse Memory for Long Video Understanding},\n  author={Song, Enxin and Chai, Wenhao and Wang, Guanhong and Zhang, Yucheng and Zhou, Haoyang and Wu, Feiyang and Guo, Xun and Ye, Tian and Lu, Yan and Hwang, Jenq-Neng and others},\n  journal={arXiv preprint arXiv:2307.16449},\n  year={2023}\n}\n\n@article{song2024moviechat+,\n  title={MovieChat+: Question-aware Sparse Memory for Long Video Question Answering},\n  author={Song, Enxin and Chai, Wenhao and Ye, Tian and Hwang, Jenq-Neng and Li, Xi and Wang, Gaoang},\n  journal={arXiv preprint arXiv:2404.17176},\n  year={2024}\n}\n```","# MovieChat Quick Start Guide\n\n**MovieChat** is an open-source tool introduced in a CVPR 2024 paper that focuses on long video understanding. Using a \"from dense token to sparse memory\" mechanism, it can process videos of more than 10,000 frames on a single GPU with 24 GB of VRAM, dramatically reducing memory usage. The latest version has been updated to use LLaVA-OneVision as its base model.\n\n## Environment Setup\n\n- **Operating system**: Linux \u002F Windows \u002F macOS\n- **Language**: Python >= 3.8\n- **Deep learning framework**: PyTorch (must match your CUDA version)\n- **Hardware**: \n  - NVIDIA GPU (24 GB of VRAM recommended for long-video processing)\n  - CUDA Toolkit (choose according to your PyTorch version)\n\n> 💡 **Tip**: If your network access is restricted, use a regional PyPI mirror when installing dependencies to speed up downloads.\n\n## Installation\n\n### 1. Install from PyPI (recommended)\nMovieChat is published on PyPI and can be installed directly with pip:\n\n```bash\npip install Moviechat\n```\n\nIf downloads are slow, use the Tsinghua or Aliyun mirror:\n\n```bash\npip install Moviechat -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 2. 
Install from source (optional)\nTo get the latest evaluation code or to modify the source, clone the repository:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat.git\ncd MovieChat\n```\n\n## Basic Usage\n\n### Run the Gradio demo\nMovieChat provides a convenient web interface for interactive testing. Enter the `Gradio_demo` directory and run the launch script:\n\n```bash\ncd Gradio_demo\npython app.py\n```\n\nOnce it starts, open the local link in your browser to upload a video and run question-answering tests.\n\n### Model evaluation\nThe project ships evaluation code for several benchmarks (e.g. ActivityNet-QA and MSVD-QA). See the documentation in the `eval_code` directory for how to run it.\n\n---\n*Note: for more details, the dataset, and the paper, visit the [GitHub repository](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat).*","The security audit team at a large logistics park needs to review 8 hours of warehouse surveillance footage to pinpoint exactly how a cargo-damage incident unfolded and who was responsible.\n\n### Without MovieChat\n- Conventional multimodal models consume too much GPU memory and frequently crash when processing video sequences longer than an hour.\n- Random frame sampling, forced by memory limits, drops key action segments, making it impossible to reconstruct the full chain of events.\n- Lacking long-horizon memory, they struggle to link context across time spans and easily misjudge ambiguous frames.\n- High cloud-compute costs make large-scale review of historical footage economically impractical.\n\n### With MovieChat\n- MovieChat smoothly processes videos of more than ten thousand frames on a single 24 GB GPU, analyzing the full footage without splitting it.\n- Its sparse memory mechanism retains key visual information, accurately linking cause and effect even across gaps of several minutes.\n- GPU memory usage is roughly one-thousandth that of other methods, greatly lowering the hardware bar; an ordinary workstation can handle complex queries.\n- It can answer precisely what happened at a given hour and minute, directly producing high-precision, time-stamped event descriptions.\n\nWith its sparse memory architecture, MovieChat makes deep understanding of long videos efficient, low-cost, and reliable.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frese1f_MovieChat_e6b4b668.png","rese1f","Wenhao Chai","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Frese1f_996f5388.jpg","CS Ph.D. 
@ Princeton University","Princeton University","Princeton, NJ","wenhao.chai@princeton.edu","wenhaocha1","wenhaochai.com","https:\u002F\u002Fgithub.com\u002Frese1f",[86,90],{"name":87,"color":88,"percentage":89},"Python","#3572A5",94.2,{"name":91,"color":92,"percentage":93},"Jupyter Notebook","#DA5B0B",5.8,690,43,"2026-04-01T07:46:58","BSD-3-Clause","Not specified","Supports GPUs with 24GB of VRAM (can process videos of >10K frames); specific GPU models and CUDA versions are not specified",{"notes":101,"python":98,"dependencies":102},"Installable directly via pip; some training and test sets must be downloaded from Hugging Face; ships a Gradio demo interface; supports long video understanding.",[98],[51,13,14,26,52,54],[105,106,107,108,109,110],"computer-vision","multimodal-large-language-models","long-video-understanding","llama","large-language-models","dataset",8,null,"2026-03-27T02:49:30.150509","2026-04-06T07:12:03.405588",[116,121,126,131,136,141],{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},2601,"What should I do if inference produces garbled or random characters?","Check the model path settings in the config file `MovieChat.yaml`. Users have reported that setting `llama_model` to `ckpt\u002FVicuna\u002Fvicuna-7b-delta-v0` leads to abnormal output; change it to `ckpt\u002FVicuna\u002F7B` instead. Make sure the Vicuna model path is correct and the files exist.","https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat\u002Fissues\u002F12",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},2602,"How do I fix inference.py hanging or raising a temp_frame_path error?","First check whether the path that `temp_frame_path` points to exists. If it still fails, try temporarily supplying any image as `snap_shot.jpg`; the `imwrite` call in the program will overwrite it with the correct snapshot. Also make sure the input video is long enough to satisfy the inference arguments.","https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat\u002Fissues\u002F4",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},2603,"What are the model's limits on the number of frames and tokens?","The LLaMA model's input is limited to 256. After memory consolidation, at most 256 frames are retained; in global mode, 256 frames are likewise kept in long-term memory. If a video is too long, content beyond this limit is consolidated away.","https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat\u002Fissues\u002F32",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},2604,"Where can the MovieChat-1K 
dataset be downloaded, and how many unique movies does it contain?","Because movie-level metadata was not stored when the dataset was built, it is currently difficult to compute the exact number of unique movies used. In addition, due to privacy constraints on the movie data, some details may not be fully released. Please watch for official follow-up announcements about the dataset release.","https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat\u002Fissues\u002F29",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},2605,"When will the validation set be released? Does a failed submission count as a valid attempt?","Access to the validation set will be announced by email. As for submission rules, a failed submission is treated as a valid attempt and counts against the submission quota. For specific access requests, contact the maintainers directly by email.","https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat\u002Fissues\u002F66",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},2606,"What should I do when the offline Gradio service starts but the frontend is unresponsive or throws errors?","Such errors usually involve exception handling in async tasks or data-validation failures. Suggested steps: 1. catch and handle exceptions with try-except inside the task function; 2. add await when calling the task function so exceptions are received; 3. wait for task objects to complete before the event loop closes; 4. check that the input data matches the Pydantic model definition (data types, ranges, required fields) and adjust it per the error message until validation passes.","https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat\u002Fissues\u002F10",[]]
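The dataset-statistics section above reports counting 102,988 unique verb instances in the MovieChat-1K captions with the NLTK toolkit. A minimal stdlib-only sketch of that kind of bookkeeping, assuming the captions have already been POS-tagged with Penn Treebank tags (the form NLTK's `pos_tag` produces); `count_unique_verbs` and the sample captions are hypothetical, not code from the repository:

```python
from collections import Counter

def count_unique_verbs(tagged_captions):
    """Count verb tokens across POS-tagged captions.

    tagged_captions: iterable of (token, tag) lists, where tags follow
    the Penn Treebank convention (verb tags start with 'VB').
    """
    verbs = Counter()
    for caption in tagged_captions:
        for token, tag in caption:
            if tag.startswith("VB"):
                verbs[token.lower()] += 1
    return verbs

# Hypothetical pre-tagged captions; on real data, NLTK's word_tokenize
# plus pos_tag would supply these (token, tag) pairs.
captions = [
    [("A", "DT"), ("man", "NN"), ("enters", "VBZ"), ("the", "DT"), ("room", "NN")],
    [("She", "PRP"), ("is", "VBZ"), ("talking", "VBG"), ("and", "CC"), ("leaves", "VBZ")],
]
verbs = count_unique_verbs(captions)
print(len(verbs))          # 4 unique verbs: enters, is, talking, leaves
print(verbs["enters"])     # 1
```

As the README itself notes, such simple counting is only a rough indicator of actionness, since tagging errors and lemmatization are ignored.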
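The FAQ above notes that after memory consolidation at most 256 frames are kept, matching the LLaMA input limit of 256. A toy sketch of capacity-bounded consolidation that repeatedly merges the most similar adjacent items, with scalars standing in for frame embeddings; this only illustrates the idea and is not the repository's actual token-merging implementation:

```python
def consolidate(frames, capacity=256):
    """Greedily merge the most similar adjacent 'frame features' until
    at most `capacity` remain (scalars stand in for embeddings; the
    absolute difference stands in for embedding similarity)."""
    frames = list(frames)
    while len(frames) > capacity:
        # find the adjacent pair with the smallest difference
        i = min(range(len(frames) - 1),
                key=lambda k: abs(frames[k] - frames[k + 1]))
        merged = (frames[i] + frames[i + 1]) / 2.0
        frames[i:i + 2] = [merged]   # replace the pair with its mean
    return frames

long_video = [float(t) for t in range(1000)]   # 1000 stand-in frame features
memory = consolidate(long_video, capacity=256)
print(len(memory))  # 256
```

The point of the pattern is that memory stays bounded no matter how long the input video is, which is why content beyond the limit is "consolidated away" rather than truncated.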
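The last FAQ entry recommends wrapping async task bodies in try-except, awaiting the tasks, and letting them finish before the event loop closes. A minimal stdlib sketch of that pattern; `risky_task` is hypothetical and unrelated to Gradio's internals:

```python
import asyncio

async def risky_task(x):
    # Step 1 of the FAQ advice: catch exceptions inside the task body
    # so they surface as values instead of being lost in the event loop.
    try:
        if x < 0:
            raise ValueError("negative input")
        return x * 2
    except ValueError as exc:
        return f"error: {exc}"

async def main():
    # Steps 2-3: await the tasks so results (and any escaped exceptions)
    # are received before the event loop shuts down.
    return await asyncio.gather(risky_task(3), risky_task(-1))

print(asyncio.run(main()))  # [6, 'error: negative input']
```

`asyncio.run` closes the loop only after `main` completes, so no task is left pending; unhandled failures would otherwise appear as "Task exception was never retrieved" warnings.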