[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-DAMO-NLP-SG--VideoLLaMA3":3,"tool-DAMO-NLP-SG--VideoLLaMA3":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":80,"owner_website":80,"owner_url":81,"languages":82,"stars":95,"forks":96,"last_commit_at":97,"license":98,"difficulty_score":10,"env_os":99,"env_gpu":100,"env_ram":99,"env_deps":101,"category_tags":114,"github_topics":80,"view_count":23,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":115,"updated_at":116,"faqs":117,"releases":138},2539,"DAMO-NLP-SG\u002FVideoLLaMA3","VideoLLaMA3","Frontier Multimodal Foundation Models for Image and Video Understanding","VideoLLaMA 3 是由阿里达摩院（DAMO-NLP-SG）团队最新推出的前沿多模态基础模型，专为深度理解图像和视频内容而设计。作为 VideoLLaMA 系列的第三代作品，它旨在让 AI 像人类一样“看懂”动态世界，不仅能识别静态图片中的细节，更能精准捕捉视频中复杂的时空变化、动作逻辑以及音频信息。\n\n在日常应用中，用户常常面临难以从海量视频素材中快速提取关键信息，或希望 AI 能理解长视频中前后关联事件的痛点。VideoLLaMA 3 正是为了解决这些问题而生。它具备强大的时空建模能力，能够处理长上下文视频，准确回答关于视频情节、物体运动轨迹及声音内容的复杂提问，有效减少了传统视觉模型容易产生的“幻觉”问题，即避免 AI 胡编乱造画面中不存在的内容。\n\n这款工具非常适合多类人群使用：开发者和研究人员可以利用其开源的代码和模型权重，进行二次开发或探索多模态学习的前沿技术；内容创作者和普通用户则可以通过其提供的在线演示，轻松实现视频内容的智能摘要、细节查询和互动分析，大幅提升信息获取效率。\n\n技术层面，VideoLLaMA 3 在架构上进行了显著优化，增强了对高分辨率图像和长视频","VideoLLaMA 3 是由阿里达摩院（DAMO-NLP-SG）团队最新推出的前沿多模态基础模型，专为深度理解图像和视频内容而设计。作为 VideoLLaMA 系列的第三代作品，它旨在让 AI 像人类一样“看懂”动态世界，不仅能识别静态图片中的细节，更能精准捕捉视频中复杂的时空变化、动作逻辑以及音频信息。\n\n在日常应用中，用户常常面临难以从海量视频素材中快速提取关键信息，或希望 AI 能理解长视频中前后关联事件的痛点。VideoLLaMA 3 正是为了解决这些问题而生。它具备强大的时空建模能力，能够处理长上下文视频，准确回答关于视频情节、物体运动轨迹及声音内容的复杂提问，有效减少了传统视觉模型容易产生的“幻觉”问题，即避免 AI 胡编乱造画面中不存在的内容。\n\n这款工具非常适合多类人群使用：开发者和研究人员可以利用其开源的代码和模型权重，进行二次开发或探索多模态学习的前沿技术；内容创作者和普通用户则可以通过其提供的在线演示，轻松实现视频内容的智能摘要、细节查询和互动分析，大幅提升信息获取效率。\n\n技术层面，VideoLLaMA 3 在架构上进行了显著优化，增强了对高分辨率图像和长视频的细粒度理解能力，同时保持了高效的推理速度。它不仅支持中英文双语交互，还在多个国际权威评测基准中取得了领先成绩。无论是用于构建智能视频助手，还是作为学术研究的基座模型，VideoLLaMA 3 都提供了一个强大且灵活的选择。目前，项目已在 GitHub 和 Hugging Face 上开源，欢迎社区体验与交流。","\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_readme_d5300b997889.png\" width=\"150\" style=\"margin-bottom: 0.2;\"\u002F>\n\u003Cp>\n\n\u003Ch3 align=\"center\">\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.13106\" style=\"color:#9C276A\">\nVideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding\u003C\u002Fa>\u003C\u002Fh3>\n\u003Ch5 align=\"center\"> 
If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏 \u003C\u002Fh2>\n\n\n\u003Ch5 align=\"center\">\n\n[![hf_space](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Image_Demo-9C276A.svg)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flixin4ever\u002FVideoLLaMA3-Image)\n[![hf_space](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Video_Demo-9C276A.svg)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flixin4ever\u002FVideoLLaMA3)\n[![hf_checkpoint](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Checkpoints-9C276A.svg)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FDAMO-NLP-SG\u002Fvideollama3-678cdda9281a0e32fe79af15) \u003Cbr>\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-yellow)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002FLICENSE) \n[![Hits](https:\u002F\u002Fhits.seeyoufarm.com\u002Fapi\u002Fcount\u002Fincr\u002Fbadge.svg?url=https%3A%2F%2Fgithub.com%2FDAMO-NLP-SG%2FVideoLLaMA3&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=Visitor&edge_flat=false)](https:\u002F\u002Fhits.seeyoufarm.com)\n[![GitHub issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002FDAMO-NLP-SG\u002FVideoLLaMA3?color=critical&label=Issues)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fissues?q=is%3Aopen+is%3Aissue)\n[![GitHub closed issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-closed\u002FDAMO-NLP-SG\u002FVideoLLaMA3?color=success&label=Issues)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fissues?q=is%3Aissue+is%3Aclosed)  \u003Cbr>\n[![hf_paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Paper%20In%20HF-red.svg)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2501.13106)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2501.13106-AD1C18.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.13106) \n\u003C\u002Fh5>\n\n\u003Cdetails open>\u003Csummary>💡 Some other multimodal-LLM projects from our team may interest you ✨. 
\u003C\u002Fsummary>\u003Cp>\n\u003C!--  may -->\n\n> [**VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2) \u003Cbr>\n> Zesen Cheng*, Sicong Leng*, Hang Zhang*, Yifei Xin*, Xin Li*, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideoLLaMA2.svg?style=social)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2) [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2406.07476-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07476) \u003Cbe> \n\n> [**VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.00599) \u003Cbr>\n> Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoRefer)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideoRefer.svg?style=social)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoRefer)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2501.00599-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.00599) \u003Cbr>\n\n> [**VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16922) \u003Cbr>\n> Sicong Leng*, Hang Zhang*, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVCD)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVCD.svg?style=social)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVCD)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2311.16922-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16922) \u003Cbr>\n\n> [**The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12787) \u003Cbr>\n> Sicong Leng*, Yun Xing*, Zesen Cheng*, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FCMM)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FCMM.svg?style=social)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FCMM)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2410.12787-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12787) \u003Cbr>\n\n> [**Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17243) \u003Cbr>\n> Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing 
\u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FInf-CLIP)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FInf-CLIP.svg?style=social)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FInf-CLIP)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2410.17243-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17243) \u003Cbr>\n\n\n\n\n\u003C\u002Fp>\u003C\u002Fdetails>\n\n\n## 📰 News\n\n* **[2025.02.07]**  🔥🔥 Release our re-captioned high-quality image-text dataset [VL3-Syn7M](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDAMO-NLP-SG\u002FVL3-Syn7M).\n* **[2025.01.26]**  🔥🔥 As of Jan 26, VideoLLaMA3-7B is the best 7B-sized model on [LVBench](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTHUDM\u002FLVBench) leaderboard.\n* **[2025.01.24]**  🔥🔥 As of Jan 24, VideoLLaMA3-7B is the best 7B-sized model on [VideoMME](https:\u002F\u002Fvideo-mme.github.io\u002Fhome_page.html) leaderboard.\n* **[2025.01.22]**  👋👋 Release technical report of VideoLLaMA 3. If you have works closely related to VideoLLaMA 3 but not mentioned in the paper, feel free to let us know.\n* **[2025.01.21]**  Release models and inference code of VideoLLaMA 3.\n\n## 🌟 Introduction\nVideoLLaMA 3 is a series of multimodal foundation models with frontier image and video understanding capacity.\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_readme_ef01fa9413f6.png\" style=\"max-width: 100%; height: auto;\">\n\n\u003Cdetails>\n  \u003Csummary>💡Click here to show detailed performance on video benchmarks\u003C\u002Fsummary>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_readme_e2e8d22bbea6.png\" style=\"max-width: 100%; height: auto;\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_readme_6880c21a6b61.png\" style=\"max-width: 100%; height: auto;\">\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>💡Click here to show detailed performance on image benchmarks\u003C\u002Fsummary>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_readme_7695d71577f1.png\" style=\"max-width: 100%; height: auto;\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_readme_3be8d03350f7.png\" style=\"max-width: 100%; height: auto;\">\n\u003C\u002Fdetails>\n\n## 🛠️ Requirements and Installation\n\nBasic Dependencies:\n\n* Python >= 3.10\n* Pytorch >= 2.4.0\n* CUDA Version >= 11.8\n* transformers >= 4.46.3\n\nInstall required packages:\n\n**[Inference-only]**\n\nFor stable inference, install the following package versions:\n\n```bash\n# PyTorch and torchvision for CUDA 11.8\npip install torch==2.4.0 torchvision==0.19.0 --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n\n# Flash-attn pinned to a compatible version\npip install flash-attn==2.7.3 --no-build-isolation --upgrade\n\n# Transformers and accelerate\npip install transformers==4.46.3 accelerate==1.0.1\n\n# Video processing dependencies\npip install decord ffmpeg-python imageio opencv-python\n```\n> ⚠ **Note:** For CUDA 11.8 with `torch==2.4.0` and `torchvision==0.19.0`, use `flash-attn==2.7.3`.  
\n> If you are using a different Python or CUDA version, please check the [flash-attn releases](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\u002Freleases\u002F) to select the compatible wheel. Using incompatible versions may break the setup.\n\n**[Training]**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\ncd VideoLLaMA3\npip install -r requirements.txt\npip install flash-attn --no-build-isolation\n```\n\n## :earth_americas: Model Zoo\n\n| Model                | Base Model   | HF Link                                                      |\n| -------------------- | ------------ | ------------------------------------------------------------ |\n| VideoLLaMA3-7B       | Qwen2.5-7B   | [DAMO-NLP-SG\u002FVideoLLaMA3-7B](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVideoLLaMA3-7B) |\n| VideoLLaMA3-2B       | Qwen2.5-1.5B | [DAMO-NLP-SG\u002FVideoLLaMA3-2B](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVideoLLaMA3-2B) |\n| VideoLLaMA3-7B-Image | Qwen2.5-7B   | [DAMO-NLP-SG\u002FVideoLLaMA3-7B-Image](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVideoLLaMA3-7B-Image) |\n| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | [DAMO-NLP-SG\u002FVideoLLaMA3-2B-Image](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVideoLLaMA3-2B-Image) |\n\nWe also upload the tuned vision encoder of VideoLLaMA3-7B for wider application:\n\n| Model                         | Base Model                | HF Link                                                      |\n| ----------------------------- | ------------------------- | ------------------------------------------------------------ |\n| VideoLLaMA3-7B Vision Encoder | siglip-so400m-patch14-384 | [DAMO-NLP-SG\u002FVL3-SigLIP-NaViT](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVL3-SigLIP-NaViT) |\n\n## 🤖 Inference\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoProcessor\n\ndevice = \"cuda:0\"\nmodel_path = \"DAMO-NLP-SG\u002FVideoLLaMA3-7B\"\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_path,\n    trust_remote_code=True,\n    device_map={\"\": device},\n    torch_dtype=torch.bfloat16,\n    attn_implementation=\"flash_attention_2\",\n)\nprocessor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)\n\nconversation = [\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"video\", \"video\": {\"video_path\": \".\u002Fassets\u002Fcat_and_chicken.mp4\", \"fps\": 1, \"max_frames\": 180}},\n            {\"type\": \"text\", \"text\": \"What is the cat doing?\"},\n        ]\n    },\n]\n\ninputs = processor(\n    conversation=conversation,\n    add_system_prompt=True,\n    add_generation_prompt=True,\n    return_tensors=\"pt\"\n)\ninputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}\nif \"pixel_values\" in inputs:\n    inputs[\"pixel_values\"] = inputs[\"pixel_values\"].to(torch.bfloat16)\noutput_ids = model.generate(**inputs, max_new_tokens=1024)\nresponse = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()\nprint(response)\n```\n\nFor more cases, please refer to [examples](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002Finference\u002Fexample_videollama3.py).\n\n### CookBook\nCheckout [inference notebooks](inference\u002Fnotebooks\u002F) that demonstrate how to use VideoLLaMA3 on various applications such as single-image 
understanding, multi-image understanding, visual referring and grounding, video understanding, etc.\n\n| Notebooks                | Description   |\n| :-------------------- | ------------------------------------------------------------------------ |\n| [Image Understanding](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002Finference\u002Fnotebooks\u002F01_single_image_understanding.ipynb)      | Demonstrations of using VideoLLaMA 3 for **general image understanding**, **chart analysis**, **table understanding**, **document recognition**, and **visual code analysis**|\n| [Multi-image Understanding](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002Finference\u002Fnotebooks\u002F02_multi_image_understanding.ipynb)       | Demonstrations of using VideoLLaMA 3 for **multi-image comparison and understanding** |\n| [Fine-grained Image Recognition & Understanding](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002Finference\u002Fnotebooks\u002F03_visual_referring_and_grounding.ipynb) | Demonstrations of using VideoLLaMA 3 for **visual referring & grounding** |\n| [Video Understanding](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002Finference\u002Fnotebooks\u002F04_video_understanding.ipynb) | Demonstrations of using VideoLLaMA 3 for **general video understanding**, **long video understanding** and **temporal grounding** |\n\n\n## 🤗 Demo\n\nIt is highly recommended to try our [online demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flixin4ever\u002FVideoLLaMA3) first.\n\nOtherwise, you can launch a gradio app locally:\n\n```bash\npython inference\u002Flaunch_gradio_demo.py --model-path DAMO-NLP-SG\u002FVideoLLaMA3-7B\n\noptions:\n  --model-path MODEL_PATH, --model_path MODEL_PATH\n  --server-port SERVER_PORT, --server_port SERVER_PORT\n  \tOptional. Port of the model server.\n  --interface-port INTERFACE_PORT, --interface_port INTERFACE_PORT\n  \tOptional. Port of the gradio interface.\n  --nproc NPROC\n  \tOptional. Number of model processes.\n```\n\n## 🗝️ Training\n\n### Step 1: Prepare training data\nTo use our training code, please organize the image and video data as you like under `data_root`, and then use one or more annotation files to record each conversation data and the corresponding image\u002Fvideo path. 
For example:\n```bash\ndata_root\n├── LLaVA-Video-178K\n│   ├── video_1.mp4\n│   └── ...\n├── LLaVA-OneVision-Data\n│   ├── image_1.jpg\n│   └── ...\n├── annotations_video.jsonl\n├── annotations_image.jsonl\n└── ...\n```\nThe annotation files consist of a list of dictionaries, where each item follows this format:\n```json\n[\n    {\n        \"image\": [\"images\u002Fxxx.jpg\"],\n        \"conversations\": [\n            {\n                \"from\": \"human\",\n                \"value\": \"\u003Cimage>\\nWhat are the colors of the bus in the image?\"\n            },\n            {\n                \"from\": \"gpt\",\n                \"value\": \"The bus in the image is white and red.\"\n            },\n            ...\n        ]\n    },\n    {\n        \"video\": [\"videos\u002Fxxx.mp4\"],\n        \"conversations\": [\n            {\n                \"from\": \"human\",\n                \"value\": \"\u003Cvideo>\\nWhat are the main activities that take place in the video?\"\n            },\n            {\n                \"from\": \"gpt\",\n                \"value\": \"The main activities that take place in the video are the preparation of camera equipment by a man, a group of men riding a helicopter, and a man sailing a boat through the water.\"\n            },\n            ...\n        ]\n    },\n    ...\n]\n```\nFor loading and memory efficiency, we recommend using `.jsonl` files in the [huggingface datasets](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Floading) format.\n### Step 2: (Optional) Convert HF checkpoint\nIf you want to finetune VideoLLaMA3 on your own data using this codebase, please first convert the checkpoints from huggingface to local format. For example:\n```bash\npython scripts\u002Fconvert_hf_checkpoint.py --model_path DAMO-NLP-SG\u002FVideoLLaMA3-7B --save_path weights\u002Fvideollama3_7b_local\n```\n### Step 3: Prepare training script\nWe provide some templates in `scripts\u002Ftrain` for all stages. You can modify the variables in them to fit your data and model settings. 
For example:\n```bash\n  --data_folder .\u002Fdatasets \\\n  --data_path .\u002Fdatasets\u002Fannotations_video.jsonl .\u002Fdatasets\u002Fannotations_image.jsonl \\\n  --model_path Qwen\u002FQwen2.5-1.5B-Instruct \\\n  --vision_encoder DAMO-NLP-SG\u002FSigLIP-NaViT \\\n```\nFor finetuning, `--model_path` is the path to the converted checkpoint as described in step 2.\n### Step 4: Start training\nNow you can start training with your training scripts:\n```bash\n# VideoLLaMA3 Stage 1\nbash scripts\u002Ftrain\u002Fstage1_2b.sh\n# VideoLLaMA3 Stage 2\nbash scripts\u002Ftrain\u002Fstage2_2b.sh\n```\n### Some tips about CUDA OOM errors:\n- Please try the latest main branch, where we optimized the memory consumption in [this commit](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fcommit\u002F21268660a67c115c6d6c6620780515626193af0f).\n- Try DeepSpeed [ZeRO-2\u002F3](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fdeepspeed) by passing `--deepspeed scripts\u002Fzero2.json \u002F zero3.json`.\n- Reduce the max number of visual tokens (high-resolution images and videos will be automatically downsampled to fit this length) and max length of sequences (sequences longer than this will be truncated) by setting `--mm_max_length` and `--model_max_length`, respectively.\n- Reduce the local batch size, i.e., `LOCAL_BATCH_SIZE` in the training script.\nYou can adjust the above hyperparameters according to the available GPU memory and number of GPUs to make the training fit your hardware.\n- **(New!)** If you still encounter memory issues after using the above tricks, you can try using an **experimental** feature by setting `--use_flash_loss True` in your training script. Specifically, it uses a tile-based CE implementation proposed in [Inf-CL](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FInf-CLIP) to reduce the memory consumption, which is very helpful when training models with long context or large vocabulary!\n\n\n## ✅ Evaluation\n#### Step 1: Prepare evaluation data\nFirst, please download the corresponding data according to the official instructions and organize it into the following format:\n\u003Cdetails>\n\u003Csummary>Click here to view the dataset directory organization\u003C\u002Fsummary>\n\n```bash\nbenchmarks\n└── video\n│   ├── activitynet_qa\n│   │   ├── all_test\n│   │   ├── test_a.json\n│   │   └── test_q.json\n│   ├── charades\n│   │   ├── Charades_v1\n│   │   └── charades_annotations_test-random_prompt.json\n│   ├── egoschema\n│   │   ├── good_clips_git\n│   │   └── questions.json\n│   ├── longvideobench\n│   │   ├── lvb_val.json\n│   │   ├── subtitles\n│   │   └── videos\n│   ├── lvbench\n│   │   ├── video\n│   │   └── video_info.meta.jsonl\n│   ├── mlvu\n│   │   ├── json\n│   │   └── video\n│   ├── mvbench\n│   │   ├── json\n│   │   └── video\n│   ├── nextqa\n│   │   ├── map_vid_vidorID.json\n│   │   ├── NExTVideo\n│   │   └── test.csv\n│   ├── perception_test\n│   │   ├── mc_question_test.json\n│   │   └── videos\n│   ├── tempcompass\n│   │   ├── captioning\n│   │   ├── caption_matching\n│   │   ├── multi-choice\n│   │   ├── videos\n│   │   └── yes_no\n│   ├── videomme\n│   │   ├── subtitles\n│   │   ├── test-00000-of-00001.parquet\n│   │   └── videos\n```\n\n\u003C\u002Fdetails>\n\n#### Step 2: Start evaluation\n```bash\nbash scripts\u002Feval\u002Feval_video.sh ${MODEL_PATH} ${BENCHMARKS} ${NUM_NODES} ${NUM_GPUS}\n```\nYou can change the directory of benchmarks and outputs via `DATA_ROOT` and `SAVE_DIR` in the evaluation script. 
Please check the scripts for more detailed usage.\n\n#### Step 3: Add new benchmark\nComing soon...\n\n\n## 📑 Citation\n\nIf you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:\n\n```bibtex\n@article{damonlpsg2025videollama3,\n  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},\n  author={Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao},\n  journal={arXiv preprint arXiv:2501.13106},\n  year={2025},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.13106}\n}\n\n@article{damonlpsg2024videollama2,\n  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},\n  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},\n  journal={arXiv preprint arXiv:2406.07476},\n  year={2024},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07476}\n}\n\n@article{damonlpsg2023videollama,\n  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},\n  author = {Zhang, Hang and Li, Xin and Bing, Lidong},\n  journal = {arXiv preprint arXiv:2306.02858},\n  year = {2023},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.02858}\n}\n```\n\n## 👍 Acknowledgement\nOur VideoLLaMA3 is built on top of [**SigLip**](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) and [**Qwen2.5**](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5). We also learned a lot from the implementation of [**LLaVA-OneVision**](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-NeXT), [**InternVL2**](https:\u002F\u002Finternvl.github.io\u002Fblog\u002F2024-07-02-InternVL-2.0\u002F), and [**Qwen2VL**](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2-VL). Besides, our VideoLLaMA3 benefits from tons of open-source efforts. We sincerely appreciate these efforts and compile a list in [ACKNOWLEDGEMENT.md](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002FACKNOWLEDGEMENT.md) to express our gratitude. If your work is used in VideoLLaMA3 but not mentioned in either this repo or the technical report, feel free to let us know :heart:.\n\n\n## 🔒 License\n\nThis project is released under the Apache 2.0 license as found in the LICENSE file.\nThe service is a research preview intended for **non-commercial use ONLY**, subject to the model Licenses of Qwen, Terms of Use of the data generated by OpenAI and Gemini, and Privacy Practices of ShareGPT. 
Please get in touch with us if you find any potential violations.\n","\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_readme_d5300b997889.png\" width=\"150\" style=\"margin-bottom: 0.2;\"\u002F>\n\u003Cp>\n\n\u003Ch3 align=\"center\">\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.13106\" style=\"color:#9C276A\">\nVideoLLaMA 3：面向视频理解的前沿多模态基础模型\u003C\u002Fa>\u003C\u002Fh3>\n\u003Ch5 align=\"center\"> 如果我们的项目对您有所帮助，请在 GitHub 上给我们点个赞 ⭐，以支持我们。🙏🙏 \u003C\u002Fh2>\n\n\n\u003Ch5 align=\"center\">\n\n[![hf_space](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Image_Demo-9C276A.svg)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flixin4ever\u002FVideoLLaMA3-Image)\n[![hf_space](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Video_Demo-9C276A.svg)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flixin4ever\u002FVideoLLaMA3)\n[![hf_checkpoint](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Checkpoints-9C276A.svg)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FDAMO-NLP-SG\u002Fvideollama3-678cdda9281a0e32fe79af15) \u003Cbr>\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-yellow)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002FLICENSE) \n[![Hits](https:\u002F\u002Fhits.seeyoufarm.com\u002Fapi\u002Fcount\u002Fincr\u002Fbadge.svg?url=https%3A%2F%2Fgithub.com%2FDAMO-NLP-SG%2FVideoLLaMA3&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=Visitor&edge_flat=false)](https:\u002F\u002Fhits.seeyoufarm.com)\n[![GitHub issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002FDAMO-NLP-SG\u002FVideoLLaMA3?color=critical&label=Issues)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fissues?q=is%3Aopen+is%3Aissue)\n[![GitHub closed issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-closed\u002FDAMO-NLP-SG\u002FVideoLLaMA3?color=success&label=Issues)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fissues?q=is%3Aissue+is%3Aclosed)  \u003Cbr>\n[![hf_paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Paper%20In%20HF-red.svg)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2501.13106)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2501.13106-AD1C18.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.13106) \n\u003C\u002Fh5>\n\n\u003Cdetails open>\u003Csummary>💡 我们的团队还有其他多模态大模型项目，或许会令你感兴趣 ✨。\u003C\u002Fsummary>\u003Cp>\n\u003C!--  may -->\n\n> [**VideoLLaMA 2：推进视频大模型中的时空建模与音频理解**](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2) \u003Cbr>\n> Zesen Cheng*, Sicong Leng*, Hang Zhang*, Yifei Xin*, Xin Li*, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideoLLaMA2.svg?style=social)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2) [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2406.07476-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07476) \u003Cbe> \n\n> [**VideoRefer Suite：利用视频大模型推进时空对象理解**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.00599) \u003Cbr>\n> Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao 
Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoRefer)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideoRefer.svg?style=social)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoRefer)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2501.00599-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.00599) \u003Cbr>\n\n> [**VCD：通过视觉对比解码缓解大型视觉-语言模型中的对象幻觉问题**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16922) \u003Cbr>\n> Sicong Leng*, Hang Zhang*, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVCD)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVCD.svg?style=social)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVCD)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2311.16922-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16922) \u003Cbr>\n\n> [**多模态的诅咒：评估大型多模态模型在语言、视觉和音频方面的幻觉现象**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12787) \u003Cbr>\n> Sicong Leng*, Yun Xing*, Zesen Cheng*, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FCMM)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FCMM.svg?style=social)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FCMM)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2410.12787-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12787) \u003Cbr>\n\n> [**突破内存限制：近无限批量规模的对比损失**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17243) \u003Cbr>\n> Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FInf-CLIP)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FInf-CLIP.svg?style=social)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FInf-CLIP)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2410.17243-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17243) \u003Cbr>\n\n\n\n\n\u003C\u002Fp>\u003C\u002Fdetails>\n\n\n## 📰 新闻\n\n* **[2025.02.07]**  🔥🔥 发布我们重新标注的高质量图文数据集 [VL3-Syn7M](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDAMO-NLP-SG\u002FVL3-Syn7M)。\n* **[2025.01.26]**  🔥🔥 截至1月26日，VideoLLaMA3-7B是[Hugging Face空间上的LVBench排行榜](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTHUDM\u002FLVBench)上表现最好的7B参数量模型。\n* **[2025.01.24]**  🔥🔥 截至1月24日，VideoLLaMA3-7B是[VideoMME排行榜](https:\u002F\u002Fvideo-mme.github.io\u002Fhome_page.html)上表现最好的7B参数量模型。\n* **[2025.01.22]**  👋👋 发布VideoLLaMA 3的技术报告。如果您有与VideoLLaMA 3密切相关但未在论文中提及的工作，欢迎随时告知我们。\n* **[2025.01.21]**  发布VideoLLaMA 3的模型及推理代码。\n\n## 🌟 简介\nVideoLLaMA 3是一系列具有前沿图像和视频理解能力的多模态基础模型。\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_readme_ef01fa9413f6.png\" style=\"max-width: 100%; height: auto;\">\n\n\u003Cdetails>\n  
\u003Csummary>💡点击此处查看视频基准测试的详细性能\u003C\u002Fsummary>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_readme_e2e8d22bbea6.png\" style=\"max-width: 100%; height: auto;\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_readme_6880c21a6b61.png\" style=\"max-width: 100%; height: auto;\">\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>💡点击此处查看图像基准测试的详细性能\u003C\u002Fsummary>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_readme_7695d71577f1.png\" style=\"max-width: 100%; height: auto;\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_readme_3be8d03350f7.png\" style=\"max-width: 100%; height: auto;\">\n\u003C\u002Fdetails>\n\n## 🛠️ 需求与安装\n\n基本依赖：\n\n* Python >= 3.10\n* PyTorch >= 2.4.0\n* CUDA 版本 >= 11.8\n* transformers >= 4.46.3\n\n安装所需软件包：\n\n**[仅推理]**\n\n为确保稳定的推理效果，建议安装以下版本的软件包：\n\n```bash\n\n# 用于 CUDA 11.8 的 PyTorch 和 torchvision\npip install torch==2.4.0 torchvision==0.19.0 --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n\n# 将 flash-attn 锁定到兼容版本\npip install flash-attn==2.7.3 --no-build-isolation --upgrade\n\n# Transformers 和 accelerate\npip install transformers==4.46.3 accelerate==1.0.1\n\n# 视频处理依赖项\npip install decord ffmpeg-python imageio opencv-python\n```\n> ⚠ **注意:** 对于使用 `torch==2.4.0` 和 `torchvision==0.19.0` 的 CUDA 11.8，请使用 `flash-attn==2.7.3`。  \n> 如果您使用的是其他 Python 或 CUDA 版本，请查看 [flash-attn 发布页面](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\u002Freleases\u002F) 以选择兼容的轮子文件。使用不兼容的版本可能会导致设置失败。\n\n**[训练]**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\ncd VideoLLaMA3\npip install -r requirements.txt\npip install flash-attn --no-build-isolation\n```\n\n## :earth_americas: 模型库\n\n| 模型                | 基础模型   | HF 链接                                                      |\n| -------------------- | ------------ | ------------------------------------------------------------ |\n| VideoLLaMA3-7B       | Qwen2.5-7B   | [DAMO-NLP-SG\u002FVideoLLaMA3-7B](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVideoLLaMA3-7B) |\n| VideoLLaMA3-2B       | Qwen2.5-1.5B | [DAMO-NLP-SG\u002FVideoLLaMA3-2B](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVideoLLaMA3-2B) |\n| VideoLLaMA3-7B-Image | Qwen2.5-7B   | [DAMO-NLP-SG\u002FVideoLLaMA3-7B-Image](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVideoLLaMA3-7B-Image) |\n| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | [DAMO-NLP-SG\u002FVideoLLaMA3-2B-Image](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVideoLLaMA3-2B-Image) |\n\n我们还上传了 VideoLLaMA3-7B 的微调视觉编码器，以便更广泛的应用：\n\n| 模型                         | 基础模型                | HF 链接                                                      |\n| ----------------------------- | ------------------------- | ------------------------------------------------------------ |\n| VideoLLaMA3-7B 视觉编码器 | siglip-so400m-patch14-384 | [DAMO-NLP-SG\u002FVL3-SigLIP-NaViT](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVL3-SigLIP-NaViT) |\n\n## 🤖 推理\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoProcessor\n\ndevice = \"cuda:0\"\nmodel_path = \"DAMO-NLP-SG\u002FVideoLLaMA3-7B\"\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_path,\n    trust_remote_code=True,\n    device_map={\"\": device},\n    torch_dtype=torch.bfloat16,\n    
attn_implementation=\"flash_attention_2\",\n)\nprocessor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)\n\nconversation = [\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"video\", \"video\": {\"video_path\": \".\u002Fassets\u002Fcat_and_chicken.mp4\", \"fps\": 1, \"max_frames\": 180}},\n            {\"type\": \"text\", \"text\": \"What is the cat doing?\"},\n        ]\n    },\n]\n\ninputs = processor(\n    conversation=conversation,\n    add_system_prompt=True,\n    add_generation_prompt=True,\n    return_tensors=\"pt\"\n)\ninputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}\nif \"pixel_values\" in inputs:\n    inputs[\"pixel_values\"] = inputs[\"pixel_values\"].to(torch.bfloat16)\noutput_ids = model.generate(**inputs, max_new_tokens=1024)\nresponse = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()\nprint(response)\n```\n\n更多案例请参考 [示例](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002Finference\u002Fexample_videollama3.py)。\n\n### 烹饪书\n查看 [推理笔记本](inference\u002Fnotebooks\u002F)，它们展示了如何在各种应用中使用 VideoLLaMA3，例如单张图像理解、多张图像理解、视觉引用与定位、视频理解等。\n\n| 笔记本                | 描述   |\n| :-------------------- | ------------------------------------------------------------------------ |\n| [图像理解](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002Finference\u002Fnotebooks\u002F01_single_image_understanding.ipynb)      | 展示了使用 VideoLLaMA 3 进行 **通用图像理解**、**图表分析**、**表格理解**、**文档识别** 和 **视觉代码分析**|\n| [多张图像理解](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002Finference\u002Fnotebooks\u002F02_multi_image_understanding.ipynb)       | 展示了使用 VideoLLaMA 3 进行 **多张图像比较和理解** |\n| [细粒度图像识别与理解](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002Finference\u002Fnotebooks\u002F03_visual_referring_and_grounding.ipynb) | 展示了使用 VideoLLaMA 3 进行 **视觉引用与定位** |\n| [视频理解](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002Finference\u002Fnotebooks\u002F04_video_understanding.ipynb) | 展示了使用 VideoLLaMA 3 进行 **通用视频理解**、**长视频理解** 和 **时间定位** |\n\n\n## 🤗 演示\n\n强烈建议先尝试我们的 [在线演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flixin4ever\u002FVideoLLaMA3)。\n\n或者，您也可以在本地启动一个 Gradio 应用程序：\n\n```bash\npython inference\u002Flaunch_gradio_demo.py --model-path DAMO-NLP-SG\u002FVideoLLaMA3-7B\n\n选项：\n  --model-path MODEL_PATH, --model_path MODEL_PATH\n  --server-port SERVER_PORT, --server_port SERVER_PORT\n  \t可选。模型服务器的端口。\n  --interface-port INTERFACE_PORT, --interface_port INTERFACE_PORT\n  \t可选。Gradio 界面的端口。\n  --nproc NPROC\n  \t可选。模型进程的数量。\n```\n\n## 🗝️ 训练\n\n### 第一步：准备训练数据\n要使用我们的训练代码，请在 `data_root` 目录下按您喜欢的方式组织图像和视频数据，然后使用一个或多个注释文件来记录每段对话数据及其对应的图像\u002F视频路径。例如：\n```bash\ndata_root\n├── LLaVA-Video-178K\n│   ├── video_1.mp4\n│   └── ...\n├── LLaVA-OneVision-Data\n│   ├── image_1.jpg\n│   └── ...\n├── annotations_video.jsonl\n├── annotations_image.jsonl\n└── ...\n```\n注释文件由一系列字典组成，每个条目遵循以下格式：\n```json\n[\n    {\n        \"image\": [\"images\u002Fxxx.jpg\"],\n        \"conversations\": [\n            {\n                \"from\": \"human\",\n                \"value\": \"\u003Cimage>\\n图中公交车有哪些颜色？\"\n            },\n            {\n                \"from\": \"gpt\",\n                \"value\": \"图中的公交车是白色和红色的。\"\n            },\n            ...\n        ]\n    },\n    
{\n        \"video\": [\"videos\u002Fxxx.mp4\"],\n        \"conversations\": [\n            {\n                \"from\": \"human\",\n                \"value\": \"\u003Cvideo>\\n视频中主要发生了哪些活动？\"\n            },\n            {\n                \"from\": \"gpt\",\n                \"value\": \"视频中主要发生的活动包括一名男子在准备摄像设备、一群男子乘坐直升机以及一名男子驾船在水中航行。\"\n            },\n            ...\n        ]\n    },\n    ...\n]\n```\n为了加载效率和内存优化，我们建议使用符合 [huggingface datasets](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Floading) 格式的 `.jsonl` 文件。\n### 第二步：（可选）转换 HF 检查点\n如果您想使用本代码库在自己的数据上微调 VideoLLaMA3，请先将 Hugging Face 上的检查点转换为本地格式。例如：\n```bash\npython scripts\u002Fconvert_hf_checkpoint.py --model_path DAMO-NLP-SG\u002FVideoLLaMA3-7B --save_path weights\u002Fvideollama3_7b_local\n```\n### 第三步：准备训练脚本\n我们在 `scripts\u002Ftrain` 中提供了一些适用于各个阶段的模板。您可以根据这些模板修改变量，以适应您的数据和模型设置。例如：\n```bash\n  --data_folder .\u002Fdatasets \\\n  --data_path .\u002Fdatasets\u002Fannotations_video.jsonl .\u002Fdatasets\u002Fannotations_image.jsonl \\\n  --model_path Qwen\u002FQwen2.5-1.5B-Instruct \\\n  --vision_encoder DAMO-NLP-SG\u002FSigLIP-NaViT \\\n```\n对于微调，`--model_path` 是第 2 步中所述的已转换检查点的路径。\n### 第四步：开始训练\n现在您可以使用自己的训练脚本开始训练：\n```bash\n# VideoLLaMA3 第一阶段\nbash scripts\u002Ftrain\u002Fstage1_2b.sh\n# VideoLLaMA3 第二阶段\nbash scripts\u002Ftrain\u002Fstage2_2b.sh\n```\n### 关于 CUDA OOM 错误的一些提示：\n- 请尝试使用最新的主分支，我们在 [此提交](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fcommit\u002F21268660a67c115c6d6c6620780515626193af0f) 中优化了内存消耗。\n- 可以通过传递 `--deepspeed scripts\u002Fzero2.json \u002F zero3.json` 来尝试 DeepSpeed 的 [ZeRO-2\u002F3](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fdeepspeed)。\n- 通过分别设置 `--mm_max_length` 和 `--model_max_length`，减少视觉 token 的最大数量（高分辨率图像和视频会自动降采样以适应该长度）以及序列的最大长度（超过该长度的序列会被截断）。\n- 减少本地批次大小，即训练脚本中的 `LOCAL_BATCH_SIZE`。\n您可以根据可用的 GPU 内存和 GPU 数量调整上述超参数，使训练适配您的硬件。\n- **（新！）** 如果在使用上述方法后仍然遇到内存问题，您可以在训练脚本中设置 `--use_flash_loss True` 来尝试一项 **实验性** 功能。具体来说，它使用 [Inf-CL](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FInf-CLIP) 中提出的基于分块的交叉熵实现来降低内存消耗，这在训练具有长上下文或大词汇量的模型时非常有帮助！\n\n\n## ✅ 评估\n#### 第一步：准备评估数据\n首先，请根据官方说明下载相应数据，并将其组织成以下格式：\n\u003Cdetails>\n\u003Csummary>点击此处查看数据集目录结构\u003C\u002Fsummary>\n\n```bash\nbenchmarks\n└── video\n│   ├── activitynet_qa\n│   │   ├── all_test\n│   │   ├── test_a.json\n│   │   └── test_q.json\n│   ├── charades\n│   │   ├── Charades_v1\n│   │   └── charades_annotations_test-random_prompt.json\n│   ├── egoschema\n│   │   ├── good_clips_git\n│   │   └── questions.json\n│   ├── longvideobench\n│   │   ├── lvb_val.json\n│   │   ├── 字幕\n│   │   └── 视频\n│   ├── lvbench\n│   │   ├── 视频\n│   │   └── video_info.meta.jsonl\n│   ├── mlvu\n│   │   ├── json\n│   │   └── 视频\n│   ├── mvbench\n│   │   ├── json\n│   │   └── 视频\n│   ├── nextqa\n│   │   ├── map_vid_vidorID.json\n│   │   ├── NExTVideo\n│   │   └── test.csv\n│   ├── perception_test\n│   │   ├── mc_question_test.json\n│   │   └── 视频\n│   ├── tempcompass\n│   │   ├── captioning\n│   │   ├── caption_matching\n│   │   ├── 多选题\n│   │   ├── 视频\n│   │   └── 是\u002F否\n│   ├── videomme\n│   │   ├── 字幕\n│   │   ├── test-00000-of-00001.parquet\n│   │   └── 视频\n```\n\n\u003C\u002Fdetails>\n\n#### 第二步：开始评估\n```bash\nbash scripts\u002Feval\u002Feval_video.sh ${MODEL_PATH} ${BENCHMARKS} ${NUM_NODES} ${NUM_GPUS}\n```\n您可以通过评估脚本中的 `DATA_ROOT` 和 `SAVE_DIR` 来更改基准测试和输出的目录。更多详细用法请参阅相关脚本。\n\n#### 第三步：添加新的基准测试\n即将推出...\n\n\n## 📑 引用\n如果您发现 VideoLLaMA 对您的研究和应用有所帮助，请使用以下 BibTeX 
格式引用：\n\n```bibtex\n@article{damonlpsg2025videollama3,\n  title={VideoLLaMA 3: 面向图像和视频理解的前沿多模态基础模型},\n  author={张博强、李科涵、程泽森、胡志强、袁宇谦、陈冠政、冷思聪、蒋宇明、张航、李欣、金鹏、张文琪、王凡、邴立东、赵德利},\n  journal={arXiv 预印本 arXiv:2501.13106},\n  year={2025},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.13106}\n}\n\n@article{damonlpsg2024videollama2,\n  title={VideoLLaMA 2: 推进视频大模型中的时空建模与音频理解},\n  author={程泽森、冷思聪、张航、辛义飞、李欣、陈冠政、朱永鑫、张文琪、罗子阳、赵德利、邴立东},\n  journal={arXiv 预印本 arXiv:2406.07476},\n  year={2024},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07476}\n}\n\n@article{damonlpsg2023videollama,\n  title = {Video-LLaMA：面向视频理解的指令微调视听语言模型},\n  author = {张航、李欣、邴立东},\n  journal = {arXiv 预印本 arXiv:2306.02858},\n  year = {2023},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.02858}\n}\n```\n\n## 👍 致谢\n我们的 VideoLLaMA3 基于 [**SigLip**](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) 和 [**Qwen2.5**](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5) 构建。我们还从 [**LLaVA-OneVision**](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-NeXT)、[**InternVL2**](https:\u002F\u002Finternvl.github.io\u002Fblog\u002F2024-07-02-InternVL-2.0\u002F) 和 [**Qwen2VL**](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2-VL) 的实现中学习了很多。此外，VideoLLaMA3 还受益于大量开源项目的工作。我们由衷感谢这些贡献，并在 [ACKNOWLEDGEMENT.md](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fblob\u002Fmain\u002FACKNOWLEDGEMENT.md) 中列出了相关名单以表达我们的谢意。如果您的工作被用于 VideoLLaMA3 但未在此仓库或技术报告中提及，请随时告知我们 :heart:。\n\n\n## 🔒 许可协议\n\n本项目采用 Apache 2.0 许可协议发布，具体许可信息请参阅 LICENSE 文件。\n该服务为研究预览版，仅限 **非商业用途**，并受 Qwen 模型许可协议、OpenAI 和 Gemini 生成数据的使用条款以及 ShareGPT 隐私政策的约束。如您发现任何潜在违规行为，请与我们联系。","# VideoLLaMA3 快速上手指南\n\nVideoLLaMA 3 是阿里达摩院（DAMO-NLP-SG）推出的前沿多模态基础模型系列，具备卓越的图像和视频理解能力。本指南将帮助你快速配置环境并运行推理。\n\n## 1. 环境准备\n\n在开始之前，请确保你的系统满足以下基本要求：\n\n*   **Python**: >= 3.10\n*   **PyTorch**: >= 2.4.0\n*   **CUDA**: >= 11.8\n*   **Transformers**: >= 4.46.3\n\n> **注意**：为了获得稳定的推理性能，建议严格使用下文指定的库版本，特别是 `flash-attn` 与 PyTorch\u002FCUDA 版本的兼容性至关重要。\n\n## 2. 安装步骤\n\n仅用于推理（Inference-only）的用户，请按顺序执行以下命令安装依赖：\n\n```bash\n# 1. 安装 PyTorch 和 torchvision (针对 CUDA 11.8)\npip install torch==2.4.0 torchvision==0.19.0 --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n\n# 2. 安装 Flash-Attention (需指定兼容版本 2.7.3)\npip install flash-attn==2.7.3 --no-build-isolation --upgrade\n\n# 3. 安装 Transformers 和 Accelerate\npip install transformers==4.46.3 accelerate==1.0.1\n\n# 4. 安装视频处理相关依赖\npip install decord ffmpeg-python imageio opencv-python\n```\n\n> ⚠️ **重要提示**：如果你使用的 Python 或 CUDA 版本不同，请务必查阅 [flash-attn releases](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\u002Freleases\u002F) 以选择对应的 wheel 包，版本不匹配可能导致安装失败或运行时错误。\n\n## 3. 
基本使用\n\n以下是一个最简单的视频问答推理示例。该代码加载 `VideoLLaMA3-7B` 模型，并对输入视频内容进行提问。\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoProcessor\n\n# 设置设备\ndevice = \"cuda:0\"\n# 模型路径，也可替换为 DAMO-NLP-SG\u002FVideoLLaMA3-2B 等其它版本\nmodel_path = \"DAMO-NLP-SG\u002FVideoLLaMA3-7B\"\n\n# 加载模型\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_path,\n    trust_remote_code=True,\n    device_map={\"\": device},\n    torch_dtype=torch.bfloat16,\n    attn_implementation=\"flash_attention_2\",\n)\n\n# 加载处理器\nprocessor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)\n\n# 构建对话内容\nconversation = [\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n    {\n        \"role\": \"user\",\n        \"content\": [\n            # 视频输入配置：路径、帧率(fps)、最大帧数(max_frames)\n            {\"type\": \"video\", \"video\": {\"video_path\": \".\u002Fassets\u002Fcat_and_chicken.mp4\", \"fps\": 1, \"max_frames\": 180}},\n            {\"type\": \"text\", \"text\": \"What is the cat doing?\"},\n        ]\n    },\n]\n\n# 预处理输入\ninputs = processor(\n    conversation=conversation,\n    add_system_prompt=True,\n    add_generation_prompt=True,\n    return_tensors=\"pt\"\n)\n\n# 将输入移至 GPU 并转换数据类型\ninputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}\nif \"pixel_values\" in inputs:\n    inputs[\"pixel_values\"] = inputs[\"pixel_values\"].to(torch.bfloat16)\n\n# 生成回答\noutput_ids = model.generate(**inputs, max_new_tokens=1024)\nresponse = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()\n\nprint(response)\n```\n\n### 更多资源\n*   **模型仓库**：可在 Hugging Face 获取 [VideoLLaMA3-7B](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVideoLLaMA3-7B) 和 [VideoLLaMA3-2B](https:\u002F\u002Fhuggingface.co\u002FDAMO-NLP-SG\u002FVideoLLaMA3-2B) 等模型权重。\n*   **进阶示例**：更多应用场景（如单图\u002F多图理解、视觉定位、长视频理解等）请参考官方提供的 [Jupyter Notebooks](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Ftree\u002Fmain\u002Finference\u002Fnotebooks\u002F)。","某电商平台的智能客服团队正致力于升级其售后系统，希望实现用户直接上传“产品故障视频”后，AI 能自动分析故障原因并生成精准的维修建议或退换货指引。\n\n### 没有 VideoLLaMA3 时\n- **时空逻辑缺失**：传统模型仅能抽取关键帧进行静态图像识别，无法理解动作的先后顺序（如“先按下开关，指示灯闪烁三次后熄灭”），导致对动态故障过程的误判。\n- **细节捕捉不足**：对于视频中短暂出现的错误代码或细微的物理损坏（如接口松动瞬间），静态采样极易遗漏，造成诊断依据不全。\n- **人工复核成本高**：由于 AI 无法准确理解视频语境，客服团队仍需安排专人逐秒观看用户上传的视频来核实情况，处理效率低下且人力负担重。\n- **交互体验割裂**：用户被迫将视频内容转化为文字描述，不仅操作繁琐，还常因描述不准导致反复沟通，严重影响用户满意度。\n\n### 使用 VideoLLaMA3 后\n- **精准时空建模**：VideoLLaMA3 凭借强大的时空建模能力，能完整理解视频中动作的因果链条，准确识别出“操作顺序错误”或“间歇性故障”等复杂场景。\n- **细粒度视觉感知**：它能持续追踪视频流，敏锐捕捉转瞬即逝的屏幕报错信息或硬件异常状态，确保诊断依据的全面性与准确性。\n- **自动化闭环处理**：系统可直接基于视频内容生成结构化的故障诊断报告，大幅减少人工介入需求，使售后响应速度从小时级缩短至分钟级。\n- **自然多模态交互**：用户只需上传视频并简单提问，VideoLLaMA3 即可结合画面内容与自然语言进行深度推理，提供直观、易懂的解决方案。\n\nVideoLLaMA3 通过赋予系统真正的视频理解能力，将非结构化的视频数据转化为可执行的业务洞察，显著提升了智能客服的诊断精度与服务效率。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDAMO-NLP-SG_VideoLLaMA3_d5300b99.png","DAMO-NLP-SG","Language Technology Lab at Alibaba DAMO Academy","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FDAMO-NLP-SG_7176372c.png","",null,"https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG",[83,87,91],{"name":84,"color":85,"percentage":86},"Jupyter Notebook","#DA5B0B",91.5,{"name":88,"color":89,"percentage":90},"Python","#3572A5",8.4,{"name":92,"color":93,"percentage":94},"Shell","#89e051",0.1,1131,88,"2026-03-26T09:29:18","Apache-2.0","未说明","需要 NVIDIA GPU，支持 CUDA >= 11.8。推理示例使用 bfloat16 精度和 flash_attention_2，建议显存 16GB+（7B模型）或 8GB+（2B模型）。",{"notes":102,"python":103,"dependencies":104},"官方推荐用于稳定推理的特定版本组合为：CUDA 11.8 环境下使用 torch==2.4.0 和 
flash-attn==2.7.3。若使用其他 Python 或 CUDA 版本，需自行查阅 flash-attn 发布页选择兼容版本。训练环境需额外安装 requirements.txt 中的依赖。",">= 3.10",[105,106,107,108,109,110,111,112,113],"torch==2.4.0","torchvision==0.19.0","flash-attn==2.7.3","transformers==4.46.3","accelerate==1.0.1","decord","ffmpeg-python","imageio","opencv-python",[26,14,52,54],"2026-03-27T02:49:30.150509","2026-04-06T08:18:26.716641",[118,123,128,133],{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},11725,"如何解决 ffprobe\u002Fffmpeg 相关的报错问题？","如果在运行推理代码时遇到 ffmpeg 或 ffprobe 相关错误，建议首先在与代码相同的目录下运行命令行测试（例如 `ffmpeg -i your_file.mp4`），以确认 ffmpeg 本身是否正常工作。如果提示“无此目录”等错误，通常是因为当前工作目录不正确。由于 ffmpeg 出错会中断推理进程，如果文件目录结构混乱，建议尝试使用其他视频读取包代替 ffmpeg，以避免因目录问题导致推理完全中断。","https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fissues\u002F17",{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},11726,"如何在离线环境中运行 VideoLLaMA3 推理？","直接使用 `save_pretrained` 保存模型可能会导致部分文件缺失，从而引发 `ValueError: You need to specify either text or text_target` 等错误。推荐的离线设置步骤如下：\n1. 将 Hugging Face 缓存目录（如 `\u002Fhome\u002F\u003Cuser>\u002F.cache\u002Fhuggingface\u002Fhub\u002F...`）中该模型的所有内容复制到你想保存模型的本地目录。\n2. 设置环境变量 `HF_HUB_CACHE` 指向该本地目录。\n3. 在初始化 AutoModel 或 AutoProcessor 时，使用模型标识符（如 \"DAMO-NLP-SG\u002FVideoLLaMA3-7B\"）并添加 `local_files_only=True` 参数，以确保不从网络请求资源。\n示例代码：\n```python\nmodel_path = \"DAMO-NLP-SG\u002FVideoLLaMA3-7B\"\nprocessor = AutoProcessor.from_pretrained(model_path, local_files_only=True)\n```","https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fissues\u002F16",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},11727,"使用 convert_hf_checkpoint.py 转换模型后，加载时出现视觉编码器权重不匹配（vision_encoder weights mismatch）怎么办？","这个问题是由于本地模型和远程模型在类定义上的细微差异导致的。当使用 `AutoModelForCausalLM` 从 Hugging Face Hub 加载时，transformers 会自动加载远程代码并在 `transformers_modules` 中创建模型类；而加载本地转换后的权重时，使用的是本地代码库中的类。这种差异可能导致权重键名不匹配（例如出现重复的 `vision_encoder` 路径）。\n建议参考官方提供的 `videollama3\u002Finfer.py` 文件进行本地推理，以确保使用正确的模型类定义和加载方式。官方后续也会提供将微调后的模型上传回 Hugging Face 的脚本。","https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fissues\u002F66",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},11728,"VideoLLaMA3 支持批量推理（Batch Inference）吗？","目前 VideoLLaMA3 仅支持 `batch_size=1` 的推理。如果尝试设置 `batch_size=2` 或更高，会在 `prepare_inputs_labels_for_multimodal` 阶段报错。因此，当前版本不支持多视频或多提示词的批量并行推理，需要逐条进行处理。","https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA3\u002Fissues\u002F62",[]]