[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-JIA-Lab-research--LLaMA-VID":3,"tool-JIA-Lab-research--LLaMA-VID":64},[4,17,27,36,44,52],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159636,2,"2026-04-17T23:33:34",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 
是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[13,26,14,35],"视频",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":10,"last_commit_at":42,"category_tags":43,"status":16},8553,"spec-kit","github\u002Fspec-kit","Spec Kit 是一款专为提升软件开发效率而设计的开源工具包，旨在帮助团队快速落地“规格驱动开发”（Spec-Driven Development）模式。传统开发中，需求文档往往与代码实现脱节，导致沟通成本高且结果不可控；而 Spec Kit 通过将规格说明书转化为可执行的指令，让 AI 直接依据明确的业务场景生成高质量代码，从而减少从零开始的随意编码，确保产出结果的可预测性。\n\n该工具特别适合希望利用 AI 辅助编程的开发者、技术负责人及初创团队。无论是启动全新项目还是在现有工程中引入规范化流程，用户只需通过简单的命令行操作，即可初始化项目并集成主流的 AI 编程助手。其核心技术亮点在于“规格即代码”的理念，支持社区扩展与预设模板，允许用户根据特定技术栈定制开发流程。此外，Spec Kit 强调官方维护的安全性，提供稳定的版本管理，帮助开发者在享受 AI 红利的同时，依然牢牢掌握架构设计的主动权，真正实现从“凭感觉写代码”到“按规格建系统”的转变。",88749,"2026-04-17T09:48:14",[15,26,14,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 
提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":53,"name":54,"github_repo":55,"description_zh":56,"stars":57,"difficulty_score":10,"last_commit_at":58,"category_tags":59,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[26,60,35,61,14,62,15,13,63],"数据工具","插件","其他","音频",{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":76,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":78,"owner_website":78,"owner_url":79,"languages":80,"stars":100,"forks":101,"last_commit_at":102,"license":103,"difficulty_score":104,"env_os":105,"env_gpu":106,"env_ram":107,"env_deps":108,"category_tags":122,"github_topics":78,"view_count":10,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":123,"updated_at":124,"faqs":125,"releases":156},8924,"JIA-Lab-research\u002FLLaMA-VID","LLaMA-VID","LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)","LLaMA-VID 是一款专为长视频理解设计的开源多模态大模型，其核心理念是“一张图像仅需两个令牌”。它基于强大的 LLaVA 框架构建，旨在突破现有模型在处理长视频时的上下文长度限制，让 AI 能够真正“看懂”并讨论长达数小时的电影或视频内容。\n\n传统视频分析模型往往受限于显存和计算能力，难以处理超长序列，或者需要极高的计算成本。LLaMA-VID 通过独特的令牌生成策略，将视觉信息高效压缩为上下文令牌和内容令牌，极大地降低了长视频处理的资源消耗，成功将支持的视频时长上限推至小时级别。这一创新不仅保留了关键视觉细节，还显著提升了推理效率。\n\n该工具非常适合人工智能研究人员、开发者以及需要对长视频内容进行深度分析的技术团队使用。无论是开发智能视频助手、构建影视剧情问答系统，还是进行多模态学术研究，LLaMA-VID 
都提供了完整的训练代码、微调模型及数据集支持。凭借其高效的架构设计和在 ECCV 2024 上获得的认可，LLaMA-VID 为长视频智能理解领域提供了一个高性能且易于扩展的解决方案。","# LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models\n\n\u003Ca href='https:\u002F\u002Fllama-vid.github.io\u002F'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-Green'>\u003C\u002Fa>\n\u003Ca href='http:\u002F\u002F103.170.5.190:7864\u002F'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Demo-violet'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17043'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-Arxiv-red'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Fhuggingface.co\u002FYanweiLi'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Models-blue'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYanweiLi\u002FLLaMA-VID-Data'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Data-green'>\u003C\u002Fa>\n\n\n\nLLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token. We build this repo based on LLaVA.\n\n## Release\n- [24\u002F07\u002F04] 🔥 Our work has been accepted to ECCV 2024!\n- [23\u002F12\u002F05] 🔥 We release the full training and evaluation [model](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-7b-full-224-long-video), [data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYanweiLi\u002FLLaMA-VID-Data), and scripts to support movie chatting!\n- [23\u002F11\u002F29] 🔥 LLaMA-VID is coming! 
We release the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17043), [code](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID), [data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYanweiLi\u002FLLaMA-VID-Data), [models](https:\u002F\u002Fhuggingface.co\u002FYanweiLi), and [demo](https:\u002F\u002Fllama-vid.github.io\u002F) for LLaMA-VID!\n\n## Contents\n- [Demo](#demo)\n- [Install](#install)\n- [Model](#model)\n- [Preparation](#preparation)\n- [Train](#train)\n- [Evaluation](#evaluation)\n- [Examples](#examples)\n- [Citation](#citation)\n- [Acknowledgement](#acknowledgement)\n- [License](#license)\n\n## Demo\nWe provide some selected examples in this section. More examples can be found in our [project page](https:\u002F\u002Fllama-vid.github.io\u002F). Feel free to try our online [demo](https:\u002F\u002Fllama-vid.github.io\u002F)!\n\n\u003Cdiv align=center>\n\u003Cimg width=\"100%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FJIA-Lab-research_LLaMA-VID_readme_3eb58f23842d.png\"\u002F>\n\u003C\u002Fdiv>\n\n## Install\nPlease follow the instructions below to install the required packages.\n1. Clone this repository\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID.git\n```\n\n2. Install Package\n```bash\nconda create -n llamavid python=3.10 -y\nconda activate llamavid\ncd LLaMA-VID\npip install --upgrade pip  # enable PEP 660 support\npip install -e .\n```\n\n3. 
Install additional packages for training cases\n```bash\npip install ninja\npip install flash-attn --no-build-isolation\n```\n\n## Model\nLLaMA-VID simply contains three parts: encoder and decoder are adopted to produce visual embedding and text-guided features, respectively; \ncontext token and content token are transformed with the tailored token generation strategy; \ninstruction tuning is designed to unleash the potential of LLMs for image and video.\n\n\u003Cdiv align=center>\n\u003Cimg width=\"100%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FJIA-Lab-research_LLaMA-VID_readme_74987af7733e.png\"\u002F>\n\u003C\u002Fdiv>\n\nWe provide all our fully finetuned models on Stage 1 and 2 data (Long Video + Stage 3) for LLaMA-VID:\n\n| Type | Image Size | Max Token | Base LLM | Vision Encoder | Finetuning Data | Finetuning schedule | Download |\n|----------|----------|----------|----------|----------------|---------------|--------------------|------------------|\nImage only | 224 | 4K | Vicuna-7B-v1.5 | EVA-G | LLaVA1.5-Instruct | full_ft-1e | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-7b-full-224) |\nImage only | 336 | 4K | Vicuna-7B-v1.5 | EVA-G | LLaVA1.5-Instruct | full_ft-1e | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-7b-full-336) |\nImage only | 336 | 4K | Vicuna-13B-v1.5 | EVA-G | LLaVA1.5-Instruct | full_ft-1e | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-13b-full-336) |\nShort video | 224 | 4K | Vicuna-7B-v1.5 | EVA-G | LLaVA1.5-VideoChatGPT-Instruct | full_ft-1e | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-7b-full-224-video-fps-1) |\nShort video | 224 | 4K | Vicuna-13B-v1.5 | EVA-G | LLaVA1.5-VideoChatGPT-Instruct | full_ft-1e | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-13b-full-224-video-fps-1) |\nLong video | 224 | 64K | Vicuna-7B-v1.5 | EVA-G | LLaVA1.5-VideoChatGPT-Instruct + LongVideoQA | full_ft-1e | 
[ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-7b-full-224-long-video) |\n\nHere are the pretrained weights (text decoder + context attention + projector) on Stage 1 data only:\n| Type | Image Size | Max Token | Base LLM | Vision Encoder | Pretrain Data | Pretraining schedule | Download |\n|----------|----------|----------|----------|----------------|---------------|----------------------|------------------|\nImage only | 224 | 4K | Vicuna-7B-v1.5 | EVA-G | LCS-558K | 1e | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-7b-pretrain-224) |\nImage only | 336 | 4K | Vicuna-7B-v1.5 | EVA-G | LCS-558K | 1e | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-7b-pretrain-336) |\nImage only | 336 | 4K | Vicuna-13B-v1.5 | EVA-G | LCS-558K | 1e | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-13b-pretrain-336) |\nShort video | 224 | 4K | Vicuna-7B-v1.5 | EVA-G | LCS-558K-WebVid-232K | 1e | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-7b-pretrain-224-video-fps-1) |\nShort video | 224 | 4K | Vicuna-13B-v1.5 | EVA-G | LCS-558K-WebVid-232K | 1e | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-13b-pretrain-224-video-fps-1) |\n\n## Preparation\n### Dataset\nWe provide the processed image-based data for LLaMA-VID training. 
We organize the data in the format of LLaVA; please organize the training image-based data following [this](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FData.md) and the evaluation image-based data following [this](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FEvaluation.md).\nPlease put the pretrained data, finetuned data, and eval data in the `LLaMA-VID-Pretrain`, `LLaMA-VID-Finetune`, and `LLaMA-VID-Eval` subsets following [Structure](#structure).\n\nFor the video-based datasets, please download the 2.5M subset from [WebVid](https:\u002F\u002Fmaxbain.com\u002Fwebvid-dataset\u002F) and the ActivityNet dataset from the [official website](http:\u002F\u002Factivity-net.org\u002Fdownload.html) or [video-chatgpt](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT\u002Fblob\u002Fmain\u002Fdocs\u002Ftrain_video_chatgpt.md).\nIf you want to perform evaluation, please also download the corresponding files from [here](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT\u002Fblob\u002Fmain\u002Fquantitative_evaluation\u002FREADME.md). You can download MSVD-QA from [here](https:\u002F\u002Fmycuhk-my.sharepoint.com\u002F:u:\u002Fg\u002Fpersonal\u002F1155186668_link_cuhk_edu_hk\u002FEUNEXqg8pctPq3WZPHb4Fd8BYIxHO5qPCnU6aWsrV1O4JQ?e=guynwu) and MSRVTT-QA from [here](https:\u002F\u002Fmycuhk-my.sharepoint.com\u002F:u:\u002Fg\u002Fpersonal\u002F1155186668_link_cuhk_edu_hk\u002FEcEXh1HfTXhLrRnuwHbl15IBJeRop-d50Q90njHmhvLwtA?e=SE24eG).\n\nAs for long video tuning, please download the long video data from [MovieNet](https:\u002F\u002Fmovienet.github.io\u002F), the shot detection results from [here](https:\u002F\u002Fmycuhk-my.sharepoint.com\u002F:u:\u002Fg\u002Fpersonal\u002F1155186668_link_cuhk_edu_hk\u002FEYbaGk86_WNFm9YP45WVQ_oB0GGkusDNBRwQQ19vBy4z2A?e=cKbiHJ), and our constructed long video QA pairs from [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYanweiLi\u002FLLaMA-VID-Data). 
Place shot detection results under `LLaMA-VID-Finetune\u002Fmovienet\u002Ffiles` before preprocessing.\n\nFor meta info, please download the following files and organize them as in [Structure](#structure).\n\n| Data file name | Size |\n| --- | ---: |\n| [blip_laion_cc_sbu_558k.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fliuhaotian\u002FLLaVA-Pretrain\u002Fblob\u002Fmain\u002Fblip_laion_cc_sbu_558k.json) | 181M |\n| [llava_v1_5_mix665k.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fliuhaotian\u002FLLaVA-Instruct-150K\u002Fblob\u002Fmain\u002Fllava_v1_5_mix665k.json) | 1.03G |\n| [llava_558k_with_webvid.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYanweiLi\u002FLLaMA-VID-Data) | 254 MB |\n| [llava_v1_5_mix665k_with_video_chatgpt.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYanweiLi\u002FLLaMA-VID-Data) | 860 MB |\n| [llava_v1_5_mix665k_with_video_chatgpt_maxtime_5min.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYanweiLi\u002FLLaMA-VID-Data) | 860 MB |\n| [long_videoqa.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYanweiLi\u002FLLaMA-VID-Data) | 260MB |\n\n### Pretrained Weights\nWe recommend users to download the pretrained weights from the following link [Vicuna-7b-v1.5](https:\u002F\u002Fhuggingface.co\u002Flmsys\u002Fvicuna-7b-v1.5), [Vicuna-13b-v1.5](https:\u002F\u002Fhuggingface.co\u002Flmsys\u002Fvicuna-13b-v1.5), [EVA-ViT-G](https:\u002F\u002Fstorage.googleapis.com\u002Fsfr-vision-language-research\u002FLAVIS\u002Fmodels\u002FBLIP2\u002Feva_vit_g.pth), [QFormer-7b](https:\u002F\u002Fstorage.googleapis.com\u002Fsfr-vision-language-research\u002FLAVIS\u002Fmodels\u002FInstructBLIP\u002Finstruct_blip_vicuna7b_trimmed.pth), [QFormer-13b](https:\u002F\u002Fstorage.googleapis.com\u002Fsfr-vision-language-research\u002FLAVIS\u002Fmodels\u002FInstructBLIP\u002Finstruct_blip_vicuna13b_trimmed.pth) and put them in `model_zoo` following [Structure](#structure).\n\n\n### Structure\n\nThe folder 
structure should be organized as follows before training.\n\n```\nLLaMA-VID\n├── llamavid\n├── scripts\n├── work_dirs\n│   ├── llama-vid\n│   │   ├── llama-vid-13b-full-336\n│   │   ├── ...\n├── model_zoo\n│   ├── LLM\n│   │   ├── vicuna\n│   │   │   ├── 7B-V1.5\n│   │   │   ├── 13B-V1.5\n│   ├── LAVIS\n│   │   ├── eva_vit_g.pth\n│   │   ├── instruct_blip_vicuna7b_trimmed.pth\n│   │   ├── instruct_blip_vicuna13b_trimmed.pth\n├── data\n│   ├── LLaMA-VID-Pretrain\n│   │   ├── blip_laion_cc_sbu_558k.json\n│   │   ├── llava_558k_with_webvid.json\n│   │   ├── images\n│   │   ├── videos\n│   ├── LLaMA-VID-Finetune\n│   │   ├── llava_v1_5_mix665k.json\n│   │   ├── llava_v1_5_mix665k_maxround_6_total_921k.json\n│   │   ├── llava_v1_5_mix665k_maxround_12_total_714k.json\n│   │   ├── llava_v1_5_mix665k_with_video_chatgpt.json\n│   │   ├── llava_v1_5_mix665k_with_video_chatgpt_maxtime_5min.json\n│   │   ├── long_videoqa.json\n│   │   ├── movienet\n│   │   ├── activitynet\n│   │   ├── coco\n│   │   ├── gqa\n│   │   ├── ocr_vqa\n│   │   ├── textvqa\n│   │   ├── vg\n│   ├── LLaMA-VID-Eval\n│   │   ├── gqa\n│   │   ├── ...\n```\n\n## Train\n\nLLaMA-VID training consists of three stages: (1) feature alignment stage: bridge the vision and language tokens; (2) instruction tuning stage: teach the model to follow multimodal instructions; (3) long video tuning stage: extend the position embedding and teach the model to follow hour-long video instructions.\n\nLLaMA-VID is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. 
Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.\n\nPlease make sure you download and organize the data following [Preparation](#preparation) before training.\n\n### Image Only\n\nIf you only want to train and finetune LLaMA-VID on image-based data, please run the following command for Vicuna-7B with image size 336:\n\n```bash\nbash scripts\u002Fimage_only\u002Ftrain\u002Fstage_1_2_full_v7b_336.sh\n```\nor for Vicuna-13B with image size 336:\n```bash\nbash scripts\u002Fimage_only\u002Ftrain\u002Fstage_1_2_full_v13b_336.sh\n```\nYou can also try a smaller image size of 224 with fewer visual tokens:\n```bash\nbash scripts\u002Fimage_only\u002Ftrain\u002Fstage_1_2_full_v7b_224_grid_4.sh\n```\nPlease find more training scripts in `scripts\u002Fimage_only\u002Ftrain`.\n\n### Short Video\nIf you are interested in training and finetuning LLaMA-VID on short video-based data, please run the following command for Vicuna-7B with image size 224:\n```bash\nbash scripts\u002Fvideo\u002Ftrain\u002Fstage_1_2_full_v7b_224_fps_1.sh\n```\nor for Vicuna-13B with image size 224:\n```bash\nbash scripts\u002Fvideo\u002Ftrain\u002Fstage_1_2_full_v13b_224_fps_1.sh\n```\nPlease find more training scripts in `scripts\u002Fvideo\u002Ftrain`.\n\n### Long Video\nWe provide the dataset and scripts for long video-based training. Please download the long video-based data following [Preparation](#preparation) and organize them as in [Structure](#structure).\nIn the training stage, we first extract all the frames from the long video and save the visual features locally for efficient training. 
\n```bash\n# --files_dir takes the files in the downloaded MovieNet.tar.gz\npython scripts\u002Fextra_tool\u002Fextract_movienet_features.py \\\n    --video_dir \u003Cpath to movienet video> \\\n    --files_dir \u003Cpath to movienet files> \\\n    --feat_dir \u003Cpath to output features>\n```\n\nThen run the following command for Vicuna-7B with image size 224:\n```bash\nbash scripts\u002Fvideo\u002Ftrain\u002Fstage_3_full_v7b_224_longvid.sh\n```\n\n## Evaluation\nWe perform evaluation on both image-based and video-based benchmarks. Please download the evaluation data following [Preparation](#preparation) and organize them as in [Structure](#structure).\n\n### Image Only\n| LLM | Res. | Model | GQA | MMB | MME | POPE | SEED | SQA-Image | VizWiz | VQA v2 |\n|----------|----------|-----------|---|---|---|---|---|---|---|---|\nVicuna-7B | 224 | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-7b-full-224) | 63.0 | 65.3 | 1405.6 | 86.6 | 59.7 | 67.7 | 52.5 | 78.3 |\nVicuna-7B | 336 | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-7b-full-336) | 64.3 | 65.1 | 1521.4 | 86.0 | 59.9 | 68.3 | 54.2 | 79.3 |\nVicuna-13B | 336 | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-13b-full-336) | 65.0 | 66.6 | 1542.3 | 86.0 | 62.3 | 70.0 | 54.3 | 80.0 |\n\nIf you want to evaluate the model on image-based benchmarks, please use the scripts in `scripts\u002Fimage_only\u002Feval`.\nFor example, run the following command for GQA evaluation:\n```bash\nbash scripts\u002Fimage_only\u002Feval\u002Fgqa.sh\n```\nPlease find more evaluation scripts in `scripts\u002Fimage_only\u002Feval`.\n\n### Video\n| LLM | Res. 
| Model | MSVD-QA | MSRVTT-QA | ActivityNet-QA | Correctness | Detail | Context | Temporal | Consistency |\n|----------|----------|-----------|---|---|---|---|---|---|---|---|\nVicuna-7B | 224 | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-7b-full-224-video-fps-1) | 69.7 | 57.7 | 47.4 | 2.96 | 3.00 | 3.53 | 2.46 | 2.51 |\nVicuna-13B | 224 | [ckpt](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-13b-full-224-video-fps-1) | 70.0 | 58.9 | 47.5 | 3.07 | 3.05 | 3.60 | 2.58 | 2.63 |\n\nIf you want to evaluate the model on video-based benchmarks, please use the scripts in `scripts\u002Fvideo\u002Feval`.\nFor example, run the following command for MSVD-QA evaluation:\n```bash\nbash scripts\u002Fvideo\u002Feval\u002Fmsvd_eval.sh\n```\nPlease find more evaluation scripts in `scripts\u002Fvideo\u002Feval`.\n\n### CLI Inference\nChat with images and videos using LLaMA-VID without the need for a Gradio interface. It also supports multiple GPUs, as well as 4-bit and 8-bit quantized inference.\nPlease try this for image or video inference:\n\n```bash\npython -m llamavid.serve.cli \\\n    --model-path work_dirs\u002Fllama-vid\u002Fllama-vid-7b-full-336 \\\n    --image-file \u003Cpath to your image>\n```\n\nor try this for video inference:\n```bash\npython -m llamavid.serve.cli \\\n    --model-path work_dirs\u002Fllama-vid\u002Fllama-vid-7b-full-224-video-fps-1 \\\n    --image-file \u003Cpath to your video> \\\n    --temperature 0.5\n```\n\nYou can also try 4-bit or 8-bit quantization for more efficient inference:\n```bash\npython -m llamavid.serve.cli \\\n    --model-path work_dirs\u002Fllama-vid\u002Fllama-vid-7b-full-224-video-fps-1 \\\n    --image-file \u003Cpath to your video> \\\n    --temperature 0.5 \\\n    --load-4bit\n```\n\n### Long Video Inference\nFor long videos, if you want to run inference on videos from MovieNet, please first process the video data and subtitles like this:\n```bash\n# --files_dir takes the files in the downloaded MovieNet.tar.gz\npython scripts\u002Fextra_tool\u002Fextract_movienet_features.py \\\n    --video_dir \u003Cpath to movienet video> \\\n    --files_dir \u003Cpath to movienet files> \\\n    --feat_dir \u003Cpath to output features>\n```\n\nIf you want to run inference on your own customized video, please first process the video data and subtitles like this:\n```bash\npython scripts\u002Fextra_tool\u002Fextract_video_features_subtitles.py \\\n    --video_file \u003Cpath to customized video> \\\n    --feat_dir \u003Cpath to output features>\n```\n\nThen, please try this for long video inference:\n```bash\npython llamavid\u002Fserve\u002Frun_llamavid_movie.py \\\n    --model-path work_dirs\u002Fllama-vid\u002Fllama-vid-7b-full-224-long-video \\\n    --video-file \u003Cpath to your processed video file> \\\n    --load-4bit\n```\n\n### Gradio Web UI\n\nHere we adopt a Gradio UI similar to that of LLaVA to provide a user-friendly interface for LLaMA-VID.\nTo launch a Gradio demo locally, please run the following commands one by one. 
If you plan to launch multiple model workers to compare between different checkpoints, you only need to launch the controller and the web server *ONCE*.\n\n#### Launch a controller\n```Shell\npython -m llamavid.serve.controller --host 0.0.0.0 --port 10000\n```\n\n#### Launch a gradio web server.\n```Shell\npython -m llamavid.serve.gradio_web_server --controller http:\u002F\u002Flocalhost:10000 --model-list-mode reload\n```\nYou just launched the Gradio web interface. Now, you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list. Do not worry, as we have not launched any model worker yet. It will be automatically updated when you launch a model worker.\n\n#### Launch a model worker\nThis is the actual *worker* that performs the inference on the GPU.  Each worker is responsible for a single model specified in `--model-path`.\n\n```Shell\npython -m llamavid.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path work_dirs\u002Fllama-vid\u002Fllama-vid-vicuna-7b-short\n```\nWait until the process finishes loading the model and you see \"Uvicorn running on ...\".  Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.\n\nYou can launch as many workers as you want, and compare between different models in the same Gradio interface. For example, short video model here. Please keep the `--controller` the same, and modify the `--port` and `--worker` to a different port number for each worker.\n```Shell\npython -m llamavid.serve.model_worker_short --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port \u003Cdifferent from 40000, say 40001> --worker http:\u002F\u002Flocalhost:\u003Cchange accordingly, i.e. 
40001> --model-path work_dirs\u002Fllama-vid\u002Fllama-vid-7b-full-224-video-fps-1\n```\n\nIf you are using an Apple device with an M1 or M2 chip, you can specify the mps device by using the `--device` flag: `--device mps`.\n\n#### Launch a model worker (Multiple GPUs, when GPU VRAM \u003C= 24GB)\n\nIf the VRAM of your GPU is less than 24GB (e.g., RTX 3090, RTX 4090, etc.), you may try running it with multiple GPUs. Our latest code base will automatically try to use multiple GPUs if you have more than one GPU. You can specify which GPUs to use with `CUDA_VISIBLE_DEVICES`. Below is an example of running with the first two GPUs.\n\n```Shell\nCUDA_VISIBLE_DEVICES=0,1 python -m llamavid.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path work_dirs\u002Fllama-vid\u002Fllama-vid-7b-full-224-long-video\n```\n\n#### Launch a model worker (4-bit, 8-bit inference, quantized)\n\nYou can launch the model worker with quantized bits (4-bit, 8-bit), which allows you to run the inference with reduced GPU memory footprint. Note that inference with quantized bits may not be as accurate as the full-precision model. Simply append `--load-4bit` or `--load-8bit` to the **model worker** command that you are executing. Below is an example of running with 4-bit quantization.\n\n```Shell\npython -m llamavid.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path work_dirs\u002Fllama-vid\u002Fllama-vid-7b-full-224-long-video --load-4bit\n```\n\n## Examples\nWe provide some examples in this section. 
More examples can be found on our [project page](https:\u002F\u002Fllama-vid.github.io\u002F).\n\n\u003Cdiv align=center>\n\u003Cimg width=\"100%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FJIA-Lab-research_LLaMA-VID_readme_12d12f2067ed.png\"\u002F>\n\u003C\u002Fdiv>\n\n## Citation\nIf you find this repo useful for your research, please consider citing the paper:\n```\n@inproceedings{li2024llamavid,\n  title={LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models},\n  author={Li, Yanwei and Wang, Chengyao and Jia, Jiaya},\n  booktitle={European Conference on Computer Vision},\n  year={2024}\n}\n```\n\n## Acknowledgement\nWe would like to thank the following repos for their great work:\n\n- This work is built upon [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA).\n- This work utilizes LLMs from [Vicuna](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat).\n- This work utilizes pretrained weights from [InstructBLIP](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS).\n- We perform video-based evaluation following [Video-ChatGPT](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT).\n\n## License\n[![Code License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-Apache_2.0-yellow.svg)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID\u002Fblob\u002Fmain\u002FLICENSE)\n[![Data License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FData%20License-CC%20By%20NC%204.0-orange.svg)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID\u002Fblob\u002Fmain\u002FDATA_LICENSE)\n[![Weight License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeight%20License-CC%20By%20NC%204.0-red)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID\u002Fblob\u002Fmain\u002FWEIGHT_LICENSE)\n\nThe data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, LLaMA, Vicuna, and GPT-4. 
The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.\n","# LLaMA-VID：在大型语言模型中，一张图像值2个token\n\n\u003Ca href='https:\u002F\u002Fllama-vid.github.io\u002F'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-Green'>\u003C\u002Fa>\n\u003Ca href='http:\u002F\u002F103.170.5.190:7864\u002F'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Demo-violet'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17043'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-Arxiv-red'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Fhuggingface.co\u002FYanweiLi'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Models-blue'>\u003C\u002Fa>\n\u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYanweiLi\u002FLLaMA-VID-Data'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Data-green'>\u003C\u002Fa>\n\n\n\nLLaMA-VID使现有框架能够支持长达一小时的视频，并通过额外的上下文token将其上限进一步提升。我们基于LLaVA构建了这个仓库。\n\n## 发布\n- [24\u002F07\u002F04] 🔥 我们的工作已被ECCV 2024接收！\n- [23\u002F12\u002F05] 🔥 我们发布了完整的训练和评估[模型](https:\u002F\u002Fhuggingface.co\u002FYanweiLi\u002Fllama-vid-7b-full-224-long-video)、[数据](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYanweiLi\u002FLLaMA-VID-Data)以及支持电影聊天的脚本！\n- [23\u002F11\u002F29] 🔥 LLaMA-VID即将发布！我们发布了[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17043)、[代码](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID)、[数据](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYanweiLi\u002FLLaMA-VID-Data)、[模型](https:\u002F\u002Fhuggingface.co\u002FYanweiLi)和[演示](https:\u002F\u002Fllama-vid.github.io\u002F)！\n\n## 目录\n- [演示](#demo)\n- [安装](#install)\n- [模型](#model)\n- [准备](#preparation)\n- [训练](#train)\n- [评估](#evaluation)\n- [示例](#examples)\n- [引用](#citation)\n- [致谢](#acknowledgement)\n- 
[License](#license)

## Demo

We provide some selected examples in this section. More examples can be found on our [project page](https://llama-vid.github.io/). Feel free to try our online [demo](https://llama-vid.github.io/)!

<div align=center>
<img width="100%" src="https://oss.gittoolsai.com/images/JIA-Lab-research_LLaMA-VID_readme_3eb58f23842d.png"/>
</div>

## Install

Please follow the instructions below to install the required packages.

1. Clone this repository
```bash
git clone https://github.com/dvlab-research/LLaMA-VID.git
```

2. Install packages
```bash
conda create -n llamavid python=3.10 -y
conda activate llamavid
cd LLaMA-VID
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

3. Install additional packages for training
```bash
pip install ninja
pip install flash-attn --no-build-isolation
```

## Model

LLaMA-VID consists of three parts: an encoder and a decoder that produce the visual embedding and the text-guided features, respectively; a context token and a content token transformed by the tailored token generation strategy; and instruction tuning designed to unlock the potential of LLMs for image and video understanding.

<div align=center>
<img width="100%" src="https://oss.gittoolsai.com/images/JIA-Lab-research_LLaMA-VID_readme_74987af7733e.png"/>
</div>

We provide all our LLaMA-VID models fully finetuned on stage 1 and stage 2 data (plus stage 3 for long video):

| Type | Image Size | Max Token | Base LLM | Vision Encoder | Finetuning Data | Finetuning Schedule | Download |
|------|------------|-----------|----------|----------------|-----------------|---------------------|----------|
| Image only | 224 | 4K | Vicuna-7B-v1.5 | EVA-G | LLaVA1.5-Instruct | full_ft-1e | [ckpt](https://huggingface.co/YanweiLi/llama-vid-7b-full-224) |
| Image only | 336 | 4K | Vicuna-7B-v1.5 | EVA-G | LLaVA1.5-Instruct | full_ft-1e | [ckpt](https://huggingface.co/YanweiLi/llama-vid-7b-full-336) |
| Image only | 336 | 4K | Vicuna-13B-v1.5 | EVA-G | LLaVA1.5-Instruct | full_ft-1e | [ckpt](https://huggingface.co/YanweiLi/llama-vid-13b-full-336) |
| Short video | 224 | 4K | Vicuna-7B-v1.5 | EVA-G | LLaVA1.5-VideoChatGPT-Instruct | full_ft-1e | [ckpt](https://huggingface.co/YanweiLi/llama-vid-7b-full-224-video-fps-1) |
| Short video | 224 | 4K | Vicuna-13B-v1.5 | EVA-G | LLaVA1.5-VideoChatGPT-Instruct | full_ft-1e | [ckpt](https://huggingface.co/YanweiLi/llama-vid-13b-full-224-video-fps-1) |
| Long video | 224 | 64K | Vicuna-7B-v1.5 | EVA-G | LLaVA1.5-VideoChatGPT-Instruct + LongVideoQA | full_ft-1e | [ckpt](https://huggingface.co/YanweiLi/llama-vid-7b-full-224-long-video) |

The following weights are pretrained on stage 1 data only (text decoder + context attention + projector):

| Type | Image Size | Max Token | Base LLM | Vision Encoder | Pretraining Data | Pretraining Schedule | Download |
|------|------------|-----------|----------|----------------|------------------|----------------------|----------|
| Image only | 224 | 4K | Vicuna-7B-v1.5 | EVA-G | LCS-558K | 1e | [ckpt](https://huggingface.co/YanweiLi/llama-vid-7b-pretrain-224) |
| Image only | 336 | 4K | Vicuna-7B-v1.5 | EVA-G | LCS-558K | 1e | [ckpt](https://huggingface.co/YanweiLi/llama-vid-7b-pretrain-336) |
| Image only | 336 | 4K | Vicuna-13B-v1.5 | EVA-G | LCS-558K | 1e | [ckpt](https://huggingface.co/YanweiLi/llama-vid-13b-pretrain-336) |
| Short video | 224 | 4K | Vicuna-7B-v1.5 | EVA-G | LCS-558K-WebVid-232K | 1e | [ckpt](https://huggingface.co/YanweiLi/llama-vid-7b-pretrain-224-video-fps-1) |
| Short video | 224 | 4K | Vicuna-13B-v1.5 | EVA-G | LCS-558K-WebVid-232K | 1e | [ckpt](https://huggingface.co/YanweiLi/llama-vid-13b-pretrain-224-video-fps-1) |

## Preparation

### Dataset
We provide the processed image-based data used for LLaMA-VID training. The data is organized in the LLaVA format: please follow [this document](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md) to organize the training image data, and [this document](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md) to organize the evaluation image data.
Please put the pretraining data, finetuning data, and evaluation data in the `LLaMA-VID-Pretrain`, `LLaMA-VID-Finetune`, and `LLaMA-VID-Eval` subdirectories, following the layout described in [Structure](#structure).

For video datasets, please download the 2.5M-entry subset from [WebVid](https://maxbain.com/webvid-dataset/), and download the ActivityNet dataset from the [official website](http://activity-net.org/download.html) or 
[video-chatgpt](https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/docs/train_video_chatgpt.md).
For evaluation, please also download the corresponding files from [here](https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/quantitative_evaluation/README.md). MSVD-QA can be downloaded from [here](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155186668_link_cuhk_edu_hk/EUNEXqg8pctPq3WZPHb4Fd8BYIxHO5qPCnU6aWsrV1O4JQ?e=guynwu), and MSRVTT-QA from [here](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155186668_link_cuhk_edu_hk/EcEXh1HfTXhLrRnuwHbl15IBJeRop-d50Q90njHmhvLwtA?e=SE24eG).

For long video finetuning, please download the long videos from [MovieNet](https://movienet.github.io/), the shot detection results from [here](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155186668_link_cuhk_edu_hk/EYbaGk86_WNFm9YP45WVQ_oB0GGkusDNBRwQQ19vBy4z2A?e=cKbiHJ), and our constructed long-video QA pairs from [here](https://huggingface.co/datasets/YanweiLi/LLaMA-VID-Data). Before preprocessing, place the shot detection results under `LLaMA-VID-Finetune/movienet/files`.

For meta info, please download the following files and organize them as described in [Structure](#structure).

| Data file name | Size |
| --- | ---: |
| [blip_laion_cc_sbu_558k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/blip_laion_cc_sbu_558k.json) | 181 MB |
| [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json) | 1.03 GB |
| [llava_558k_with_webvid.json](https://huggingface.co/datasets/YanweiLi/LLaMA-VID-Data) | 254 MB |
| [llava_v1_5_mix665k_with_video_chatgpt.json](https://huggingface.co/datasets/YanweiLi/LLaMA-VID-Data) | 860 MB |
| [llava_v1_5_mix665k_with_video_chatgpt_maxtime_5min.json](https://huggingface.co/datasets/YanweiLi/LLaMA-VID-Data) | 860 MB |
| [long_videoqa.json](https://huggingface.co/datasets/YanweiLi/LLaMA-VID-Data) | 260 MB |
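If you build annotation files of your own (for example for custom videos), it can save a failed run to sanity-check them against the layout of the JSON files above before training. The sketch below assumes the common LLaVA-style record layout (an `id`, an `image` or `video` path, and a `conversations` list of `from`/`value` turns); the authoritative reference for the exact schema is the provided `llava_558k_with_webvid.json` itself.

```python
# Sketch: sanity-check annotation records against the LLaVA-style layout.
# The keys below are assumptions based on the common LLaVA format; compare
# against llava_558k_with_webvid.json for the authoritative schema.
def check_record(record):
    """Return a list of problems found in one annotation record."""
    problems = []
    if "id" not in record:
        problems.append("missing 'id'")
    if "image" not in record and "video" not in record:
        problems.append("missing 'image'/'video' path")
    convs = record.get("conversations")
    if not isinstance(convs, list) or not convs:
        problems.append("missing or empty 'conversations'")
    else:
        for i, turn in enumerate(convs):
            if not {"from", "value"} <= set(turn):
                problems.append(f"turn {i} lacks 'from'/'value'")
    return problems
```

Running this over every record of a generated JSON file before training surfaces format problems early, instead of partway through data loading.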

### Pretrained Weights
We recommend downloading the pretrained weights from the following links: [Vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5), [Vicuna-13b-v1.5](https://huggingface.co/lmsys/vicuna-13b-v1.5), [EVA-ViT-G](https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth), [QFormer-7b](https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth), and [QFormer-13b](https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna13b_trimmed.pth), and place them in the `model_zoo` directory following the layout in [Structure](#structure).

### Structure

The folder structure should be organized as follows before training:

```
LLaMA-VID
├── llamavid
├── scripts
├── work_dirs
│   ├── llama-vid
│   │   ├── llama-vid-13b-full-336
│   │   ├── ...
├── model_zoo
│   ├── LLM
│   │   ├── vicuna
│   │   │   ├── 7B-V1.5
│   │   │   ├── 13B-V1.5
│   ├── LAVIS
│   │   ├── eva_vit_g.pth
│   │   ├── instruct_blip_vicuna7b_trimmed.pth
│   │   ├── instruct_blip_vicuna13b_trimmed.pth
├── data
│   ├── LLaMA-VID-Pretrain
│   │   ├── blip_laion_cc_sbu_558k.json
│   │   ├── llava_558k_with_webvid.json
│   │   ├── images
│   │   ├── videos
│   ├── LLaMA-VID-Finetune
│   │   ├── llava_v1_5_mix665k.json
│   │   ├── llava_v1_5_mix665k_maxround_6_total_921k.json
│   │   ├── llava_v1_5_mix665k_maxround_12_total_714k.json
│   │   ├── llava_v1_5_mix665k_with_video_chatgpt.json
│   │   ├── llava_v1_5_mix665k_with_video_chatgpt_maxtime_5min.json
│   │   ├── long_videoqa.json
│   │   ├── movienet
│   │   ├── activitynet
│   │   ├── coco
│   │   ├── gqa
│   │   ├── ocr_vqa
│   │   ├── textvqa
│   │   ├── vg
│   ├── LLaMA-VID-Eval
│   │   ├── gqa
│   │   ├── ...
```
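A quick preflight check against the layout above can catch missing weights or data before a long job starts. This is a minimal sketch; the required entries listed here are a representative subset taken from the tree above, so extend the list to match the stage you are training.

```python
from pathlib import Path

# Representative subset of the required layout (see the Structure tree);
# extend this list to match your training stage.
REQUIRED = [
    "model_zoo/LLM/vicuna/7B-V1.5",
    "model_zoo/LAVIS/eva_vit_g.pth",
    "data/LLaMA-VID-Pretrain",
    "data/LLaMA-VID-Finetune",
]

def missing_paths(root, required=REQUIRED):
    """Return the required entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).exists()]
```

Run it from the repository root and fix anything it reports before launching a training script.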
## Train

LLaMA-VID is trained in three stages: (1) the feature alignment stage, which bridges vision and language tokens; (2) the instruction tuning stage, which teaches the model to follow multimodal instructions; and (3) the long video tuning stage, which extends the position embeddings and teaches the model to follow hour-long video instructions.

LLaMA-VID is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` × `gradient_accumulation_steps` × `num_gpus`.

Please make sure you download and organize the data following [Preparation](#preparation) before training.

### Image only
If you only want to train and finetune LLaMA-VID on image data, run the following command for Vicuna-7B with image size 336:

```bash
bash scripts/image_only/train/stage_1_2_full_v7b_336.sh
```
or for Vicuna-13B with image size 336:

```bash
bash scripts/image_only/train/stage_1_2_full_v13b_336.sh
```
You can also try the smaller image size of 224 with fewer visual tokens:

```bash
bash scripts/image_only/train/stage_1_2_full_v7b_224_grid_4.sh
```
More training scripts can be found in `scripts/image_only/train`.

### Short video
If you are interested in training and finetuning LLaMA-VID on short video data, run the following command for Vicuna-7B with image size 224:

```bash
bash scripts/video/train/stage_1_2_full_v7b_224_fps_1.sh
```
or for Vicuna-13B with image size 224:

```bash
bash scripts/video/train/stage_1_2_full_v13b_224_fps_1.sh
```
More training scripts can be found in `scripts/video/train`.

### Long video
We provide the dataset and scripts for long video training. Please download the long video data following [Preparation](#preparation) and organize it as described in [Structure](#structure).
During the training stage, we first extract all frames from the long videos and save the visual features locally to improve training efficiency:
```bash
# --files_dir: files from the downloaded MovieNet.tar.gz
python scripts/extra_tool/extract_movienet_features.py \
    --video_dir <path to movienet videos> \
    --files_dir <path to movienet files> \
    --feat_dir <path to output features>
```

Then, run the following command for Vicuna-7B with image size 224:
```bash
bash scripts/video/train/stage_3_full_v7b_224_longvid.sh
```

## Evaluation

We perform evaluation on both image-based and video-based benchmarks. Please download the evaluation data following [Preparation](#preparation) and organize it as described in [Structure](#structure).

### Image only
| LLM | Res. | Model | GQA | MMB | MME | POPE | SEED | SQA-Image | VizWiz | VQA v2 |
|----------|----------|-----------|---|---|---|---|---|---|---|---|
| Vicuna-7B | 224 | [ckpt](https://huggingface.co/YanweiLi/llama-vid-7b-full-224) | 63.0 | 65.3 | 1405.6 | 86.6 | 59.7 | 67.7 | 52.5 | 78.3 |
| Vicuna-7B | 336 | [ckpt](https://huggingface.co/YanweiLi/llama-vid-7b-full-336) | 64.3 | 65.1 | 1521.4 | 86.0 | 59.9 | 68.3 | 54.2 | 79.3 |
| Vicuna-13B | 336 | [ckpt](https://huggingface.co/YanweiLi/llama-vid-13b-full-336) | 65.0 | 66.6 | 1542.3 | 86.0 | 62.3 | 70.0 | 54.3 | 80.0 |

To evaluate the model on image-based benchmarks, please use the scripts in `scripts/image_only/eval`.
For example, run the following command for GQA evaluation:
```bash
bash scripts/image_only/eval/gqa.sh
```
See `scripts/image_only/eval` for more evaluation scripts.

### Video
| LLM | Res. | Model | MSVD-QA | MSRVTT-QA | ActivityNet-QA | Correctness | Detail | Context | Temporal | Consistency |
|----------|----------|-----------|---|---|---|---|---|---|---|---|
| Vicuna-7B | 224 | [ckpt](https://huggingface.co/YanweiLi/llama-vid-7b-full-224-video-fps-1) | 69.7 | 57.7 | 47.4 | 2.96 | 3.00 | 3.53 | 2.46 | 2.51 |
| Vicuna-13B | 224 | [ckpt](https://huggingface.co/YanweiLi/llama-vid-13b-full-224-video-fps-1) | 70.0 | 58.9 | 47.5 | 3.07 | 3.05 | 3.60 | 2.58 | 2.63 |

To evaluate the model on video-based benchmarks, please use the scripts in `scripts/video/eval`.
For example, run the following command for MSVD-QA evaluation:
```bash
bash scripts/video/eval/msvd_eval.sh
```
See `scripts/video/eval` for more evaluation scripts.

### CLI Inference

Chat with LLaMA-VID about images and videos without a Gradio interface. It also supports multi-GPU inference as well as 4-bit and 8-bit quantization.
Try the following command for image inference:

```bash
python -m llamavid.serve.cli \
    --model-path work_dirs/llama-vid/llama-vid-7b-full-336 \
    --image-file <path to your image>
```

Or try the following command for video inference:
```bash
python -m llamavid.serve.cli \
    --model-path work_dirs/llama-vid/llama-vid-7b-full-224-video-fps-1 \
    --image-file <path to your video> \
    --temperature 0.5
```

You can also try 4-bit or 8-bit for efficient inference:
```bash
python -m llamavid.serve.cli \
    --model-path work_dirs/llama-vid/llama-vid-7b-full-224-video-fps-1 \
    --image-file <path to your video> \
    --temperature 0.5 \
    --load-4bit
```
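The memory saving from `--load-4bit`/`--load-8bit` can be roughed out from the parameter count alone. This is a back-of-the-envelope sketch that counts model weights only (in decimal GB) and ignores activations, the KV cache, and the vision tower, so treat the results as lower bounds:

```python
def weight_memory_gb(n_params_billion, bits):
    """Rough weights-only memory estimate in decimal GB.

    Ignores activations, KV cache, and the vision encoder, so the real
    footprint is higher; useful only for comparing precisions.
    """
    bytes_total = n_params_billion * 1e9 * bits / 8
    return bytes_total / 1e9
```

For a 7B base LLM this gives 14 GB of weights at fp16 versus 3.5 GB at 4-bit, which is why quantized inference fits on a single consumer GPU.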
### Long Video Inference
For long videos, if you want to run inference on videos from MovieNet, first process the video data and subtitles as follows:
```bash
# --files_dir: files from the downloaded MovieNet.tar.gz
python scripts/extra_tool/extract_movienet_features.py \
    --video_dir <path to movienet videos> \
    --files_dir <path to movienet files> \
    --feat_dir <path to output features>
```

If you want to run inference on your own custom videos, also process the video data and subtitles first:
```bash
python scripts/extra_tool/extract_video_features_subtitles.py \
    --video_file <path to your custom video> \
    --feat_dir <path to output features>
```

Then, try the following command for long video inference:
```bash
python llamavid/serve/run_llamavid_movie.py \
    --model-path work_dirs/llama-vid/llama-vid-7b-full-224-long-video \
    --video-file <path to your processed video file> \
    --load-4bit
```

### Gradio Web UI

Here we adopt a Gradio interface similar to LLaVA's to provide a user-friendly web UI for LLaMA-VID. To launch a Gradio demo locally, run the following commands one by one. If you plan to launch multiple model workers to compare different checkpoints, you only need to launch the controller and the web server *once*.

#### Launch a controller
```Shell
python -m llamavid.serve.controller --host 0.0.0.0 --port 10000
```

#### Launch a Gradio web server
```Shell
python -m llamavid.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
```
You have now launched the Gradio web interface. You can open it at the URL printed on the screen. You may notice that the model list is still empty; that is because no model worker has been launched yet. The list will update automatically once you launch a model worker.

#### Launch a model worker

This is the actual *worker* that performs the inference on the GPU. Each worker is responsible for a single model specified in `--model-path`.

```Shell
python -m llamavid.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path work_dirs/llama-vid/llama-vid-vicuna-7b-short
```
Wait until the process finishes loading the model and "Uvicorn running on ..." appears, then refresh your Gradio web interface; the model you just launched will show up in the model list.

You can launch as many workers as you want and compare different models in the same Gradio interface. For example, here is a short-video model. Keep the `--controller` argument the same, and set `--port` and `--worker` to different port numbers for each worker:
```Shell
python -m llamavid.serve.model_worker_short --host 0.0.0.0 --controller http://localhost:10000 --port <different from 40000, e.g. 40001> --worker http://localhost:<change to 40001 accordingly> --model-path \
    work_dirs/llama-vid/llama-vid-7b-full-224-video-fps-1
```

If you are using an Apple device with an M1 or M2 chip, you can specify the MPS device with the `--device` flag: `--device mps`.

#### Launch a model worker (multiple GPUs, when GPU VRAM <= 24GB)

If your GPU has less than 24GB of VRAM (e.g., RTX 3090, RTX 4090), you can try running it with multiple GPUs. Our latest codebase automatically tries to use multiple GPUs when more than one is available. You can specify which GPUs to use with `CUDA_VISIBLE_DEVICES`. Below is an example of running with the first two GPUs:

```Shell
CUDA_VISIBLE_DEVICES=0,1 python -m llamavid.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path work_dirs/llama-vid/llama-vid-7b-full-224-long-video
```

#### Launch a model worker (4-bit, 8-bit quantized inference)

You can launch a model worker with quantized bits (4-bit, 8-bit), which reduces GPU memory usage during inference. Note that inference with quantized bits may be less accurate than with the full-precision model. Simply append `--load-4bit` or `--load-8bit` to the **model worker** command you are executing. Below is an example of running with 4-bit quantization:

```Shell
python -m llamavid.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path work_dirs/llama-vid/llama-vid-7b-full-224-long-video --load-4bit
```

## Examples

We provide some examples in this section. More examples can be found on our [project page](https://llama-vid.github.io/).

<div align=center>
<img width="100%" src="https://oss.gittoolsai.com/images/JIA-Lab-research_LLaMA-VID_readme_12d12f2067ed.png"/>
</div>

## Citation

If you find this repo useful for your research, please consider citing the paper:
```
@inproceedings{li2024llamavid,
  title={LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models},
  author={Li, Yanwei and Wang, Chengyao and Jia, Jiaya},
  booktitle={European Conference on Computer Vision},
  year={2024}
}
```

## Acknowledgement

We would like to thank the following projects for their great work:

- This work is built upon [LLaVA](https://github.com/haotian-liu/LLaVA).
- This work utilizes LLMs from [Vicuna](https://github.com/lm-sys/FastChat).
- This work utilizes pretrained weights from [InstructBLIP](https://github.com/salesforce/LAVIS).
- We refer to 
[Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT) for the video evaluation approach.

## License

[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-yellow.svg)](https://github.com/dvlab-research/LLaMA-VID/blob/main/LICENSE)
[![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-orange.svg)](https://github.com/dvlab-research/LLaMA-VID/blob/main/DATA_LICENSE)
[![Weight License](https://img.shields.io/badge/Weight%20License-CC%20By%20NC%204.0-red)](https://github.com/dvlab-research/LLaMA-VID/blob/main/WEIGHT_LICENSE)

The data and checkpoints of this project are intended for and restricted to research use only, and are subject to the corresponding licenses. They are also bound by the license agreements of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY NC 4.0, which allows only non-commercial use; models trained using this dataset must likewise not be used outside of research purposes.

---

# LLaMA-VID Quick Start Guide

LLaMA-VID is a multimodal LLM built on the LLaVA architecture and designed for long videos (up to hours in length). With its tailored token generation strategy, it compresses each image into very few context tokens, dramatically extending the model's video understanding capability.

## 1. Environment

Before starting, make sure your development environment meets the following requirements:

*   **OS**: Linux (Ubuntu 20.04+ recommended)
*   **GPU**: at least 1 NVIDIA GPU recommended (>= 24GB VRAM; 8x A100 80GB recommended for training)
*   **CUDA**: a CUDA driver matching your PyTorch version
*   **Package manager**: Conda (Miniconda or Anaconda recommended)
*   **Python**: 3.10

## 2. Installation

Clone the code and install the dependencies as follows.

### 2.1 Clone the repository
```bash
git clone https://github.com/dvlab-research/LLaMA-VID.git
cd LLaMA-VID
```

### 2.2 Create a virtual environment and install the base packages
```bash
conda create -n llamavid python=3.10 -y
conda activate llamavid
pip install --upgrade pip
pip install -e .
```

### 2.3 Install training-only dependencies
If you plan to train or finetune models, additionally install `ninja` and `flash-attn` to speed up computation:
```bash
pip install ninja
pip install flash-attn --no-build-isolation
```
> **Tip**: if downloading `flash-attn` is slow from mainland China, try a PyPI mirror such as the Tsinghua or Aliyun index, e.g. `pip install flash-attn --no-build-isolation -i https://pypi.tuna.tsinghua.edu.cn/simple`.

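Before moving on, a small self-check can confirm the assumptions this guide makes (Python 3.10, and whether the optional training dependencies are importable). This is a hypothetical helper, not part of the repository:

```python
import sys
import importlib.util

def environment_report():
    """Collect a few quick facts about the environment this guide assumes.

    Hypothetical helper for self-diagnosis; not part of LLaMA-VID itself.
    """
    return {
        "python_3_10": sys.version_info[:2] == (3, 10),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }
```

If `flash_attn_installed` is `False` you can still run inference; it is only required for the training path above.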
## 3. Basic Usage

The core strength of LLaMA-VID is long-video understanding. Before use, download the pretrained weights and datasets.

### 3.1 Download pretrained weights and data
Depending on your needs (image only, short video, or long video), download the corresponding model weights from Hugging Face. For example, for the long-video model:
*   **Model weights**: [llama-vid-7b-full-224-long-video](https://huggingface.co/YanweiLi/llama-vid-7b-full-224-long-video)
*   **Dataset**: [LLaMA-VID-Data](https://huggingface.co/datasets/YanweiLi/LLaMA-VID-Data)

Place the downloaded weight files under the `model_zoo` folder in the project root; the data layout looks like this:
```text
LLaMA-VID
├── model_zoo
│   ├── LLM
│   │   └── vicuna
│   │       └── 7B-V1.5  # Vicuna weights
│   ├── LAVIS
│   │   ├── eva_vit_g.pth
│   │   └── instruct_blip_vicuna7b_trimmed.pth
├── data
│   ├── LLaMA-VID-Finetune
│   │   └── long_videoqa.json # long-video QA data
│   └── ...
```
*(For the exact file organization, strictly follow the "Structure" section of the official README.)*

### 3.2 Run inference/evaluation
The project ships with wrapper scripts for different scenarios. Below is an example command for evaluating the long-video finetuned model (configure the data paths first):

```bash
# Example: run the long-video evaluation script
# (adjust the arguments to the actual paths of your downloaded data)
python llamavid/eval/model_video_chatgpt.py \
    --model-path /path/to/llama-vid-7b-full-224-long-video \
    --video-dir /path/to/video/files \
    --conv-mode llama_vid
```

### 3.3 Launch training (optional)
To train from scratch or finetune the model, use the provided shell scripts. Below is an example of full finetuning on long-video data with Vicuna-7B as the base:

1.  **Preprocess long-video features** (extract frames and cache visual features to speed up training):
    ```bash
    python scripts/extra_tool/extract_movienet_features.py \
        --video_dir <path to movienet video> \
        --files_dir <path to movienet files> \
        --feat_dir <path to output features>
    ```

2.  
**Run training**:
    ```bash
    # Run the long-video training script
    # (make sure you have enough VRAM; the default config targets multi-GPU setups)
    bash scripts/video/train/stage_3_full_v7b_224_longvid.sh
    ```
    *Note: if VRAM is limited, edit the script to reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly, keeping the global batch size unchanged.*

---
*For more detailed usage, demos, and the full dataset documentation, visit the [LLaMA-VID project page](https://llama-vid.github.io/).*

---

A film-analysis team needs an AI assistant that can understand and answer complex questions about the plot of a two-hour movie, such as mapping character relationships or locating specific scenes.

### Without LLaMA-VID
- **Limited context length**: traditional multimodal models cannot handle long videos; the film has to be cut into countless short clips, so the AI cannot connect plot threads between the beginning and the end.
- **Severe information loss**: to fit the model's input limit, the sampling frame rate must be drastically reduced, missing key actions or changes in expression and producing inaccurate answers.
- **High inference cost**: long videos generate a huge number of visual tokens, GPU memory usage explodes, ordinary GPUs cannot run the workload at all, and expensive cluster resources become the only option.
- **Fragmented interaction**: users cannot hold a coherent "conversation about the movie"; every question is answered as if about an unrelated set of images, with no sense of the overall narrative timeline.

### With LLaMA-VID
- **Hour-scale videos supported**: LLaMA-VID's token strategy compresses an entire movie into very few tokens (2 tokens per frame: 1 context token + 1 content token), directly supporting a 64K context window.
- **Full plot understanding**: the model can "watch" the whole video in one pass and accurately answer causal questions spanning hours, such as "How was the protagonist's final decision shaped by events at the start of the film?"
- **Lower hardware bar**: the extreme compression ratio makes long-video analysis feasible even on a single consumer GPU, greatly reducing deployment cost.
- **Natural, coherent dialogue**: users can chat with the AI about a movie the way they would with a friend, following up on details at any time; LLaMA-VID responds coherently based on the entire film.

By compressing long videos into very few tokens, LLaMA-VID removes the long-content bottleneck of multimodal LLMs and makes "talking to an entire movie" a reality.

---

## Requirements and Installation Notes

- **OS**: Linux
- **GPU**: required; the official training environment is 8x NVIDIA A100 (80GB). Running on fewer GPUs is supported by adjusting `batch_size` and `gradient_accumulation_steps`. `flash-attn` must be installed, which generally requires a CUDA environment.
- **System memory**: not specified (32GB+ is generally advisable for 7B/13B models).

Installation notes:
1. Installation must be done in a conda environment created with Python 3.10.
2. 
Add the `--no-build-isolation` flag when installing flash-attn.
3. Before long-video training, extract video-frame features in advance to save GPU memory.
4. Pretrained weights (the Vicuna LLM, the EVA-ViT-G vision encoder, the QFormer, etc.) must be downloaded manually and placed in the designated directories.
5. The dataset structure is complex; organize images, videos, and annotation files strictly according to the directory layout in the README.

- **Python**: 3.10
- **Main dependencies**: torch, transformers, flash-attn, ninja, accelerate, deepspeed, peft, scikit-learn, sentencepiece

---

## FAQ

**Q: Which datasets (e.g., COCO, GQA) are used for finetuning, and how should they be downloaded and formatted?**

A: Although the "Dataset" preparation section does not spell this out, the `data/LLaMA-VID-Finetune/` directory does contain these datasets. If you hit errors like `Error in loading...` during training, it usually means your generated data is in the wrong format. Make sure your video data exactly matches the format of the `llava_558k_with_webvid.json` file provided with the repo. ([source](https://github.com/JIA-Lab-research/LLaMA-VID/issues/30))

**Q: How can I visualize the high-response areas in an image? The code seems to be missing, or errors out with mismatched dimensions.**

A: The visualization code needs some manual dimension handling. After obtaining the image scores of the context embedding, the shape is typically (1, 32, 576). First take the mean over the second dimension, reducing the shape to (1, 576). Then use interpolation (e.g., `nn.functional.interpolate`) to resize it to the input image size and render it as a heatmap. ([source](https://github.com/JIA-Lab-research/LLaMA-VID/issues/26))

**Q: Why does the long-video model after stage 3 training score lower on MSVD-QA than the short-video model?**

A: This is a known phenomenon. When evaluating with GPT-3.5-turbo, the authors' short-video model (`llama-vid-7b-full-224-video-fps-1`) scores about 0.69 on MSVD, while the long-video model (`llama-vid-7b-full-224-long-video`) scores about 0.52. A drop of roughly 0.2 is within the normal range, and the exact numbers may fluctuate slightly with the evaluation setup. ([source](https://github.com/JIA-Lab-research/LLaMA-VID/issues/58))

**Q: How do I train with DeepSpeed Zero3 when GPU memory is limited (40GB)? What about parameter-loading errors or training hangs?**

A: 1. **On hangs**: DeepSpeed Zero3 does not support an imbalanced load. If the batch sizes differ across GPUs, training hangs. It is recommended to use `zero2_offload` in stage 2 and freeze the Q-former.
2. 
**On parameter-loading errors**: if loading the pretrained Q-former under Zero3 fails (e.g., with a shape mismatch), you need a custom loading function. Check whether the model parameters carry a `ds_id` attribute to determine whether Zero3 optimization is enabled; if so, the `state_dict` must be loaded with Zero3-specific logic rather than a plain `load_state_dict`. ([source](https://github.com/JIA-Lab-research/LLaMA-VID/issues/75))

**Q: The demo page or inference crashes without any error message for videos longer than 1 minute; also, how do the model names correspond?**

A: 1. **Model mapping**: `llama-vid-vicuna-7b-short` and `llama-vid-7b-full-224-video-fps-1` are the same model; only the folder name changed. If you need the short-video model, download the latter.
2. **Crashes**: crashes on long videos are usually caused by GPU out-of-memory or default configuration limits. Check that you are using the model variant optimized for long videos (the `long-video` version) or adjust the frame sampling strategy. ([source](https://github.com/JIA-Lab-research/LLaMA-VID/issues/25))

**Q: Which JSON file does stage 2 finetuning use by default, and why are videos capped at 5 minutes?**

A: Stage 2 finetuning typically uses `llava_v1_5_mix665k_with_video_chatgpt_maxtime_5min.json` by default. Capping videos at 5 minutes mainly keeps GPU memory usage and training time under control, ensuring the model can converge on limited resources while still covering most common short-video understanding scenarios. ([source](https://github.com/JIA-Lab-research/LLaMA-VID/issues/22))
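The interpolation recipe from the visualization FAQ above can be sketched end to end. Real code would call `torch.nn.functional.interpolate` on a tensor; this dependency-free version uses nearest-neighbor upsampling just to make the shape bookkeeping concrete (the 24×24 grid comes from 576 = 24²), and the helper name is hypothetical:

```python
# Sketch of the FAQ recipe: average the (32, 576) image scores over the
# first dimension, view the 576 values as a 24x24 grid, then upsample to
# the image size. Nearest-neighbor here; torch's F.interpolate also offers
# bilinear modes for smoother heatmaps.
def scores_to_heatmap(scores, out_h, out_w, grid=24):
    """scores: 32 lists of 576 floats (batch dim already dropped)."""
    n = len(scores)
    mean = [sum(col) / n for col in zip(*scores)]                     # (576,)
    patch = [mean[r * grid:(r + 1) * grid] for r in range(grid)]      # (24, 24)
    return [
        [patch[r * grid // out_h][c * grid // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]
```

Overlaying the resulting grid on the input image (e.g., with matplotlib's `imshow` and an `alpha` blend) then shows which patches received the highest scores.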