[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-haotian-liu--LLaVA":3,"tool-haotian-liu--LLaVA":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 
多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":79,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":110,"forks":111,"last_commit_at":112,"license":113,"difficulty_score":10,"env_os":114,"env_gpu":115,"env_ram":116,"env_deps":117,"category_tags":130,"github_topics":131,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":145,"updated_at":146,"faqs":147,"releases":177},3720,"haotian-liu\u002FLLaVA","LLaVA","[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.","LLaVA（Large Language and Vision Assistant）是一款开源的多模态人工智能模型，旨在让大语言模型具备理解和分析图像的能力。它通过“视觉指令微调”技术，将强大的语言模型与视觉编码器相结合，使用户能够像与 GPT-4V 交互一样，直接上传图片并询问关于图片内容的问题，例如描述场景、识别物体或解答图表数据。\n\nLLaVA 主要解决了传统大语言模型无法直接“看”图的局限，打破了文本与视觉信息之间的壁垒。它无需复杂的专用训练，即可在零样本情况下展现出惊人的图像理解力，甚至在部分基准测试中超越了早期的商业闭源模型。其独特的技术亮点在于高效的视觉指令微调方法，以及最新 LLaVA-NeXT 版本对更高分辨率图像、视频任务的支持，并能灵活适配 Llama-3、Qwen 等主流基座模型。\n\n这款工具非常适合 AI 研究人员探索多模态前沿技术，也适合开发者将其集成到需要图像理解功能的应用程序中。同时，由于社区提供了便捷的演示和部署方案，对多模态交互感兴趣的普通用户和技术爱好者也能轻松体验其能力。无论是构建智能助手、自动化数据分析，还是进行学术研究，LLaVA 都提供了一","LLaVA（Large Language and Vision Assistant）是一款开源的多模态人工智能模型，旨在让大语言模型具备理解和分析图像的能力。它通过“视觉指令微调”技术，将强大的语言模型与视觉编码器相结合，使用户能够像与 GPT-4V 交互一样，直接上传图片并询问关于图片内容的问题，例如描述场景、识别物体或解答图表数据。\n\nLLaVA 主要解决了传统大语言模型无法直接“看”图的局限，打破了文本与视觉信息之间的壁垒。它无需复杂的专用训练，即可在零样本情况下展现出惊人的图像理解力，甚至在部分基准测试中超越了早期的商业闭源模型。其独特的技术亮点在于高效的视觉指令微调方法，以及最新 LLaVA-NeXT 版本对更高分辨率图像、视频任务的支持，并能灵活适配 Llama-3、Qwen 等主流基座模型。\n\n这款工具非常适合 AI 研究人员探索多模态前沿技术，也适合开发者将其集成到需要图像理解功能的应用程序中。同时，由于社区提供了便捷的演示和部署方案，对多模态交互感兴趣的普通用户和技术爱好者也能轻松体验其能力。无论是构建智能助手、自动化数据分析，还是进行学术研究，LLaVA 都提供了一个强大且开放的基石。","# 🌋 LLaVA: Large Language and Vision Assistant\n\n*Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.*\n\n[📢 [LLaVA-NeXT Blog](https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-01-30-llava-next\u002F)] [[Project Page](https:\u002F\u002Fllava-vl.github.io\u002F)] [[Demo](https:\u002F\u002Fllava.hliu.cc\u002F)]  [[Data](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FData.md)] [[Model Zoo](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)]\n\n🤝Community Contributions: [[llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002Fpull\u002F3436)] 
[[Colab](https:\u002F\u002Fgithub.com\u002Fcamenduru\u002FLLaVA-colab)] [[🤗Space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fbadayvedat\u002FLLaVA)] [[Replicate](https:\u002F\u002Freplicate.com\u002Fyorickvp\u002Fllava-13b)] [[AutoGen](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fautogen\u002Fblob\u002Fmain\u002Fnotebook\u002Fagentchat_lmm_llava.ipynb)]  [[BakLLaVA](https:\u002F\u002Fgithub.com\u002FSkunkworksAI\u002FBakLLaVA)]\n\n**Improved Baselines with Visual Instruction Tuning** [[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.03744)] [[HF](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2310.03744)] \u003Cbr>\n[Haotian Liu](https:\u002F\u002Fhliu.cc), [Chunyuan Li](https:\u002F\u002Fchunyuan.li\u002F), [Yuheng Li](https:\u002F\u002Fyuheng-li.github.io\u002F), [Yong Jae Lee](https:\u002F\u002Fpages.cs.wisc.edu\u002F~yongjaelee\u002F)\n\n**Visual Instruction Tuning** (NeurIPS 2023, **Oral**) [[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08485)] [[HF](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2304.08485)] \u003Cbr>\n[Haotian Liu*](https:\u002F\u002Fhliu.cc), [Chunyuan Li*](https:\u002F\u002Fchunyuan.li\u002F), [Qingyang Wu](https:\u002F\u002Fscholar.google.ca\u002Fcitations?user=HDiw-TsAAAAJ&hl=en\u002F), [Yong Jae Lee](https:\u002F\u002Fpages.cs.wisc.edu\u002F~yongjaelee\u002F) (*Equal Contribution)\n\n\u003C!--p align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fllava.hliu.cc\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhaotian-liu_LLaVA_readme_6925f42ff9ac.png\" width=\"50%\">\u003C\u002Fa> \u003Cbr>\n    Generated by \u003Ca href=\"https:\u002F\u002Fgligen.github.io\u002F\">GLIGEN\u003C\u002Fa> via \"a cute lava llama with glasses\" and box prompt\n\u003C\u002Fp-->\n\n\n## Release\n\n- [2024\u002F05\u002F10] 🔥 **LLaVA-NeXT** (Stronger) models are released, stronger LMM with support of LLama-3 (8B) and Qwen-1.5 (72B\u002F110B). [[Blog](https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-05-10-llava-next-stronger-llms\u002F)] [[Checkpoints](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flmms-lab\u002Fllava-next-6623288e2d61edba3ddbf5ff)] [[Demo](https:\u002F\u002Fllava-next.lmms-lab.com\u002F)] [[Code](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-NeXT\u002F)] \n- [2024\u002F05\u002F10] 🔥 **LLaVA-NeXT** (Video) is released. The image-only-trained LLaVA-NeXT model is surprisingly strong on video tasks with zero-shot modality transfer. DPO training with AI feedback on videos can yield significant improvement. [[Blog](https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-04-30-llava-next-video\u002F)] [[Checkpoints](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flmms-lab\u002Fllava-next-video-661e86f5e8dabc3ff793c944)] [[Code](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-NeXT\u002F)]\n- [03\u002F10] Releasing **LMMs-Eval**, a highly efficient evaluation pipeline we used when developing LLaVA-NeXT. It supports the evaluation of LMMs on dozens of public datasets and allows new dataset onboarding, making the dev of new LMMs much faster. [[Blog](https:\u002F\u002Flmms-lab.github.io\u002Flmms-eval-blog\u002Flmms-eval-0.1\u002F)] [[Codebase](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval)]\n- [1\u002F30] 🔥 **LLaVA-NeXT** (LLaVA-1.6) is out! With additional scaling to LLaVA-1.5, LLaVA-NeXT-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks\u002Fapplications than before. 
Check out the [blog post](https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-01-30-llava-next\u002F), and explore the [demo](https:\u002F\u002Fllava.hliu.cc\u002F)! Models are available in [Model Zoo](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md). Training\u002Feval data and scripts coming soon.\n- [11\u002F10] [LLaVA-Plus](https:\u002F\u002Fllava-vl.github.io\u002Fllava-plus\u002F) is released: Learning to Use Tools for Creating Multimodal Agents, with LLaVA-Plus (LLaVA that Plug and Learn to Use Skills). [[Project Page](https:\u002F\u002Fllava-vl.github.io\u002Fllava-plus\u002F)] [[Demo](https:\u002F\u002Fllavaplus.ngrok.io\u002F)] [[Code](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Plus-Codebase)] [[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.05437)]\n- [11\u002F2] [LLaVA-Interactive](https:\u002F\u002Fllava-vl.github.io\u002Fllava-interactive\u002F) is released: Experience the future of human-AI multimodal interaction with an all-in-one demo for Image Chat, Segmentation, Generation and Editing. [[Project Page](https:\u002F\u002Fllava-vl.github.io\u002Fllava-interactive\u002F)] [[Demo](https:\u002F\u002Fllavainteractive.ngrok.io\u002F)] [[Code](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Interactive-Demo)] [[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00571)]\n- [10\u002F26] 🔥 LLaVA-1.5 with LoRA achieves comparable performance as full-model finetuning, with a reduced GPU RAM requirement ([ckpts](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md#llava-v15), [script](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA#train)). We also provide a [doc](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FFinetune_Custom_Data.md) on how to finetune LLaVA-1.5 on your own dataset with LoRA.\n- [10\u002F12] Check out the Korean LLaVA (Ko-LLaVA), created by ETRI, who has generously supported our research! [[🤗 Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fetri-vilab\u002FKo-LLaVA)]\n- [10\u002F5] 🔥 LLaVA-1.5 is out! Achieving SoTA on 11 benchmarks, with just simple modifications to the original LLaVA, utilizes all public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the [technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.03744), and explore the [demo](https:\u002F\u002Fllava.hliu.cc\u002F)! Models are available in [Model Zoo](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md). The training data and scripts of LLaVA-1.5 are released [here](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA#train), and evaluation scripts are released [here](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FEvaluation.md)!\n- [9\u002F26] LLaVA is improved with reinforcement learning from human feedback (RLHF) to improve fact grounding and reduce hallucination. 
Check out the new SFT and RLHF checkpoints at project [[LLavA-RLHF]](https:\u002F\u002Fllava-rlhf.github.io\u002F)\n- [9\u002F22] [LLaVA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08485) is accepted by NeurIPS 2023 as **oral presentation**, and [LLaVA-Med](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00890) is accepted by NeurIPS 2023 Datasets and Benchmarks Track as **spotlight presentation**.\n\n\u003Cdetails>\n\u003Csummary>More\u003C\u002Fsummary>\n\n- [11\u002F6] Support **Intel** dGPU and CPU platforms. [More details here.](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Ftree\u002Fintel\u002Fdocs\u002Fintel)\n- [10\u002F12] LLaVA is now supported in [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002Fpull\u002F3436) with 4-bit \u002F 5-bit quantization support!\n- [10\u002F11] The training data and scripts of LLaVA-1.5 are released [here](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA#train), and evaluation scripts are released [here](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FEvaluation.md)!\n- [10\u002F10] [Roboflow Deep Dive](https:\u002F\u002Fblog.roboflow.com\u002Ffirst-impressions-with-llava-1-5\u002F): First Impressions with LLaVA-1.5.\n- [9\u002F20] We summarize our empirical study of training 33B and 65B LLaVA models in a [note](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.09958). Further, if you are interested in the comprehensive review, evolution and trend of multimodal foundation models, please check out our recent survey paper [``Multimodal Foundation Models: From Specialists to General-Purpose Assistants''.](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.10020)\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhaotian-liu_LLaVA_readme_54a10cc984e5.jpeg\" width=50%\u002F>\n\u003C\u002Fp>\n\n- [7\u002F19] 🔥 We release a major upgrade, including support for LLaMA-2, LoRA training, 4-\u002F8-bit inference, higher resolution (336x336), and a lot more. We release [LLaVA Bench](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FLLaVA_Bench.md) for benchmarking open-ended visual chat with results from Bard and Bing-Chat. We also support and verify training with RTX 3090 and RTX A6000. Check out [LLaVA-from-LLaMA-2](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FLLaVA_from_LLaMA2.md), and our [model zoo](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)!\n- [6\u002F26] [CVPR 2023 Tutorial](https:\u002F\u002Fvlp-tutorial.github.io\u002F) on **Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4**!  Please check out [[Slides](https:\u002F\u002Fdatarelease.blob.core.windows.net\u002Ftutorial\u002Fvision_foundation_models_2023\u002Fslides\u002FChunyuan_cvpr2023_tutorial_lmm.pdf)] [[Notes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.14895)] [[YouTube](https:\u002F\u002Fyoutu.be\u002FmkI7EPD1vp8)] [[Bilibli](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Ng4y1T7v3\u002F)].\n- [6\u002F11] We released the preview for the most requested feature: DeepSpeed and LoRA support!  Please see documentations [here](.\u002Fdocs\u002FLoRA.md).\n- [6\u002F1] We released **LLaVA-Med: Large Language and Vision Assistant for Biomedicine**, a step towards building biomedical domain large language and vision models with GPT-4 level capabilities.  
Checkout the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00890) and [page](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLaVA-Med).\n- [5\u002F6] We are releasing [LLaVA-Lighting-MPT-7B-preview](https:\u002F\u002Fhuggingface.co\u002Fliuhaotian\u002FLLaVA-Lightning-MPT-7B-preview), based on MPT-7B-Chat!  See [here](#LLaVA-MPT-7b) for more details.\n- [5\u002F2] 🔥 We are releasing LLaVA-Lighting!  Train a lite, multimodal GPT-4 with just $40 in 3 hours!  See [here](#train-llava-lightning) for more details.\n- [4\u002F27] Thanks to the community effort, LLaVA-13B with 4-bit quantization allows you to run on a GPU with as few as 12GB VRAM!  Try it out [here](https:\u002F\u002Fgithub.com\u002Foobabooga\u002Ftext-generation-webui\u002Ftree\u002Fmain\u002Fextensions\u002Fllava).\n- [4\u002F17] 🔥 We released **LLaVA: Large Language and Vision Assistant**. We propose visual instruction tuning, towards building large language and vision models with GPT-4 level capabilities.  Checkout the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08485) and [demo](https:\u002F\u002Fllava.hliu.cc\u002F).\n\n\u003C\u002Fdetails>\n\n\u003C!-- \u003Ca href=\"https:\u002F\u002Fllava.hliu.cc\u002F\">\u003Cimg src=\"assets\u002Fdemo.gif\" width=\"70%\">\u003C\u002Fa> -->\n\n[![Code License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-Apache_2.0-green.svg)](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002FLICENSE)\n**Usage and License Notices**: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the [OpenAI Terms of Use](https:\u002F\u002Fopenai.com\u002Fpolicies\u002Fterms-of-use) for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. [Llama community license](https:\u002F\u002Fai.meta.com\u002Fllama\u002Flicense\u002F) for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.\n\n\n## Contents\n- [Install](#install)\n- [LLaVA Weights](#llava-weights)\n- [Demo](#Demo)\n- [Model Zoo](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)\n- [Dataset](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FData.md)\n- [Train](#train)\n- [Evaluation](#evaluation)\n\n## Install\n\nIf you are not using Linux, do *NOT* proceed, see instructions for [macOS](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FmacOS.md) and [Windows](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FWindows.md).\n\n1. Clone this repository and navigate to LLaVA folder\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA.git\ncd LLaVA\n```\n\n2. Install Package\n```Shell\nconda create -n llava python=3.10 -y\nconda activate llava\npip install --upgrade pip  # enable PEP 660 support\npip install -e .\n```\n\n3. 
Install additional packages for training cases\n```\npip install -e \".[train]\"\npip install flash-attn --no-build-isolation\n```\n\n### Upgrade to latest code base\n\n```Shell\ngit pull\npip install -e .\n\n# if you see some import errors when you upgrade,\n# please try running the command below (without #)\n# pip install flash-attn --no-build-isolation --no-cache-dir\n```\n\n### Quick Start With HuggingFace\n\n\u003Cdetails>\n\u003Csummary>Example Code\u003C\u002Fsummary>\n\n```Python\nfrom llava.model.builder import load_pretrained_model\nfrom llava.mm_utils import get_model_name_from_path\nfrom llava.eval.run_llava import eval_model\n\nmodel_path = \"liuhaotian\u002Fllava-v1.5-7b\"\n\ntokenizer, model, image_processor, context_len = load_pretrained_model(\n    model_path=model_path,\n    model_base=None,\n    model_name=get_model_name_from_path(model_path)\n)\n```\n\nCheck out the details wth the `load_pretrained_model` function in `llava\u002Fmodel\u002Fbuilder.py`.\n\nYou can also use the `eval_model` function in `llava\u002Feval\u002Frun_llava.py` to get the output easily. By doing so, you can use this code on Colab directly after downloading this repository.\n\n``` python\nmodel_path = \"liuhaotian\u002Fllava-v1.5-7b\"\nprompt = \"What are the things I should be cautious about when I visit here?\"\nimage_file = \"https:\u002F\u002Fllava-vl.github.io\u002Fstatic\u002Fimages\u002Fview.jpg\"\n\nargs = type('Args', (), {\n    \"model_path\": model_path,\n    \"model_base\": None,\n    \"model_name\": get_model_name_from_path(model_path),\n    \"query\": prompt,\n    \"conv_mode\": None,\n    \"image_file\": image_file,\n    \"sep\": \",\",\n    \"temperature\": 0,\n    \"top_p\": None,\n    \"num_beams\": 1,\n    \"max_new_tokens\": 512\n})()\n\neval_model(args)\n```\n\u003C\u002Fdetails>\n\n## LLaVA Weights\nPlease check out our [Model Zoo](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md) for all public LLaVA checkpoints, and the instructions of how to use the weights.\n\n## Demo\n\n### Gradio Web UI\n\nTo launch a Gradio demo locally, please run the following commands one by one. 
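Supplementing the HuggingFace quick start above (and placed here just before the multi-process Gradio demo steps), the sketch below wraps that `eval_model` pattern in a reusable helper. `describe_image` and its defaults are illustrative names, not part of the repository, and a local image path is assumed to work the same way as the URL used in the quick start and in the CLI inference section.

```python
# Minimal sketch (not part of the repo): reuse the eval_model() pattern from the
# HuggingFace quick start to query one image from a plain function call.
from llava.eval.run_llava import eval_model
from llava.mm_utils import get_model_name_from_path


def describe_image(image_file: str,
                   query: str = "Describe this image in detail.",
                   model_path: str = "liuhaotian/llava-v1.5-7b"):
    """Build the same ad-hoc args object used in the quick start and run eval_model.

    image_file may be a URL as in the quick start; a local path is assumed to work
    the same way, as with --image-file in the CLI inference section.
    """
    args = type("Args", (), {
        "model_path": model_path,
        "model_base": None,
        "model_name": get_model_name_from_path(model_path),
        "query": query,
        "conv_mode": None,
        "image_file": image_file,
        "sep": ",",
        "temperature": 0,
        "top_p": None,
        "num_beams": 1,
        "max_new_tokens": 512,
    })()
    # eval_model runs the query; the quick start above calls it for its printed output.
    eval_model(args)


# Example, reusing the image URL from the quick start above:
describe_image("https://llava-vl.github.io/static/images/view.jpg")
```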
If you plan to launch multiple model workers to compare between different checkpoints, you only need to launch the controller and the web server *ONCE*.\n\n```mermaid\nflowchart BT\n    %% Declare Nodes\n    gws(\"Gradio (UI Server)\")\n    c(\"Controller (API Server):\u003Cbr\u002F>PORT: 10000\")\n    mw7b(\"Model Worker:\u003Cbr\u002F>llava-v1.5-7b\u003Cbr\u002F>PORT: 40000\")\n    mw13b(\"Model Worker:\u003Cbr\u002F>llava-v1.5-13b\u003Cbr\u002F>PORT: 40001\")\n    sglw13b(\"SGLang Backend:\u003Cbr\u002F>llava-v1.6-34b\u003Cbr\u002F>http:\u002F\u002Flocalhost:30000\")\n    lsglw13b(\"SGLang Worker:\u003Cbr\u002F>llava-v1.6-34b\u003Cbr\u002F>PORT: 40002\")\n\n    %% Declare Styles\n    classDef data fill:#3af,stroke:#48a,stroke-width:2px,color:#444\n    classDef success fill:#8f8,stroke:#0a0,stroke-width:2px,color:#444\n    classDef failure fill:#f88,stroke:#f00,stroke-width:2px,color:#444\n\n    %% Assign Styles\n    class id,od data;\n    class cimg,cs_s,scsim_s success;\n    class ncimg,cs_f,scsim_f failure;\n\n    subgraph Demo Connections\n        direction BT\n        c\u003C-->gws\n        \n        mw7b\u003C-->c\n        mw13b\u003C-->c\n        lsglw13b\u003C-->c\n        sglw13b\u003C-->lsglw13b\n    end\n```\n\n#### Launch a controller\n```Shell\npython -m llava.serve.controller --host 0.0.0.0 --port 10000\n```\n\n#### Launch a gradio web server.\n```Shell\npython -m llava.serve.gradio_web_server --controller http:\u002F\u002Flocalhost:10000 --model-list-mode reload\n```\nYou just launched the Gradio web interface. Now, you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list. Do not worry, as we have not launched any model worker yet. It will be automatically updated when you launch a model worker.\n\n#### Launch a SGLang worker\n\nThis is the recommended way to serve LLaVA model with high throughput, and you need to install SGLang first. Note that currently `4-bit` quantization is not supported yet on SGLang-LLaVA, and if you have limited GPU VRAM, please check out model worker with [quantization](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA?tab=readme-ov-file#launch-a-model-worker-4-bit-8-bit-inference-quantized).\n\n```Shell\npip install \"sglang[all]\"\n```\n\nYou'll first launch a SGLang backend worker which will execute the models on GPUs. Remember the `--port` you've set and you'll use that later.\n\n```Shell\n# Single GPU\nCUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server --model-path liuhaotian\u002Fllava-v1.5-7b --tokenizer-path llava-hf\u002Fllava-1.5-7b-hf --port 30000\n\n# Multiple GPUs with tensor parallel\nCUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --model-path liuhaotian\u002Fllava-v1.5-13b --tokenizer-path llava-hf\u002Fllava-1.5-13b-hf --port 30000 --tp 2\n```\n\nTokenizers (temporary): `llava-hf\u002Fllava-1.5-7b-hf`, `llava-hf\u002Fllava-1.5-13b-hf`, `liuhaotian\u002Fllava-v1.6-34b-tokenizer`.\n\nYou'll then launch a LLaVA-SGLang worker that will communicate between LLaVA controller and SGLang backend to route the requests. Set `--sgl-endpoint` to `http:\u002F\u002F127.0.0.1:port` where `port` is the one you just set (default: 30000).\n\n```Shell\npython -m llava.serve.sglang_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --sgl-endpoint http:\u002F\u002F127.0.0.1:30000\n```\n\n#### Launch a model worker\n\nThis is the actual *worker* that performs the inference on the GPU.  
Each worker is responsible for a single model specified in `--model-path`.\n\n```Shell\npython -m llava.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path liuhaotian\u002Fllava-v1.5-13b\n```\nWait until the process finishes loading the model and you see \"Uvicorn running on ...\".  Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.\n\nYou can launch as many workers as you want, and compare between different model checkpoints in the same Gradio interface. Please keep the `--controller` the same, and modify the `--port` and `--worker` to a different port number for each worker.\n```Shell\npython -m llava.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port \u003Cdifferent from 40000, say 40001> --worker http:\u002F\u002Flocalhost:\u003Cchange accordingly, i.e. 40001> --model-path \u003Cckpt2>\n```\n\nIf you are using an Apple device with an M1 or M2 chip, you can specify the mps device by using the `--device` flag: `--device mps`.\n\n#### Launch a model worker (Multiple GPUs, when GPU VRAM \u003C= 24GB)\n\nIf the VRAM of your GPU is less than 24GB (e.g., RTX 3090, RTX 4090, etc.), you may try running it with multiple GPUs. Our latest code base will automatically try to use multiple GPUs if you have more than one GPU. You can specify which GPUs to use with `CUDA_VISIBLE_DEVICES`. Below is an example of running with the first two GPUs.\n\n```Shell\nCUDA_VISIBLE_DEVICES=0,1 python -m llava.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path liuhaotian\u002Fllava-v1.5-13b\n```\n\n#### Launch a model worker (4-bit, 8-bit inference, quantized)\n\nYou can launch the model worker with quantized bits (4-bit, 8-bit), which allows you to run the inference with reduced GPU memory footprint, potentially allowing you to run on a GPU with as few as 12GB VRAM. Note that inference with quantized bits may not be as accurate as the full-precision model. Simply append `--load-4bit` or `--load-8bit` to the **model worker** command that you are executing. Below is an example of running with 4-bit quantization.\n\n```Shell\npython -m llava.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path liuhaotian\u002Fllava-v1.5-13b --load-4bit\n```\n\n#### Launch a model worker (LoRA weights, unmerged)\n\nYou can launch the model worker with LoRA weights, without merging them with the base checkpoint, to save disk space. There will be additional loading time, while the inference speed is the same as the merged checkpoints. Unmerged LoRA checkpoints do not have `lora-merge` in the model name, and are usually much smaller (less than 1GB) than the merged checkpoints (13G for 7B, and 25G for 13B).\n\nTo load unmerged LoRA weights, you simply need to pass an additional argument `--model-base`, which is the base LLM that is used to train the LoRA weights. 
You can check the base LLM of each LoRA weights in the [model zoo](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md).\n\n```Shell\npython -m llava.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path liuhaotian\u002Fllava-v1-0719-336px-lora-vicuna-13b-v1.3 --model-base lmsys\u002Fvicuna-13b-v1.3\n```\n\n### CLI Inference\n\nChat about images using LLaVA without the need of Gradio interface. It also supports multiple GPUs, 4-bit and 8-bit quantized inference. With 4-bit quantization, for our LLaVA-1.5-7B, it uses less than 8GB VRAM on a single GPU.\n\n```Shell\npython -m llava.serve.cli \\\n    --model-path liuhaotian\u002Fllava-v1.5-7b \\\n    --image-file \"https:\u002F\u002Fllava-vl.github.io\u002Fstatic\u002Fimages\u002Fview.jpg\" \\\n    --load-4bit\n```\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhaotian-liu_LLaVA_readme_972b9db45c90.gif\" width=\"70%\">\n\n## Train\n\n*Below is the latest training configuration for LLaVA v1.5. For legacy models, please refer to README of [this](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Ftree\u002Fv1.0.1) version for now. We'll add them in a separate doc later.*\n\nLLaVA training consists of two stages: (1) feature alignment stage: use our 558K subset of the LAION-CC-SBU dataset to connect a *frozen pretrained* vision encoder to a *frozen LLM*; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data, plus around 515K VQA data from academic-oriented tasks, to teach the model to follow multimodal instructions.\n\nLLaVA is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.\n\n### Hyperparameters\nWe use a similar set of hyperparameters as Vicuna in finetuning.  Both hyperparameters used in pretraining and finetuning are provided below.\n\n1. Pretraining\n\n| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |\n| --- | ---: | ---: | ---: | ---: | ---: |\n| LLaVA-v1.5-13B | 256 | 1e-3 | 1 | 2048 | 0 |\n\n2. Finetuning\n\n| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |\n| --- | ---: | ---: | ---: | ---: | ---: |\n| LLaVA-v1.5-13B | 128 | 2e-5 | 1 | 2048 | 0 |\n\n### Download Vicuna checkpoints (automatically)\n\nOur base model Vicuna v1.5, which is an instruction-tuned chatbot, will be downloaded automatically when you run our provided training scripts. No action is needed.\n\n### Pretrain (feature alignment)\n\nPlease download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fliuhaotian\u002FLLaVA-Pretrain).\n\nPretrain takes around 5.5 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. 
It takes around 3.5 hours for LLaVA-v1.5-7B.\n\nTraining script with DeepSpeed ZeRO-2: [`pretrain.sh`](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fscripts\u002Fv1_5\u002Fpretrain.sh).\n\n- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.\n- `--vision_tower openai\u002Fclip-vit-large-patch14-336`: CLIP ViT-L\u002F14 336px.\n\n\u003Cdetails>\n\u003Csummary>Pretrain takes around 20 hours for LLaVA-7B on 8x V100 (32G)\u003C\u002Fsummary>\n\n We provide training script with DeepSpeed [here](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fscripts\u002Fpretrain_xformers.sh).\nTips:\n- If you are using V100 which is not supported by FlashAttention, you can use the [memory-efficient attention](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.05682) implemented in [xFormers](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fxformers). Install xformers and replace `llava\u002Ftrain\u002Ftrain_mem.py` above with [llava\u002Ftrain\u002Ftrain_xformers.py](llava\u002Ftrain\u002Ftrain_xformers.py).\n\u003C\u002Fdetails>\n\n### Visual Instruction Tuning\n\n1. Prepare data\n\nPlease download the annotation of the final mixture our instruction tuning data [llava_v1_5_mix665k.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fliuhaotian\u002FLLaVA-Instruct-150K\u002Fblob\u002Fmain\u002Fllava_v1_5_mix665k.json), and download the images from constituting datasets:\n\n- COCO: [train2017](http:\u002F\u002Fimages.cocodataset.org\u002Fzips\u002Ftrain2017.zip)\n- GQA: [images](https:\u002F\u002Fdownloads.cs.stanford.edu\u002Fnlp\u002Fdata\u002Fgqa\u002Fimages.zip)\n- OCR-VQA: [download script](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing), **we save all files as `.jpg`**\n- TextVQA: [train_val_images](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Ftextvqa\u002Fimages\u002Ftrain_val_images.zip)\n- VisualGenome: [part1](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Frak248\u002FVG_100K_2\u002Fimages.zip), [part2](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Frak248\u002FVG_100K_2\u002Fimages2.zip)\n\nAfter downloading all of them, organize the data as follows in `.\u002Fplayground\u002Fdata`,\n\n```\n├── coco\n│   └── train2017\n├── gqa\n│   └── images\n├── ocr_vqa\n│   └── images\n├── textvqa\n│   └── train_images\n└── vg\n    ├── VG_100K\n    └── VG_100K_2\n```\n\n2. Start training!\n\nYou may download our pretrained projectors in [Model Zoo](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md). It is not recommended to use legacy projectors, as they may be trained with a different version of the codebase, and if any option is off, the model will not function\u002Ftrain as we expected.\n\nVisual instruction tuning takes around 20 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 10 hours for LLaVA-v1.5-7B on 8x A100 (40G).\n\nTraining script with DeepSpeed ZeRO-3: [`finetune.sh`](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fscripts\u002Fv1_5\u002Ffinetune.sh).\n\nIf you are do not have enough GPU memory:\n\n- Use LoRA: [`finetune_lora.sh`](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fscripts\u002Fv1_5\u002Ffinetune_lora.sh). We are able to fit 13B training in 8-A100-40G\u002F8-A6000, and 7B training in 8-RTX3090. 
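To make the batch-size bookkeeping concrete when moving to fewer or smaller GPUs, here is a small worked example of the rule from the Train section (global batch = `per_device_train_batch_size` × `gradient_accumulation_steps` × `num_gpus`), using the 128 finetuning global batch from the hyperparameter table. The per-device values below are illustrative, not settings read from the released scripts.

```python
# Illustrative sketch only: keep the effective (global) batch size fixed when the
# GPU count or per-device batch size changes. Values are examples, not taken from
# finetune.sh / finetune_lora.sh.
TARGET_GLOBAL_BATCH = 128  # finetuning global batch size from the hyperparameter table


def grad_accum_steps(per_device_train_batch_size: int, num_gpus: int,
                     target: int = TARGET_GLOBAL_BATCH) -> int:
    """gradient_accumulation_steps needed so that
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus == target."""
    per_step = per_device_train_batch_size * num_gpus
    assert target % per_step == 0, "pick a per-device batch size that divides the target"
    return target // per_step


print(grad_accum_steps(16, num_gpus=8))  # 1  (e.g. an 8-GPU node)
print(grad_accum_steps(16, num_gpus=4))  # 2  (half the GPUs, same global batch)
print(grad_accum_steps(4, num_gpus=8))   # 4  (smaller per-GPU batch on memory-limited cards)
```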
Make sure `per_device_train_batch_size*gradient_accumulation_steps` is the same as the provided script for best reproducibility.\n- Replace `zero3.json` with `zero3_offload.json` which offloads some parameters to CPU RAM. This slows down the training speed.\n\nIf you are interested in finetuning LLaVA model to your own task\u002Fdata, please check out [`Finetune_Custom_Data.md`](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FFinetune_Custom_Data.md)。\n\nNew options to note:\n\n- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.\n- `--vision_tower openai\u002Fclip-vit-large-patch14-336`: CLIP ViT-L\u002F14 336px.\n- `--image_aspect_ratio pad`: this pads the non-square images to square, instead of cropping them; it slightly reduces hallucination.\n- `--group_by_modality_length True`: this should only be used when your instruction tuning dataset contains both language (e.g. ShareGPT) and multimodal (e.g. LLaVA-Instruct). It makes the training sampler only sample a single modality (either image or language) during training, which we observe to speed up training by ~25%, and does not affect the final outcome.\n\n## Evaluation\n\nIn LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure the reproducibility, we evaluate the models with greedy decoding. We do not evaluate using beam search to make the inference process consistent with the chat demo of real-time outputs.\n\nSee [Evaluation.md](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FEvaluation.md).\n\n### GPT-assisted Evaluation\n\nOur GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models.  Please see our paper for more details.\n\n1. Generate LLaVA responses\n\n```Shell\npython model_vqa.py \\\n    --model-path .\u002Fcheckpoints\u002FLLaVA-13B-v0 \\\n    --question-file \\\n    playground\u002Fdata\u002Fcoco2014_val_qa_eval\u002Fqa90_questions.jsonl \\\n    --image-folder \\\n    \u002Fpath\u002Fto\u002Fcoco2014_val \\\n    --answers-file \\\n    \u002Fpath\u002Fto\u002Fanswer-file-our.jsonl\n```\n\n2. Evaluate the generated responses.  In our case, [`answer-file-ref.jsonl`](.\u002Fplayground\u002Fdata\u002Fcoco2014_val_qa_eval\u002Fqa90_gpt4_answer.jsonl) is the response generated by text-only GPT-4 (0314), with the context captions\u002Fboxes provided.\n\n```Shell\nOPENAI_API_KEY=\"sk-***********************************\" python llava\u002Feval\u002Feval_gpt_review_visual.py \\\n    --question playground\u002Fdata\u002Fcoco2014_val_qa_eval\u002Fqa90_questions.jsonl \\\n    --context llava\u002Feval\u002Ftable\u002Fcaps_boxes_coco2014_val_80.jsonl \\\n    --answer-list \\\n    \u002Fpath\u002Fto\u002Fanswer-file-ref.jsonl \\\n    \u002Fpath\u002Fto\u002Fanswer-file-our.jsonl \\\n    --rule llava\u002Feval\u002Ftable\u002Frule.json \\\n    --output \u002Fpath\u002Fto\u002Freview.json\n```\n\n3. 
Summarize the evaluation results\n\n```Shell\npython summarize_gpt_review.py\n```\n\n## Citation\n\nIf you find LLaVA useful for your research and applications, please cite using this BibTeX:\n```bibtex\n@misc{liu2024llavanext,\n    title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},\n    url={https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-01-30-llava-next\u002F},\n    author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},\n    month={January},\n    year={2024}\n}\n\n@misc{liu2023improvedllava,\n      title={Improved Baselines with Visual Instruction Tuning}, \n      author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},\n      publisher={arXiv:2310.03744},\n      year={2023},\n}\n\n@misc{liu2023llava,\n      title={Visual Instruction Tuning}, \n      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},\n      publisher={NeurIPS},\n      year={2023},\n}\n```\n\n## Acknowledgement\n\n- [Vicuna](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat): the codebase we built upon, and our base model Vicuna-13B that has the amazing language capabilities!\n\n## Related Projects\n\n- [Instruction Tuning with GPT-4](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM)\n- [LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLaVA-Med)\n- [Otter: In-Context Multi-Modal Instruction Tuning](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter)\n\nFor future project ideas, please check out:\n- [SEEM: Segment Everything Everywhere All at Once](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once)\n- [Grounded-Segment-Anything](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGrounded-Segment-Anything) to detect, segment, and generate anything by marrying [Grounding DINO](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGroundingDINO) and [Segment-Anything](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything).\n","# 🌋 LLaVA：大型语言与视觉助手\n\n*面向具备 GPT-4 级别能力的大型语言和视觉模型的视觉指令微调。*\n\n[📢 [LLaVA-NeXT 博客](https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-01-30-llava-next\u002F)] [[项目页面](https:\u002F\u002Fllava-vl.github.io\u002F)] [[演示](https:\u002F\u002Fllava.hliu.cc\u002F)]  [[数据](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FData.md)] [[模型库](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)]\n\n🤝社区贡献：[[llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002Fpull\u002F3436)] [[Colab](https:\u002F\u002Fgithub.com\u002Fcamenduru\u002FLLaVA-colab)] [[🤗Space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fbadayvedat\u002FLLaVA)] [[Replicate](https:\u002F\u002Freplicate.com\u002Fyorickvp\u002Fllava-13b)] [[AutoGen](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fautogen\u002Fblob\u002Fmain\u002Fnotebook\u002Fagentchat_lmm_llava.ipynb)]  [[BakLLaVA](https:\u002F\u002Fgithub.com\u002FSkunkworksAI\u002FBakLLaVA)]\n\n**通过视觉指令微调改进基线模型** [[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.03744)] [[HF](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2310.03744)] \u003Cbr>\n[刘浩天](https:\u002F\u002Fhliu.cc), [李春元](https:\u002F\u002Fchunyuan.li\u002F), [李宇恒](https:\u002F\u002Fyuheng-li.github.io\u002F), [李永宰](https:\u002F\u002Fpages.cs.wisc.edu\u002F~yongjaelee\u002F)\n\n**视觉指令微调**（NeurIPS 
2023，**口头报告**）[[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08485)] [[HF](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2304.08485)] \u003Cbr>\n[刘浩天*](https:\u002F\u002Fhliu.cc), [李春元*](https:\u002F\u002Fchunyuan.li\u002F), [吴庆阳](https:\u002F\u002Fscholar.google.ca\u002Fcitations?user=HDiw-TsAAAAJ&hl=en\u002F), [李永宰](https:\u002F\u002Fpages.cs.wisc.edu\u002F~yongjaelee\u002F) (*同等贡献*)\n\n\u003C!--p align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fllava.hliu.cc\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhaotian-liu_LLaVA_readme_6925f42ff9ac.png\" width=\"50%\">\u003C\u002Fa> \u003Cbr>\n    Generated by \u003Ca href=\"https:\u002F\u002Fgligen.github.io\u002F\">GLIGEN\u003C\u002Fa> via \"a cute lava llama with glasses\" and box prompt\n\u003C\u002Fp-->\n\n\n## 发布\n\n- [2024\u002F05\u002F10] 🔥 **LLaVA-NeXT**（更强）模型发布，支持 LLama-3（8B）和 Qwen-1.5（72B\u002F110B）的更强大的 LMM。[[博客](https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-05-10-llava-next-stronger-llms\u002F)] [[检查点](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flmms-lab\u002Fllava-next-6623288e2d61edba3ddbf5ff)] [[演示](https:\u002F\u002Fllava-next.lmms-lab.com\u002F)] [[代码](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-NeXT\u002F)]\n- [2024\u002F05\u002F10] 🔥 **LLaVA-NeXT**（视频）发布。仅基于图像训练的 LLaVA-NeXT 模型在视频任务上表现出惊人的强大性能，实现了零样本模态迁移。通过对视频进行 AI 反馈的 DPO 训练可以显著提升性能。[[博客](https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-04-30-llava-next-video\u002F)] [[检查点](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flmms-lab\u002Fllava-next-video-661e86f5e8dabc3ff793c944)] [[代码](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-NeXT\u002F)]\n- [03\u002F10] 发布 **LMMs-Eval**，这是我们开发 LLaVA-NeXT 时使用的高效评估流水线。它支持对数十个公开数据集上的 LMM 进行评估，并允许新数据集的接入，从而大大加快了新 LMM 的开发速度。[[博客](https:\u002F\u002Flmms-lab.github.io\u002Flmms-eval-blog\u002Flmms-eval-0.1\u002F)] [[代码库](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval)]\n- [1\u002F30] 🔥 **LLaVA-NeXT**（LLaVA-1.6）发布！通过对 LLaVA-1.5 进一步扩展，LLaVA-NeXT-34B 在某些基准测试中超越了 Gemini Pro。现在它可以处理 4 倍于以往的像素，并执行更多任务和应用。请查看 [博客文章](https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-01-30-llava-next\u002F) 并体验 [演示](https:\u002F\u002Fllava.hliu.cc\u002F)！模型可在 [模型库](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md) 中找到。训练\u002F评估数据和脚本将很快发布。\n- [11\u002F10] [LLaVA-Plus](https:\u002F\u002Fllava-vl.github.io\u002Fllava-plus\u002F) 发布：学习使用工具创建多模态智能体，配备 LLaVA-Plus（可插拔并学习技能的 LLaVA）。[[项目页面](https:\u002F\u002Fllava-vl.github.io\u002Fllava-plus\u002F)] [[演示](https:\u002F\u002Fllavaplus.ngrok.io\u002F)] [[代码](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Plus-Codebase)] [[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.05437)]\n- [11\u002F2] [LLaVA-Interactive](https:\u002F\u002Fllava-vl.github.io\u002Fllava-interactive\u002F) 发布：通过一体化演示体验人机多模态交互的未来，涵盖图像聊天、分割、生成和编辑等功能。[[项目页面](https:\u002F\u002Fllava-vl.github.io\u002Fllava-interactive\u002F)] [[演示](https:\u002F\u002Fllavainteractive.ngrok.io\u002F)] [[代码](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Interactive-Demo)] [[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00571)]\n- [10\u002F26] 🔥 使用 LoRA 的 LLaVA-1.5 获得了与全模型微调相当的性能，同时降低了 GPU 内存需求（[检查点](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md#llava-v15)，[脚本](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA#train)）。我们还提供了一份 
[文档](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FFinetune_Custom_Data.md)，介绍如何使用 LoRA 在您自己的数据集上对 LLaVA-1.5 进行微调。\n- [10\u002F12] 请查看由 ETRI 创造的韩语版 LLaVA（Ko-LLaVA），ETRI 慷慨地支持了我们的研究！[[🤗 演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fetri-vilab\u002FKo-LLaVA)]\n- [10\u002F5] 🔥 LLaVA-1.5 发布！在 11 个基准测试中达到 SOTA 水平，仅通过对原始 LLaVA 进行简单修改，便利用了所有公开数据，在单个 8-A100 节点上约 1 天内完成训练，超越了诸如 Qwen-VL-Chat 等使用数十亿规模数据的方法。请查看 [技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.03744)，并体验 [演示](https:\u002F\u002Fllava.hliu.cc\u002F)！模型可在 [模型库](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md) 中找到。LLaVA-1.5 的训练数据和脚本已在 [这里](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA#train) 发布，评估脚本则在 [这里](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FEvaluation.md) 发布！\n- [9\u002F26] LLaVA 通过人类反馈强化学习（RLHF）进行了改进，以提高事实准确性并减少幻觉现象。请在项目 [[LLavA-RLHF]](https:\u002F\u002Fllava-rlhf.github.io\u002F) 中查看新的 SFT 和 RLHF 检查点。\n- [9\u002F22] [LLaVA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08485) 被 NeurIPS 2023 接受为 **口头报告**，而 [LLaVA-Med](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00890) 则被 NeurIPS 2023 数据集与基准测试赛道接受为 **亮点报告**。\n\n\u003Cdetails>\n\u003Csummary>更多\u003C\u002Fsummary>\n\n- [11\u002F6] 支持 **Intel** 独显和 CPU 平台。[更多详情请点击这里。](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Ftree\u002Fintel\u002Fdocs\u002Fintel)\n- [10\u002F12] LLaVA 现已在 [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002Fpull\u002F3436) 中得到支持，并且支持 4 位\u002F5 位量化！\n- [10\u002F11] LLaVA-1.5 的训练数据和脚本已在此处发布 [这里](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA#train)，评估脚本则在此处发布 [这里](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FEvaluation.md)！\n- [10\u002F10] [Roboflow 深度解析](https:\u002F\u002Fblog.roboflow.com\u002Ffirst-impressions-with-llava-1-5\u002F)：初探 LLaVA-1.5。\n- [9\u002F20] 我们将训练 33B 和 65B LLaVA 模型的实证研究总结成了一篇 [笔记](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.09958)。此外，如果您对多模态基础模型的全面回顾、演进及趋势感兴趣，请查阅我们最近发表的综述论文 [《多模态基础模型：从专家到通用助手》](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.10020)。\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhaotian-liu_LLaVA_readme_54a10cc984e5.jpeg\" width=50%\u002F>\n\u003C\u002Fp>\n\n- [7\u002F19] 🔥 我们发布了重大升级，包括对 LLaMA-2 的支持、LoRA 训练、4\u002F8 位推理、更高分辨率（336x336）等功能。同时，我们还发布了 [LLaVA Bench](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FLLaVA_Bench.md)，用于基准测试开放式视觉聊天任务，并提供了 Bard 和 Bing-Chat 的评测结果。此外，我们也支持并验证了在 RTX 3090 和 RTX A6000 上的训练。请查看 [LLaVA-from-LLaMA-2](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FLLaVA_from_LLaMA2.md)，以及我们的 [模型库](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)！\n- [6\u002F26] [CVPR 2023 教程](https:\u002F\u002Fvlp-tutorial.github.io\u002F)主题为 **大型多模态模型：迈向构建并超越多模态 GPT-4**！请查看 [[幻灯片](https:\u002F\u002Fdatarelease.blob.core.windows.net\u002Ftutorial\u002Fvision_foundation_models_2023\u002Fslides\u002FChunyuan_cvpr2023_tutorial_lmm.pdf)] [[笔记](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.14895)] [[YouTube](https:\u002F\u002Fyoutu.be\u002FmkI7EPD1vp8)] [[Bilibili](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Ng4y1T7v3\u002F)]。\n- [6\u002F11] 我们发布了最受期待功能的预览版：DeepSpeed 和 LoRA 支持！请参阅相关文档 [这里](.\u002Fdocs\u002FLoRA.md)。\n- [6\u002F1] 我们发布了 
**LLaVA-Med：生物医学领域的大型语言与视觉助手**，这是朝着构建具备 GPT-4 级别能力的生物医学领域大型语言与视觉模型迈出的重要一步。请查看 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00890) 和 [项目页面](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLaVA-Med)。\n- [5\u002F6] 我们正在发布 [LLaVA-Lighting-MPT-7B-preview](https:\u002F\u002Fhuggingface.co\u002Fliuhaotian\u002FLLaVA-Lightning-MPT-7B-preview)，基于 MPT-7B-Chat！更多详情请见 [这里](#LLaVA-MPT-7b)。\n- [5\u002F2] 🔥 我们发布了 LLaVA-Lighting！只需花费 40 美元，在 3 小时内即可训练出一个轻量级的多模态 GPT-4！更多详情请见 [这里](#train-llava-lightning)。\n- [4\u002F27] 感谢社区的努力，LLaVA-13B 结合 4 位量化技术，现在可以在仅配备 12GB 显存的 GPU 上运行！您可在此体验 [这里](https:\u002F\u002Fgithub.com\u002Foobabooga\u002Ftext-generation-webui\u002Ftree\u002Fmain\u002Fextensions\u002Fllava)。\n- [4\u002F17] 🔥 我们发布了 **LLaVA：大型语言与视觉助手**。我们提出了视觉指令微调方法，旨在构建具备 GPT-4 级别能力的大规模语言与视觉模型。请查看 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08485) 和 [演示](https:\u002F\u002Fllava.hliu.cc\u002F)。\n\n\u003C\u002Fdetails>\n\n\u003C!-- \u003Ca href=\"https:\u002F\u002Fllava.hliu.cc\u002F\">\u003Cimg src=\"assets\u002Fdemo.gif\" width=\"70%\">\u003C\u002Fa> -->\n\n[![代码许可证](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-Apache_2.0-green.svg)](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002FLICENSE)\n**使用与许可声明**：本项目使用了某些数据集和检查点，这些资源受其各自原始许可证的约束。用户必须遵守所有这些原始许可证中的条款与条件，包括但不限于针对该数据集的 [OpenAI 使用条款](https:\u002F\u002Fopenai.com\u002Fpolicies\u002Fterms-of-use)，以及针对使用该数据集训练的检查点所采用的基础语言模型的具体许可证（例如，针对 LLaMA-2 和 Vicuna-v1.5 的 [Llama 社区许可证](https:\u002F\u002Fai.meta.com\u002Fllama\u002Flicense\u002F)）。本项目不会施加超出原始许可证规定之外的任何额外限制。此外，提醒用户务必确保其对数据集和检查点的使用符合所有适用的法律法规。\n\n\n\n\n## 目录\n- [安装](#install)\n- [LLaVA 权重](#llava-weights)\n- [演示](#Demo)\n- [模型库](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)\n- [数据集](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FData.md)\n- [训练](#train)\n- [评估](#evaluation)\n\n## 安装\n\n如果您不使用 Linux，请*不要*继续操作，而是参阅适用于 [macOS](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FmacOS.md) 和 [Windows](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FWindows.md) 的说明。\n\n1. 克隆本仓库并进入 LLaVA 文件夹\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA.git\ncd LLaVA\n```\n\n2. 安装软件包\n```Shell\nconda create -n llava python=3.10 -y\nconda activate llava\npip install --upgrade pip  # 启用 PEP 660 支持\npip install -e .\n```\n\n3. 
安装用于训练场景的附加包\n```\npip install -e \".[train]\"\npip install flash-attn --no-build-isolation\n```\n\n### 升级至最新代码库\n\n```Shell\ngit pull\npip install -e .\n\n# 如果升级后出现一些导入错误，\n# 请尝试运行以下命令（去掉 #）\n# pip install flash-attn --no-build-isolation --no-cache-dir\n```\n\n### HuggingFace 快速入门\n\n\u003Cdetails>\n\u003Csummary>示例代码\u003C\u002Fsummary>\n\n```Python\nfrom llava.model.builder import load_pretrained_model\nfrom llava.mm_utils import get_model_name_from_path\nfrom llava.eval.run_llava import eval_model\n\nmodel_path = \"liuhaotian\u002Fllava-v1.5-7b\"\n\ntokenizer, model, image_processor, context_len = load_pretrained_model(\n    model_path=model_path,\n    model_base=None,\n    model_name=get_model_name_from_path(model_path)\n)\n```\n\n详细信息请参阅 `llava\u002Fmodel\u002Fbuilder.py` 中的 `load_pretrained_model` 函数。\n\n您还可以使用 `llava\u002Feval\u002Frun_llava.py` 中的 `eval_model` 函数轻松获取输出。通过这种方式，您可以在下载本仓库后直接在 Colab 上运行此代码。\n\n``` python\nmodel_path = \"liuhaotian\u002Fllava-v1.5-7b\"\nprompt = \"我来这里参观时需要注意些什么？\"\nimage_file = \"https:\u002F\u002Fllava-vl.github.io\u002Fstatic\u002Fimages\u002Fview.jpg\"\n\nargs = type('Args', (), {\n    \"model_path\": model_path,\n    \"model_base\": None,\n    \"model_name\": get_model_name_from_path(model_path),\n    \"query\": prompt,\n    \"conv_mode\": None,\n    \"image_file\": image_file,\n    \"sep\": \",\",\n    \"temperature\": 0,\n    \"top_p\": None,\n    \"num_beams\": 1,\n    \"max_new_tokens\": 512\n})()\n\neval_model(args)\n```\n\u003C\u002Fdetails>\n\n## LLaVA 权重\n请查看我们的 [模型动物园](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)，以获取所有公开的 LLaVA 检查点，以及如何使用这些权重的说明。\n\n## 演示\n\n### Gradio Web UI\n\n要在本地启动 Gradio 演示，请依次运行以下命令。如果您计划启动多个模型工作节点来比较不同的检查点，只需*一次*启动控制器和 Web 服务器即可。\n\n```mermaid\nflowchart BT\n    %% 声明节点\n    gws(\"Gradio (UI 服务器)\")\n    c(\"控制器 (API 服务器):\u003Cbr\u002F>端口: 10000\")\n    mw7b(\"模型工作节点:\u003Cbr\u002F>llava-v1.5-7b\u003Cbr\u002F>端口: 40000\")\n    mw13b(\"模型工作节点:\u003Cbr\u002F>llava-v1.5-13b\u003Cbr\u002F>端口: 40001\")\n    sglw13b(\"SGLang 后端:\u003Cbr\u002F>llava-v1.6-34b\u003Cbr\u002F>http:\u002F\u002Flocalhost:30000\")\n    lsglw13b(\"SGLang 工作节点:\u003Cbr\u002F>llava-v1.6-34b\u003Cbr\u002F>端口: 40002\")\n\n    %% 声明样式\n    classDef data fill:#3af,stroke:#48a,stroke-width:2px,color:#444\n    classDef success fill:#8f8,stroke:#0a0,stroke-width:2px,color:#444\n    classDef failure fill:#f88,stroke:#f00,stroke-width:2px,color:#444\n\n    %% 分配样式\n    class id,od data;\n    class cimg,cs_s,scsim_s success;\n    class ncimg,cs_f,scsim_f failure;\n\n    subgraph 演示连接\n        direction BT\n        c\u003C-->gws\n        \n        mw7b\u003C-->c\n        mw13b\u003C-->c\n        lsglw13b\u003C-->c\n        sglw13b\u003C-->lsglw13b\n    end\n```\n\n#### 启动控制器\n```Shell\npython -m llava.serve.controller --host 0.0.0.0 --port 10000\n```\n\n#### 启动 Gradio Web 服务器。\n```Shell\npython -m llava.serve.gradio_web_server --controller http:\u002F\u002Flocalhost:10000 --model-list-mode reload\n```\n您已经成功启动了 Gradio Web 界面。现在，您可以使用屏幕上打印的 URL 打开该界面。您可能会注意到模型列表中还没有任何模型。不用担心，因为我们尚未启动任何模型工作节点。当您启动一个模型工作节点时，模型列表会自动更新。\n\n#### 启动 SGLang 工作节点\n\n这是以高吞吐量服务 LLaVA 模型的推荐方式，您需要先安装 SGLang。请注意，目前 SGLang-LLaVA 尚不支持 `4-bit` 量化。如果您 GPU 显存有限，请查看带有[量化](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA?tab=readme-ov-file#launch-a-model-worker-4-bit-8-bit-inference-quantized)的模型工作节点。\n\n```Shell\npip install \"sglang[all]\"\n```\n\n首先，您将启动一个 SGLang 后端工作节点，它将在 GPU 上执行模型推理。请记住您设置的 `--port`，稍后会用到。\n\n```Shell\n# 
单 GPU\nCUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server --model-path liuhaotian\u002Fllava-v1.5-7b --tokenizer-path llava-hf\u002Fllava-1.5-7b-hf --port 30000\n\n# 多 GPU 张量并行\nCUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --model-path liuhaotian\u002Fllava-v1.5-13b --tokenizer-path llava-hf\u002Fllava-1.5-13b-hf --port 30000 --tp 2\n```\n\n临时分词器：`llava-hf\u002Fllava-1.5-7b-hf`、`llava-hf\u002Fllava-1.5-13b-hf`、`liuhaotian\u002Fllava-v1.6-34b-tokenizer`。\n\n接下来，您将启动一个 LLaVA-SGLang 工作节点，它负责在 LLaVA 控制器和 SGLang 后端之间通信，以路由请求。将 `--sgl-endpoint` 设置为 `http:\u002F\u002F127.0.0.1:端口`，其中 `端口` 是您刚刚设置的（默认为 30000）。\n\n```Shell\npython -m llava.serve.sglang_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --sgl-endpoint http:\u002F\u002F127.0.0.1:30000\n```\n\n#### 启动模型工作节点\n\n这是实际在 GPU 上执行推理的*工作节点*。每个工作节点负责 `--model-path` 中指定的单个模型。\n\n```Shell\npython -m llava.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path liuhaotian\u002Fllava-v1.5-13b\n```\n\n等待进程完成模型加载，并看到“Uvicorn 正在运行…”的消息。此时，刷新您的 Gradio Web 界面，您就会在模型列表中看到刚刚启动的模型。\n\n您可以根据需要启动任意数量的工作节点，并在同一 Gradio 界面中比较不同模型的检查点。请保持 `--controller` 不变，但为每个工作节点修改 `--port` 和 `--worker` 的端口号。\n\n```Shell\npython -m llava.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port \u003C不同于 40000，例如 40001> --worker http:\u002F\u002Flocalhost:\u003C相应地更改，即 40001> --model-path \u003Cckpt2>\n```\n\n如果您使用的是搭载 M1 或 M2 芯片的 Apple 设备，可以通过 `--device` 标志指定 mps 设备：`--device mps`。\n\n#### 启动模型工作节点（多 GPU，当 GPU 显存 ≤ 24GB）\n\n如果您的 GPU 显存小于 24GB（例如 RTX 3090、RTX 4090 等），可以尝试使用多 GPU 运行。我们最新的代码库会在您拥有多个 GPU 时自动尝试使用多 GPU。您可以通过 `CUDA_VISIBLE_DEVICES` 指定要使用的 GPU。以下是使用前两块 GPU 的示例：\n\n```Shell\nCUDA_VISIBLE_DEVICES=0,1 python -m llava.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path liuhaotian\u002Fllava-v1.5-13b\n```\n\n#### 启动模型工作节点（4-bit、8-bit 推理，量化）\n\n您可以启动带有量化位数（4-bit、8-bit）的模型工作节点，这样可以减少 GPU 内存占用，从而可能在显存低至 12GB 的 GPU 上运行。请注意，使用量化位数进行推理的准确性可能不如全精度模型。只需在您正在执行的**模型工作节点**命令中添加 `--load-4bit` 或 `--load-8bit` 即可。以下是使用 4-bit 量化运行的示例：\n\n```Shell\npython -m llava.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path liuhaotian\u002Fllava-v1.5-13b --load-4bit\n```\n\n#### 启动模型工作节点（LoRA 权重，未合并）\n\n您可以启动带有 LoRA 权重的模型工作节点，而不将其与基础检查点合并，以节省磁盘空间。虽然加载时间会有所增加，但推理速度与已合并的检查点相同。未合并的 LoRA 检查点在模型名称中没有 `lora-merge`，通常比已合并的检查点小得多（7B 的未合并版本小于 1GB，而 13B 的未合并版本约为 25GB）。\n\n要加载未合并的 LoRA 权重，您只需传递一个额外的参数 `--model-base`，即用于训练 LoRA 权重的基础 LLM。您可以在 [模型动物园](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)中查看每组 LoRA 权重对应的基础 LLM。\n\n```Shell\npython -m llava.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path liuhaotian\u002Fllava-v1-0719-336px-lora-vicuna-13b-v1.3 --model-base lmsys\u002Fvicuna-13b-v1.3\n```\n\n### 命令行推理\n\n使用 LLaVA 在无需 Gradio 界面的情况下对图像进行对话。它还支持多 GPU、4 位和 8 位量化推理。采用 4 位量化后，对于我们的 LLaVA-1.5-7B 模型，在单个 GPU 上仅需不到 8GB 显存。\n\n```Shell\npython -m llava.serve.cli \\\n    --model-path liuhaotian\u002Fllava-v1.5-7b \\\n    --image-file \"https:\u002F\u002Fllava-vl.github.io\u002Fstatic\u002Fimages\u002Fview.jpg\" \\\n    --load-4bit\n```\n\n\u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhaotian-liu_LLaVA_readme_972b9db45c90.gif\" width=\"70%\">\n\n## 训练\n\n*以下是 LLaVA v1.5 的最新训练配置。对于旧版本模型，请暂时参考 [此](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Ftree\u002Fv1.0.1) 版本的 README。我们稍后会将其添加到单独的文档中。*\n\nLLaVA 的训练分为两个阶段：(1) 特征对齐阶段：使用我们从 LAION-CC-SBU 数据集中提取的 55.8 万张子集，将一个 *冻结的预训练* 视觉编码器与一个 *冻结的 LLM* 连接起来；(2) 视觉指令微调阶段：使用由 GPT 生成的 15 万条多模态指令遵循数据，再加上约 51.5 万条面向学术任务的 VQA 数据，来训练模型遵循多模态指令。\n\nLLaVA 在 8 张 80GB 显存的 A100 GPU 上进行训练。如果希望在较少的 GPU 上训练，可以相应地减少 `per_device_train_batch_size` 并增加 `gradient_accumulation_steps`。务必保持全局批次大小不变：`per_device_train_batch_size` × `gradient_accumulation_steps` × `num_gpus`。\n\n### 超参数\n我们在微调时使用了与 Vicuna 类似的超参数集合。以下同时提供了预训练和微调所使用的超参数。\n\n1. 预训练\n\n| 超参数 | 全局批次大小 | 学习率 | 轮数 | 最大长度 | 权重衰减 |\n| --- | ---: | ---: | ---: | ---: | ---: |\n| LLaVA-v1.5-13B | 256 | 1e-3 | 1 | 2048 | 0 |\n\n2. 微调\n\n| 超参数 | 全局批次大小 | 学习率 | 轮数 | 最大长度 | 权重衰减 |\n| --- | ---: | ---: | ---: | ---: | ---: |\n| LLaVA-v1.5-13B | 128 | 2e-5 | 1 | 2048 | 0 |\n\n### 自动下载 Vicuna 检查点\n我们的基础模型 Vicuna v1.5 是一个经过指令微调的聊天机器人，当您运行我们提供的训练脚本时，它会自动下载，无需任何操作。\n\n### 预训练（特征对齐）\n请从 [这里](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fliuhaotian\u002FLLaVA-Pretrain) 下载我们在论文中使用的带有 BLIP 字幕的 LAION-CC-SBU 数据集 55.8 万张子集。\n\n由于分辨率提升至 336px，LLaVA-v1.5-13B 在 8 张 A100 (80G) 上预训练大约需要 5.5 小时。而 LLaVA-v1.5-7B 则大约需要 3.5 小时。\n\n使用 DeepSpeed ZeRO-2 的训练脚本：[`pretrain.sh`](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fscripts\u002Fv1_5\u002Fpretrain.sh)。\n\n- `--mm_projector_type mlp2x_gelu`：两层 MLP 视觉-语言连接器。\n- `--vision_tower openai\u002Fclip-vit-large-patch14-336`：CLIP ViT-L\u002F14 336px。\n\n\u003Cdetails>\n\u003Csummary>LLaVA-7B 在 8 张 V100 (32G) 上预训练大约需要 20 小时\u003C\u002Fsummary>\n\n我们在此提供使用 DeepSpeed 的训练脚本 [here](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fscripts\u002Fpretrain_xformers.sh)。\n提示：\n- 如果您使用的是不支持 FlashAttention 的 V100 显卡，可以使用 [xFormers](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fxformers) 中实现的 [内存高效注意力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.05682)。安装 xformers，并将上述 `llava\u002Ftrain\u002Ftrain_mem.py` 替换为 [llava\u002Ftrain\u002Ftrain_xformers.py](llava\u002Ftrain\u002Ftrain_xformers.py)。\n\u003C\u002Fdetails>\n\n### 视觉指令微调\n\n1. 准备数据\n\n请下载我们指令微调数据最终混合体的标注文件 [llava_v1_5_mix665k.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fliuhaotian\u002FLLaVA-Instruct-150K\u002Fblob\u002Fmain\u002Fllava_v1_5_mix665k.json)，并从组成数据集的各个来源下载图像：\n\n- COCO：[train2017](http:\u002F\u002Fimages.cocodataset.org\u002Fzips\u002Ftrain2017.zip)\n- GQA：[images](https:\u002F\u002Fdownloads.cs.stanford.edu\u002Fnlp\u002Fdata\u002Fgqa\u002Fimages.zip)\n- OCR-VQA：[下载脚本](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing)，**我们将所有文件保存为 `.jpg` 格式**\n- TextVQA：[train_val_images](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Ftextvqa\u002Fimages\u002Ftrain_val_images.zip)\n- VisualGenome：[part1](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Frak248\u002FVG_100K_2\u002Fimages.zip)、[part2](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Frak248\u002FVG_100K_2\u002Fimages2.zip)\n\n全部下载完成后，请按照以下方式在 `.\u002Fplayground\u002Fdata` 目录下组织数据：\n\n```\n├── coco\n│   └── train2017\n├── gqa\n│   └── images\n├── ocr_vqa\n│   └── images\n├── textvqa\n│   └── train_images\n└── vg\n    ├── VG_100K\n    └── VG_100K_2\n```\n\n2. 
开始训练！\n\n您可以从 [Model Zoo](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md) 下载我们预训练好的投影器。不建议使用旧版投影器，因为它们可能是用不同版本的代码库训练的，一旦有任何选项设置错误，模型将无法按预期运行或训练。\n\n由于分辨率提升至 336px，LLaVA-v1.5-13B 在 8 张 A100 (80G) 上视觉指令微调大约需要 20 小时。而 LLaVA-v1.5-7B 在 8 张 A100 (40G) 上则大约需要 10 小时。\n\n使用 DeepSpeed ZeRO-3 的训练脚本：[`finetune.sh`](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fscripts\u002Fv1_5\u002Ffinetune.sh)。\n\n如果您显存不足：\n\n- 使用 LoRA：[`finetune_lora.sh`](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fscripts\u002Fv1_5\u002Ffinetune_lora.sh)。我们可以在 8 张 A100-40G 或 8 张 A6000 上完成 13B 模型的训练，在 8 张 RTX3090 上完成 7B 模型的训练。请确保 `per_device_train_batch_size*gradient_accumulation_steps` 与提供的脚本一致，以获得最佳的可重复性。\n- 将 `zero3.json` 替换为 `zero3_offload.json`，该文件会将部分参数卸载到 CPU 内存中。这会降低训练速度。\n\n如果您有兴趣将 LLaVA 模型微调到自己的任务或数据上，请参阅 [`Finetune_Custom_Data.md`](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FFinetune_Custom_Data.md)。\n\n需要注意的新选项：\n\n- `--mm_projector_type mlp2x_gelu`：两层 MLP 视觉-语言连接器。\n- `--vision_tower openai\u002Fclip-vit-large-patch14-336`：CLIP ViT-L\u002F14 336px。\n- `--image_aspect_ratio pad`：此选项会将非正方形图像填充为正方形，而不是裁剪它们；这可以略微减少幻觉现象。\n- `--group_by_modality_length True`：此选项仅应在您的指令微调数据集同时包含语言（例如 ShareGPT）和多模态（例如 LLaVA-Instruct）数据时使用。它会使训练采样器在训练过程中只采样单一模态（要么是图像，要么是语言），我们观察到这样可以使训练速度加快约 25%，且不会影响最终结果。\n\n## 评估\n\n在 LLaVA-1.5 中，我们使用一组多样化的 12 个基准测试来评估模型。为了确保可重复性，我们使用贪婪解码来评估模型。我们不使用束搜索进行评估，以使推理过程与实时输出的聊天演示保持一致。\n\n详情请参阅 [Evaluation.md](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FEvaluation.md)。\n\n### GPT辅助评估\n\n我们提供的多模态建模GPT辅助评估流程，旨在全面了解视觉-语言模型的能力。更多详情请参阅我们的论文。\n\n1. 生成LLaVA响应\n\n```Shell\npython model_vqa.py \\\n    --model-path .\u002Fcheckpoints\u002FLLaVA-13B-v0 \\\n    --question-file \\\n    playground\u002Fdata\u002Fcoco2014_val_qa_eval\u002Fqa90_questions.jsonl \\\n    --image-folder \\\n    \u002Fpath\u002Fto\u002Fcoco2014_val \\\n    --answers-file \\\n    \u002Fpath\u002Fto\u002Fanswer-file-our.jsonl\n```\n\n2. 对生成的响应进行评估。在本例中，[`answer-file-ref.jsonl`](.\u002Fplayground\u002Fdata\u002Fcoco2014_val_qa_eval\u002Fqa90_gpt4_answer.jsonl) 是由纯文本GPT-4 (0314) 生成的响应，上下文包含说明文字和边界框信息。\n\n```Shell\nOPENAI_API_KEY=\"sk-***********************************\" python llava\u002Feval\u002Feval_gpt_review_visual.py \\\n    --question playground\u002Fdata\u002Fcoco2014_val_qa_eval\u002Fqa90_questions.jsonl \\\n    --context llava\u002Feval\u002Ftable\u002Fcaps_boxes_coco2014_val_80.jsonl \\\n    --answer-list \\\n    \u002Fpath\u002Fto\u002Fanswer-file-ref.jsonl \\\n    \u002Fpath\u002Fto\u002Fanswer-file-our.jsonl \\\n    --rule llava\u002Feval\u002Ftable\u002Frule.json \\\n    --output \u002Fpath\u002Fto\u002Freview.json\n```\n\n3. 
总结评估结果\n\n```Shell\npython summarize_gpt_review.py\n```\n\n## 引用\n\n如果您发现 LLaVA 对您的研究和应用有所帮助，请使用以下 BibTeX 格式引用（保留原始英文条目，便于直接复制使用）：\n\n```bibtex\n@misc{liu2024llavanext,\n    title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},\n    url={https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-01-30-llava-next\u002F},\n    author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},\n    month={January},\n    year={2024}\n}\n\n@misc{liu2023improvedllava,\n      title={Improved Baselines with Visual Instruction Tuning}, \n      author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},\n      publisher={arXiv:2310.03744},\n      year={2023},\n}\n\n@misc{liu2023llava,\n      title={Visual Instruction Tuning}, \n      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},\n      publisher={NeurIPS},\n      year={2023},\n}\n```\n\n## 致谢\n\n- [Vicuna](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat)：我们在其代码库基础上构建，基座模型 Vicuna-13B 拥有卓越的语言能力！\n\n## 相关项目\n\n- [使用GPT-4进行指令微调](https:\u002F\u002Fgithub.com\u002FInstruction-Tuning-with-GPT-4\u002FGPT-4-LLM)\n- [LLaVA-Med：一天内训练用于生物医学的大规模视觉-语言助手](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLLaVA-Med)\n- [Otter：上下文多模态指令微调](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter)\n\n如需未来项目创意，请查看：\n- [SEEM：一次完成所有地方的所有内容分割](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once)\n- [Grounded-Segment-Anything](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGrounded-Segment-Anything)，通过结合[Grounding DINO](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGroundingDINO)和[Segment-Anything](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything)，实现检测、分割及生成任何内容的功能。","# LLaVA 快速上手指南\n\nLLaVA (Large Language and Vision Assistant) 是一个开源的多模态大语言模型，通过视觉指令微调，实现了接近 GPT-4 级别的图像理解与对话能力。本指南将帮助你快速在本地部署并运行 LLaVA。\n\n## 1. 环境准备\n\n在开始之前，请确保你的开发环境满足以下要求：\n\n*   **操作系统**: 推荐 **Linux** (Ubuntu\u002FCentOS 等)。\n    *   *注：macOS 和 Windows 用户请参考官方文档中的特定适配方案，本指南主要针对 Linux 环境。*\n*   **硬件要求**:\n    *   **GPU**: 推荐使用 NVIDIA GPU。\n        *   运行 7B\u002F13B 模型建议显存 ≥ 16GB (使用量化版本可降低至 12GB 或更低)。\n        *   训练或运行更大参数模型（如 34B+）需要更高显存或多卡环境。\n    *   **CPU**: 多核处理器，建议 8 核以上。\n    *   **内存**: 建议 32GB 以上。\n*   **软件依赖**:\n    *   **Conda**: 用于管理 Python 环境 (推荐 Miniconda 或 Anaconda)。\n    *   **CUDA**: 已安装与你的 GPU 驱动兼容的版本 (通常建议 CUDA 11.8 或 12.1)。\n    *   **Git**: 用于克隆代码库。\n\n## 2. 安装步骤\n\n### 第一步：克隆仓库\n打开终端，执行以下命令获取最新代码：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA.git\ncd LLaVA\n```\n\n### 第二步：创建并激活 Conda 环境\n创建一个名为 `llava` 的 Python 3.10 环境：\n\n```bash\nconda create -n llava python=3.10 -y\nconda activate llava\n```\n\n### 第三步：安装核心依赖\n升级 pip 以支持 PEP 660，并安装 LLaVA 包：\n\n```bash\npip install --upgrade pip  # enable PEP 660 support\npip install -e .\n```\n\n> **国内加速提示**：如果下载速度较慢，可临时切换至清华或阿里镜像源：\n> `pip install -e . -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n\n### 第四步：安装训练与推理增强包（可选但推荐）\n如果你计划进行模型训练或使用 Flash Attention 加速推理，请执行：\n\n```bash\npip install -e \".[train]\"\npip install flash-attn --no-build-isolation\n```\n\n> **注意**：`flash-attn` 编译可能需要较长时间，请确保已安装对应的 CUDA Toolkit。若遇到导入错误，可尝试添加 `--no-cache-dir` 参数重新安装。\n\n## 3. 
基本使用\n\n安装完成后，你可以通过 Python 脚本快速加载模型并进行图像对话。以下是一个基于 Hugging Face `transformers` 接口的最小化示例。\n\n**前提条件**：你需要先从 [Hugging Face Model Zoo](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Flmms-lab\u002Fllava-next-6623288e2d61edba3ddbf5ff) 或 [LLaVA 官方模型库](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md) 下载模型权重（例如 `llava-v1.5-7b`），或者直接在代码中使用模型 ID（需联网）。\n\n```python\nimport torch\nfrom llava.model.builder import load_pretrained_model\nfrom llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token\nfrom llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN\nfrom transformers import TextIteratorStreamer\nfrom threading import Thread\n\n# 配置模型路径 (可以是本地路径或 HF model_id)\nmodel_path = \"llava-hf\u002Fllava-1.5-7b-hf\" \n\n# 加载预训练模型\ntokenizer, model, image_processor, context_len = load_pretrained_model(\n    model_path=model_path,\n    model_base=None,\n    model_name=get_model_name_from_path(model_path),\n    device=\"cuda\",\n    device_map={\"\": \"cuda\"},\n    torch_dtype=torch.float16,\n    load_8bit=False,\n    cpu_offloading=False,\n    debug=False\n)\n\n# 准备输入\nimage_file = \"example.jpg\"  # 替换为你的图片路径\nprompt = \"这张图片里有什么？请详细描述。\"\n\n# 处理图像和文本\n# 注意：实际使用时需根据具体版本调整 prompt 格式，LLaVA-1.5+ 通常自动处理 \u003Cimage> 标记\ninputs = tokenizer_image_token(prompt, tokenizer, return_tensors='pt').unsqueeze(0).cuda()\n\n# 生成回复 (简化版调用，实际生产环境建议使用完整的 generate 流程)\n# 此处仅为演示加载成功，完整生成逻辑需参考官方 demo 脚本\nprint(\"模型加载成功！可以进行推理。\")\n```\n\n**更简单的命令行测试方式**：\n如果你已经下载了模型权重，可以使用官方提供的推理脚本（需确保图片路径正确）：\n\n```bash\npython llava\u002Fserve\u002Fcli.py \\\n    --model-path \u002Fpath\u002Fto\u002Fllava-v1.5-7b \\\n    --image-file .\u002Fimages\u002Fllama.png \\\n    --query \"What is the animal in the picture doing?\"\n```\n\n运行后，终端将输出模型对图片的描述。","某电商平台的客服团队每天需处理成千上万张用户上传的商品破损或发错货的照片，传统流程依赖人工逐一查看并录入系统。\n\n### 没有 LLaVA 时\n- 客服人员必须肉眼仔细辨别图片细节，耗时费力且容易因疲劳产生误判。\n- 无法直接通过自然语言询问图片内容（如“这个划痕在什么位置？”），只能依靠人工描述转录。\n- 遇到复杂场景（如包装完好但内部物品碎裂）时，非专业人员难以准确界定责任归属。\n- 大量图片数据沉睡在服务器中，无法转化为结构化文本数据用于后续的质量分析。\n- 响应速度慢，用户往往需要等待数小时甚至隔天才能收到初步反馈，严重影响体验。\n\n### 使用 LLaVA 后\n- LLaVA 能自动“看懂”用户上传的图片，瞬间识别出破损类型、位置及严重程度，并生成详细文字报告。\n- 支持多轮视觉对话，系统可直接回答“包装盒是否变形”或“标签是否匹配”等具体问题，无需人工介入。\n- 面对复杂场景，LLaVA 结合视觉与逻辑推理能力，能准确判断是物流问题还是发货错误，辅助定责。\n- 自动将图片信息转化为结构化数据存入数据库，为产品质检和物流优化提供实时数据支撑。\n- 实现秒级自动回复，用户上传图片后立即获得初步解决方案，大幅缩短客诉处理周期。\n\nLLaVA 将原本割裂的视觉感知与语言理解能力深度融合，让机器真正具备了像人类专家一样“看图说话”并解决实际问题的高效生产力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhaotian-liu_LLaVA_ad660d97.png","haotian-liu","Haotian Liu","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhaotian-liu_dcf83070.jpg",null,"xAI","Cupertino, CA","imhaotian","https:\u002F\u002Fhliu.cc","https:\u002F\u002Fgithub.com\u002Fhaotian-liu",[86,90,94,98,102,106],{"name":87,"color":88,"percentage":89},"Python","#3572A5",85.7,{"name":91,"color":92,"percentage":93},"Shell","#89e051",8.5,{"name":95,"color":96,"percentage":97},"JavaScript","#f1e05a",2.8,{"name":99,"color":100,"percentage":101},"HTML","#e34c26",2.1,{"name":103,"color":104,"percentage":105},"CSS","#663399",0.5,{"name":107,"color":108,"percentage":109},"Dockerfile","#384d54",0.4,24647,2758,"2026-04-05T05:01:46","Apache-2.0","Linux, macOS, Windows","训练必需 NVIDIA GPU (如 8xA100, RTX 3090, RTX A6000)；推理支持 4-bit 量化在 12GB VRAM 上运行，也支持 Intel dGPU 和 CPU","未说明",{"notes":118,"python":119,"dependencies":120},"非 Linux 用户需参考专门的 macOS 或 Windows 文档。训练建议安装 flash-attn（无隔离构建）。支持 LoRA 微调以降低显存需求。支持 llama.cpp 进行 4-bit\u002F5-bit 量化推理。项目依赖的基础模型（如 
LLaMA）和数据集需遵守各自的原始许可证。","3.10",[121,122,123,124,125,126,127,128,129],"torch","transformers","accelerate","peft","flash-attn","deepspeed","scikit-learn","sentencepiece","bitsandbytes",[54,26,13],[132,133,134,135,136,137,138,139,140,141,142,143,144],"gpt-4","chatbot","chatgpt","llama","multimodal","llava","foundation-models","instruction-tuning","multi-modality","visual-language-learning","llama-2","llama2","vision-language-model","2026-03-27T02:49:30.150509","2026-04-06T08:46:03.361774",[148,153,158,163,168,173],{"id":149,"question_zh":150,"answer_zh":151,"source_url":152},17044,"启动服务时遇到 ImportError: cannot import name 'LlavaLlamaForCausalLM' from 'llava.model' 错误怎么办？","该错误通常由包版本不兼容或文件路径变更引起。解决方案如下：\n1. 检查类定义位置：运行命令 `grep -R \"class LlavaLlamaForCausalLM\" -n llava | head`。如果输出显示类在 `llava\u002Fmodel\u002Flanguage_model\u002Fllava_llama.py`，则需修改导入路径。\n2. 修改代码：打开 `\u002FLLaVA\u002Fllava\u002F__init__.py`，将 `from .model import LlavaLlamaForCausalLM` 替换为实际路径，例如 `from llava.model.language_model.llava_llama import LlavaLlamaForCausalLM`。\n3. 更新依赖：确保安装兼容版本的库，推荐配置为 `transformers==4.37.2`, `accelerate==0.28.0`, `deepspeed==0.14.4` 以及 `flash-attn\u003C=2.6.3`。也可尝试运行 `pip install -e \".[train]\"` 重新安装环境。","https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fissues\u002F1101",{"id":154,"question_zh":155,"answer_zh":156,"source_url":157},17045,"如何在自定义数据集上评估微调后的 LLaVA 模型？数据格式是什么？","评估自定义数据时，需要构建符合要求的 JSONL 格式文件。每条数据应包含图像路径、文本问题、类别和问题 ID。示例格式如下：\n{\"image\": \"SomeImage_141.JPG\", \"text\": \"Some Question ?\", \"category\": \"conv\", \"question_id\": 0}\n其中 \"image\" 字段指向图像文件路径，\"text\" 为对应的问题内容。准备好该文件后，可参考项目文档中的评估脚本进行推理测试。","https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fissues\u002F963",{"id":159,"question_zh":160,"answer_zh":161,"source_url":162},17046,"预训练（pretrain）过程中报错或失败如何解决？","预训练阶段的许多错误（如算子不支持或内存问题）可以通过升级 PyTorch 版本解决。建议将 PyTorch 升级到 2.0 或更高版本。此外，如果是多卡训练（如 8 张 A100）失败而少卡正常，请检查显存分配和 NCCL 配置，确保使用了最新的 CUDA 驱动，并尝试在 Docker 容器中重新部署整个项目以排除环境干扰。","https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fissues\u002F24",{"id":164,"question_zh":165,"answer_zh":166,"source_url":167},17047,"运行 model_worker 时出现 AttributeError: 'LlamaModel' object has no attribute 'vision_tower' 错误？","此错误表明模型配置文件缺失视觉塔（vision tower）相关参数。解决方法：\n1. 检查模型文件夹下的 `config.json`，确认是否包含 `mm_vision_tower`, `mm_hidden_size`, `mm_use_im_start_end` 等字段。\n2. 如果缺失，说明模型权重未正确转换或下载的是旧版本 Delta 权重。请重新下载最新的 13B Delta v0 权重，并按照官方指南重新生成完整的 LLaVA 模型文件。\n3. 若问题依旧，建议在带有新版 CUDA 驱动的 Docker 环境中重新安装整个项目。","https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fissues\u002F15",{"id":169,"question_zh":170,"answer_zh":171,"source_url":172},17048,"如何实现 LLaVA 的批量评估（Batch Evaluation）以提高效率？","目前推荐使用 SGLang 后端进行批量评估，速度可提升 5 倍。示例代码见 SGLang 仓库的 llava_bench。若需在原生代码中实现批量推理，需注意填充（Padding）方向：\n1. 必须使用左侧填充（Left Padding）。可以自定义函数 `left_pad_sequence_to_max_length` 对 input_ids 进行填充。\n2. 在处理批次提示词时，先计算最大长度，然后将所有序列填充至该长度并堆叠（stack）。\n3. 注意：原生脚本可能存在生成 NaN 的问题，需关注后续修复或使用 SGLang 替代方案。","https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fissues\u002F754",{"id":174,"question_zh":175,"answer_zh":176,"source_url":152},17049,"安装 flash-attn 时遇到构建错误或兼容性问题如何处理？","安装 flash-attn 失败通常与 CUDA 版本或构建隔离有关。建议步骤：\n1. 手动下载特定版本的 wheel 包（如 flash-attn 2.7.1 或 \u003C=2.6.3），避免直接使用 pip 在线编译。\n2. 使用命令 `pip install flash-attn --no-build-isolation --no-cache-dir` 进行安装。\n3. 
确保系统 CUDA 版本（如 12.4）与安装的 flash-attn 版本兼容。如果仍然报错，尝试降低 flash-attn 版本至 2.6.3 以下，或检查 gcc\u002Fg++ 编译器版本是否匹配。",[178,183,188,193,198,203],{"id":179,"version":180,"summary_zh":181,"released_at":182},99294,"v1.2.0","LLaVA-1.6 正式发布！在 LLaVA-1.5 的基础上进一步扩大规模，LLaVA-1.6-34B 在部分基准测试上超越了 Gemini Pro。现在，它能够处理的像素数量是之前的四倍，并支持更多任务和应用场景。欢迎查看[博客文章](https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-01-30-llava-1-6\u002F)，并体验[演示](https:\u002F\u002Fllava.hliu.cc\u002F)！模型已在[模型库](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)中上线。训练与评估数据及脚本即将开放下载。","2024-01-31T06:07:08",{"id":184,"version":185,"summary_zh":186,"released_at":187},99295,"v1.1.3","### 更新\n\n- 支持在 LLaVA-1.5 的指令微调阶段使用 LoRA——性能可与全模型微调相媲美，且对 GPU 显存的需求更低。（[检查点\u002F日志](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md#llava-v15)、[脚本](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA#train)）\n- 使用您自己的数据对 LLaVA-1.5 进行微调，以适配您的特定任务。（[说明](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FFinetune_Custom_Data.md)）\n- 提供对 Windows 系统的基本支持。（[说明](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FWindows.md)）\n- 修复：启用梯度累积时的训练行为与大批次训练一致。\n\n### 注意事项\n\n- LLaVA-1.5 采用了新的 LoRA 配置：\n  - rank：128\n  - alpha：256\n  - LoRA 学习率：2e-4\n  - 投影层学习率：2e-5\n","2023-10-26T20:40:13",{"id":189,"version":190,"summary_zh":191,"released_at":192},99296,"v1.1.1","在这一版本中，我们发布了 LLaVA 1.5 的[训练](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA#train)脚本、数据以及基准测试上的[评估](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FEvaluation.md)脚本。立即动手打造属于你的 LLaVA 吧！\n\n> LLaVA-1.5 在 11 个基准上取得了当前最优性能；仅通过对原始 LLaVA 进行简单修改，便充分利用了所有公开数据，在单台配备 8 张 A100 显卡的节点上约 1 天内即可完成训练，并超越了诸如 Qwen-VL-Chat 等使用百亿级别数据的方法。欢迎查看[技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.03744)，并体验[演示](https:\u002F\u002Fllava.hliu.cc\u002F)！模型已在[模型库](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)中开放下载！","2023-10-12T00:17:34",{"id":194,"version":195,"summary_zh":196,"released_at":197},99297,"v1.1.0","🔥 LLaVA-1.5 正式发布！本次版本支持 LLaVA-1.5 模型的推理与服务部署。\n*我们将在下周发布训练脚本、数据集以及在各基准上的评估脚本。*\n\n> LLaVA-1.5 在 11 个基准上取得 SOTA 成绩，仅对原版 LLaVA 进行了简单修改，便充分利用了所有公开数据，在单台配备 8 张 A100 显卡的节点上约 1 天即可完成训练，性能超越了诸如 Qwen-VL-Chat 等使用数十亿规模数据的方法。欢迎查看[技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.03744)，并体验[演示](https:\u002F\u002Fllava.hliu.cc\u002F)！模型已收录于[模型库](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)，训练与评估脚本将于下周发布！","2023-10-08T14:01:33",{"id":199,"version":200,"summary_zh":201,"released_at":202},99298,"v1.0.2","- 新增模型库\n- 使用最新的训练配置提升了对 ScienceQA 的支持\n- 优化了文档\n\n我们正在持续改进文档。如果您发现任何文档内容不够清晰，请随时告诉我们，谢谢！","2023-09-05T18:30:32",{"id":204,"version":205,"summary_zh":206,"released_at":207},99299,"v1.0.1","- 新增 **LLaMA-2** 支持\n- **全面支持 LoRA**。为了使模型训练更加便捷，我们发布了一组基于 LoRA 的模型权重，支持在学术级硬件上进行训练（例如 4 张 A6000 或 8 张 3090 显卡），**无需 CPU offloading**。\n- 针对大型多模态模型的训练进行了更灵活的设计，包括可切换不同的语言模型、视觉编码器等，更多功能即将推出。\n- 使用 CLIP-ViT-L-336px 作为视觉编码器，支持更高分辨率的输入，以实现更细致的视觉理解。\n- 对部分设计进行了简化和优化，使训练过程更加简单流畅。\n- 完全支持 DeepSpeed。\n- 改进了预训练阶段的模型检查点保存方式，从而节省磁盘空间。\n- 优化了 WebUI 界面。\n- 提升了多 GPU 推理的支持能力。\n- 支持 4 位和 8 位量化推理。\n- 支持交互式 CLI 推理。\n\n本次发布的所有模型均采用 LLaVA-LCS-558K 进行预训练，并使用 LLaVA-Instruct-80K 进行指令微调，以确保训练预算高效且经济可行。**在 8 张 3090 显卡上，完整的训练流程（包括预训练和微调）可在 6 小时内完成。**\n\n*我们希望此次发布能够进一步惠及社区，让大型多模态模型更加易于获取和使用。*\n\n#### 详细变更\n\n- 
分词器。我们移除了对额外标记（`\u003CIM_START>`、`\u003CIM_END>`、`\u003CIM_PATCH>`）的依赖，因此在预训练阶段，分词器完全保持不变，仅更新线性投影层的权重。\n- 提示词。\n    - 预训练。我们简化了预训练提示词，去除了诸如“描述图像细节”之类的附加指令。实践表明，这样做不仅不影响零样本推理效果，还能略微提升训练速度。\n    - 我们保持训练和测试提示词的一致性，这被证实能够在推理阶段小幅提升模型性能。","2023-07-30T01:51:38"]