[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-PKU-YuanGroup--MoE-LLaVA":3,"tool-PKU-YuanGroup--MoE-LLaVA":65},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160411,2,"2026-04-18T23:33:24",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},8553,"spec-kit","github\u002Fspec-kit","Spec Kit 是一款专为提升软件开发效率而设计的开源工具包，旨在帮助团队快速落地“规格驱动开发”（Spec-Driven Development）模式。传统开发中，需求文档往往与代码实现脱节，导致沟通成本高且结果不可控；而 Spec Kit 通过将规格说明书转化为可执行的指令，让 AI 直接依据明确的业务场景生成高质量代码，从而减少从零开始的随意编码，确保产出结果的可预测性。\n\n该工具特别适合希望利用 AI 辅助编程的开发者、技术负责人及初创团队。无论是启动全新项目还是在现有工程中引入规范化流程，用户只需通过简单的命令行操作，即可初始化项目并集成主流的 AI 编程助手。其核心技术亮点在于“规格即代码”的理念，支持社区扩展与预设模板，允许用户根据特定技术栈定制开发流程。此外，Spec Kit 强调官方维护的安全性，提供稳定的版本管理，帮助开发者在享受 AI 红利的同时，依然牢牢掌握架构设计的主动权，真正实现从“凭感觉写代码”到“按规格建系统”的转变。",88749,"2026-04-17T09:48:14",[15,26,14,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":10,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85267,"2026-04-18T11:00:28",[26,51,52,53,14,54,15,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,51,54],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":81,"owner_twitter":80,"owner_website":82,"owner_url":83,"languages":84,"stars":105,"forks":106,"last_commit_at":107,"license":108,"difficulty_score":109,"env_os":110,"env_gpu":111,"env_ram":112,"env_deps":113,"category_tags":120,"github_topics":121,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":126,"updated_at":127,"faqs":128,"releases":157},9429,"PKU-YuanGroup\u002FMoE-LLaVA","MoE-LLaVA","【TMM 2025🔥】 Mixture-of-Experts for Large Vision-Language Models","MoE-LLaVA 是一款面向大型视觉 - 语言模型的创新开源项目，旨在通过“混合专家”（Mixture-of-Experts, MoE）架构提升模型处理复杂多模态任务的能力。传统视觉 - 语言模型在参数量增大时往往面临计算成本高昂、推理速度慢的瓶颈，而 MoE-LLaVA 巧妙地将模型划分为多个专用“专家”子网络，仅在处理特定任务时动态激活相关部分。这种设计不仅显著降低了计算资源消耗，还大幅提升了模型在图像理解、视频分析及跨模态推理等场景下的效率与准确性。\n\n该项目特别适合研究人员和开发者使用，尤其是那些希望在有限算力条件下探索大规模多模态模型潜力的团队。对于需要快速原型验证或部署高效 AI 应用的企业技术部门，MoE-LLaVA 也提供了灵活的接口和丰富的预训练模型支持。其核心技术亮点在于动态路由机制，能够智能分配输入数据到最合适的专家模块，从而实现性能与资源的最优平衡。此外，项目社区活跃，配套提供了 Hugging Face 空间演示、Colab 笔记本及详细文档，方便用户快速上手。无论是学术研究还是工业落地，MoE-LLaVA 都为多模态人工智能的发展提供了强有力的工具支持。","\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_2b0bdf10f458.png\" width=\"250\" style=\"margin-bottom: 0.2;\"\u002F>\n\u003Cp>\n\u003Ch2 align=\"center\"> \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.15947\">MoE-LLaVA: Mixture of Experts for Large Vision-Language Models\u003C\u002Fa>\u003C\u002Fh2>\n\u003Ch5 align=\"center\"> If you like our project, please give us a star ⭐ on GitHub for latest update.  
\u003C\u002Fh2>\n\n\u003Ch5 align=\"center\">\n    \n\n\n[![hf_space](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Open%20In%20Spaces-blue.svg)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FMoE-LLaVA)\n[![Replicate demo and cloud API](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_7dacf1cc5d87.png)](https:\u002F\u002Freplicate.com\u002Fcamenduru\u002Fmoe-llava)\n[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fcamenduru\u002FMoE-LLaVA-jupyter\u002Fblob\u002Fmain\u002FMoE_LLaVA_jupyter.ipynb)\n[![hf_space](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Paper%20In%20HF-red.svg)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2401.15947)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2401.15947-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.15947) \n[![youtube](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-YouTube-000000?logo=youtube&logoColor=FF0000)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=uYb38g-weEY)\n[![jiqizhixin](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-WeChat@机器之心-000000?logo=wechat&logoColor=07C160)](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FICylR6n2LhqQRS0CAHFI1A)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-yellow)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fblob\u002Fmain\u002FLICENSE) \n[![GitHub issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002FPKU-YuanGroup\u002FMoE-LLaVA?color=critical&label=Issues)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fissues?q=is%3Aopen+is%3Aissue)\n[![GitHub closed issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-closed\u002FPKU-YuanGroup\u002FMoE-LLaVA?color=success&label=Issues)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fissues?q=is%3Aissue+is%3Aclosed)  \u003Cbr>\n\u003C\u002Fh5>\n\n\u003Cdetails open>\u003Csummary>💡 I also have other vision-language projects that may interest you ✨. 
\u003C\u002Fsummary>\u003Cp>\n\u003C!--  may -->\n\n> [**Open-Sora Plan: Open-Source Large Video Generation Model**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00131) \u003Cbr>\n> Bin Lin and Yunyang Ge and Xinhua Cheng and Zongjian Li and Bin Zhu and Shaodong Wang and Xianyi He and Yang Ye and Shenghai Yuan and Liuhan Chen and Tanghui Jia and Junwu Zhang and Zhenyu Tang and Yatian Pang and Bin She and Cen Yan and Zhiheng Hu and Xiaoyi Dong and Lin Chen and Zhang Pan and Xing Zhou and Shaoling Dong and Yonghong Tian and Li Yuan \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOpen-Sora-Plan)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FOpen-Sora-Plan.svg?style=social)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOpen-Sora-Plan) [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2412.00131-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00131) \u003Cbr>\n\n> [**Video-LLaVA: Learning United Visual Representation by Alignment Before Projection**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10122) \u003Cbr>\n> Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FVideo-LLaVA.svg?style=social)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA) [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2311.10122-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10122) \u003Cbr>\n\n> [**LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01852) \u003Cbr>\n> Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FLanguageBind)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FLanguageBind.svg?style=social)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FLanguageBind)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2310.01852-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01852) \u003Cbr>\n\n\u003C\u002Fp>\u003C\u002Fdetails>\n\n\n## 📣 News\n* **[2025.07.15]**  🔥🔥🔥 Our MoE-LLaVA has been accepted at IEEE Transactions on Multimedia (TMM)!\n* **[2024.03.16]**  🎉 We release all stage2 models, checking our [model zoo](#-model-zoo).\n* **[2024.02.03]**  🎉 We release a stronger [MoE-LLaVA-StableLM](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-1.8B-4e-384). 
The average performance is close to LLaVA-1.5-7B by using **2.0B** sparse activated parameters, checking our [model zoo](#-model-zoo).\n* **[2024.02.02]**  🤝 Enjoy the [![Replicate demo and cloud API](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_7dacf1cc5d87.png)](https:\u002F\u002Freplicate.com\u002Fcamenduru\u002Fmoe-llava) and [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fcamenduru\u002FMoE-LLaVA-jupyter\u002Fblob\u002Fmain\u002FMoE_LLaVA_jupyter.ipynb), created by [@camenduru](https:\u002F\u002Fgithub.com\u002Fcamenduru), who generously supports our research!\n* **[2024.02.01]**  🔥 People who cannot access HF can now download the model through \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_375e99f5914e.png\" width=\"20px\" style=\"max-width: 100%;\"> ModelScope, checking our [model zoo](#-model-zoo).\n* **[2024.01.30]**  🔥 We release a stronger [MoE-LLaVA-Phi2](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e-384). The average performance **surpasses LLaVA-1.5-7B by using 3.6B** sparse activated parameters, checking our [model zoo](#-model-zoo).\n* **[2024.01.27]**  🤗 [Hugging Face demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FMoE-LLaVA) and **all codes & datasets** are available now! Welcome to **watch** 👀 this repository for the latest updates.\n\n## 😮 Highlights\n\nMoE-LLaVA shows excellent performance in multi-modal learning.\n\n### 🔥 High performance, but with fewer parameters\n- With just **3B sparsely activated parameters**, MoE-LLaVA demonstrates performance comparable to LLaVA-1.5-7B on various visual understanding datasets and even surpasses LLaVA-1.5-13B in object hallucination benchmarks.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_f8a956258d35.jpg\" width=55%>\n\u003C\u002Fp>\n\n### 🚀 Simple baseline, learning multi-modal interactions with sparse pathways.\n- With the addition of **a simple MoE tuning stage**, we can complete the training of MoE-LLaVA on **8 A100 GPUs** within 1 day.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_b1757404cf1b.jpg\" width=65%>\n\u003C\u002Fp>\n\n## 🤗 Demo\n\n### Gradio Web UI  \u003Ca href='https:\u002F\u002Fgithub.com\u002Fgradio-app\u002Fgradio'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgradio-app\u002Fgradio'>\u003C\u002Fa> \n\nWe highly recommend trying out our web demo with the following command, which incorporates all features currently supported by MoE-LLaVA. 
We also provide [online demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FMoE-LLaVA) in Huggingface Spaces.\n```bash\n# use phi2\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fgradio_web_server.py --model-path \"LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e\" \n# use qwen\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fgradio_web_server.py --model-path \"LanguageBind\u002FMoE-LLaVA-Qwen-1.8B-4e\" \n# use stablelm\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fgradio_web_server.py --model-path \"LanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e\" \n```\n\n\n\nhttps:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fassets\u002F62638829\u002F8541aac6-9ef6-4fde-aa94-80d0375b9bdb\n\n\n\n### CLI Inference\n\n```bash\n# use phi2\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fcli.py --model-path \"LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e\"  --image-file \"image.jpg\"\n# use qwen\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fcli.py --model-path \"LanguageBind\u002FMoE-LLaVA-Qwen-1.8B-4e\"  --image-file \"image.jpg\"\n# use stablelm\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fcli.py --model-path \"LanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e\"  --image-file \"image.jpg\"\n```\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_9ad64fafcad4.gif\" \u002F>\n\n## 🐳 Model Zoo\n\n| Model | Activated Param | Transformers(HF) | ModelScope(HF) | Avg | VQAv2 | GQA | VizWiz | SQA-IMG | T-VQA | POPE | MME | MM-Bench | MM-Vet |\n|----------|-----------|-----------|---|---|---|---|---|---|---|---|---|---|---|\n| MoE-LLaVA-1.6B×4-Top2 | 2.0B | [🤗LanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e) | [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_375e99f5914e.png\" width=\"20px\" style=\"max-width: 100%;\">PKU-YuanLab\u002FMoE-LLaVA-StableLM-1.6B-4e](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FPKU-YuanLab\u002FMoE-LLaVA-StableLM-1.6B-4e) | 57.3 | 76.7 | 60.3 | 36.2 | 62.6 | 50.1 | 85.7 | 1318.1 | 60.2 | 26.9 |\n| MoE-LLaVA-1.8B×4-Top2 | 2.2B | [🤗LanguageBind\u002FMoE-LLaVA-Qwen-1.8B-4e](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Qwen-1.8B-4e) | [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_375e99f5914e.png\" width=\"20px\" style=\"max-width: 100%;\">PKU-YuanLab\u002FMoE-LLaVA-Qwen-1.8B-4e](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FPKU-YuanLab\u002FMoE-LLaVA-Qwen-1.8B-4e) | 56.7 | 76.2 | 61.5 | 32.6 | 63.1 | 48.0 | 87.0 | 1291.6 | 59.6 | 25.3 |\n| MoE-LLaVA-2.7B×4-Top2 | 3.6B | [🤗LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e) | [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_375e99f5914e.png\" width=\"20px\" style=\"max-width: 100%;\">PKU-YuanLab\u002FMoE-LLaVA-Phi2-2.7B-4e](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FPKU-YuanLab\u002FMoE-LLaVA-Phi2-2.7B-4e) | 61.1 | 77.6 | 61.4 | 43.9 | 68.5 | 51.4 | 86.3 | 1423.0 | 65.2 | 34.3 |\n| MoE-LLaVA-1.6B×4-Top2-384 | 2.0B | [🤗LanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e-384](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e-384) | [\u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_375e99f5914e.png\" width=\"20px\" style=\"max-width: 100%;\">PKU-YuanLab\u002FMoE-LLaVA-StableLM-1.6B-4e-384](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FPKU-YuanLab\u002FMoE-LLaVA-StableLM-1.6B-4e-384) | 60.0 | 78.6 | 61.5 | 40.5 | 63.9 | 54.3 | 85.9 | 1335.7 | 63.3 | 32.3 |\n| MoE-LLaVA-2.7B×4-Top2-384 | 3.6B | [🤗LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e-384](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e-384) | [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_375e99f5914e.png\" width=\"20px\" style=\"max-width: 100%;\">PKU-YuanLab\u002FMoE-LLaVA-Phi2-2.7B-4e-384](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FPKU-YuanLab\u002FMoE-LLaVA-Phi2-2.7B-4e-384) | **62.9** | 79.9 | 62.6 | 43.7 | 70.3 | 57.0 | 85.7 | 1431.3 | 68.0 | 35.9 |\n| LLaVA-1.5 | 7B | [🤗liuhaotian\u002Fllava-v1.5-7b](https:\u002F\u002Fhuggingface.co\u002Fliuhaotian\u002Fllava-v1.5-7b) | - | 62.0 | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 30.5 |\n\n\u003C!--\n| LLaVA-1.5 | 13B | [liuhaotian\u002Fllava-v1.5-13b](https:\u002F\u002Fhuggingface.co\u002Fliuhaotian\u002Fllava-v1.5-13b) | 64.9 | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 35.4 |\n-->\n\n\u003Cdetails>\n\n\n🚨 **Please know https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fissues\u002F27.**\n\n\n\u003Csummary>Stage2 Model\u003C\u002Fsummary>\n\n\n    \n| Model  | Checkpoint |\n|----------|-----------|\n| MoE-LLaVA-1.6B×4-Top2 | [LanguageBind\u002FMoE-LLaVA-StableLM-Stage2](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-Stage2) |\n| MoE-LLaVA-1.6B×4-Top2-384 | [LanguageBind\u002FMoE-LLaVA-StableLM-Stage2-384](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-Stage2-384) |\n| MoE-LLaVA-1.8B×4-Top2 | [LanguageBind\u002FMoE-LLaVA-Qwen-Stage2](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Qwen-Stage2) |\n| MoE-LLaVA-2.7B×4-Top2 | [LanguageBind\u002FMoE-LLaVA-Phi2-Stage2](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-Stage2) |\n| MoE-LLaVA-2.7B×4-Top2-384 | [LanguageBind\u002FMoE-LLaVA-Phi2-Stage2-384](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-Stage2-384) |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Pretrain Model\u003C\u002Fsummary>\n\n| Model  | Checkpoint |\n|----------|-----------|\n| MoE-LLaVA-1.6B×4-Top2 | [LanguageBind\u002FMoE-LLaVA-StableLM-Pretrain](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-Pretrain) |\n| MoE-LLaVA-1.6B×4-Top2-384 | [LanguageBind\u002FMoE-LLaVA-StableLM-384-Pretrain](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-384-Pretrain) |\n| MoE-LLaVA-1.8B×4-Top2 | [LanguageBind\u002FMoE-LLaVA-Qwen-Pretrain](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Qwen-Pretrain) |\n| MoE-LLaVA-2.7B×4-Top2 | [LanguageBind\u002FMoE-LLaVA-Phi2-Pretrain](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-Pretrain) |\n| MoE-LLaVA-2.7B×4-Top2-384 | [LanguageBind\u002FMoE-LLaVA-Phi2-384-Pretrain](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-384-Pretrain) |\n\n\n\u003C\u002Fdetails>\n\n## ⚙️ Requirements and Installation\nWe recommend the requirements as follows.\n* Python == 3.10\n* Pytorch == 2.0.1\n* CUDA Version >= 11.7\n* **Transformers == 4.37.0**\n* **Tokenizers==0.15.1**\n* 
Install required packages:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\ncd MoE-LLaVA\nconda create -n moellava python=3.10 -y\nconda activate moellava\npip install --upgrade pip  # enable PEP 660 support\npip install -e .\npip install -e \".[train]\"\npip install flash-attn --no-build-isolation\n\n# Below are optional. For Qwen model.\ngit clone https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\ncd flash-attention && pip install .\n# Below are optional. Installing them might be slow.\n# pip install csrc\u002Flayer_norm\n# If the version of flash-attn is higher than 2.1.1, the following is not needed.\n# pip install csrc\u002Frotary\n```\n\n> [!Warning]\n> \u003Cdiv align=\"left\">\n> \u003Cb>\n> 🚨 We find that using flash attention2 makes performance degradation.\n> \u003C\u002Fb>\n> \u003C\u002Fdiv>\n\n## 🗝️ Training & Validating\nThe training & validating instruction is in [TRAIN.md](docs\u002FTRAIN.md) & [EVAL.md](docs\u002FEVAL.md).\n\n## 💡 Customizing your MoE-LLaVA\nThe instruction is in [CUSTOM.md](docs\u002FCUSTOM.md).\n\n## 😍 Visualization\nThe instruction is in [VISUALIZATION.md](docs\u002FVISUALIZATION.md).\n\n## 🤖 API\n**We open source all codes.** If you want to load the model (e.g. ```LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e```) on local, you can use the following code snippets.\n\n**Using the following command to run the code.**\n\n```bash\ndeepspeed --include localhost:0 predict.py\n```\n\n```python\nimport torch\nfrom PIL import Image\nfrom moellava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN\nfrom moellava.conversation import conv_templates, SeparatorStyle\nfrom moellava.model.builder import load_pretrained_model\nfrom moellava.utils import disable_torch_init\nfrom moellava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria\n\ndef main():\n    disable_torch_init()\n    image = 'moellava\u002Fserve\u002Fexamples\u002Fextreme_ironing.jpg'\n    inp = 'What is unusual about this image?'\n    model_path = 'LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e'  # LanguageBind\u002FMoE-LLaVA-Qwen-1.8B-4e or LanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e\n    device = 'cuda'\n    load_4bit, load_8bit = False, False  # FIXME: Deepspeed support 4bit or 8bit?\n    model_name = get_model_name_from_path(model_path)\n    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device)\n    image_processor = processor['image']\n    conv_mode = \"phi\"  # qwen or stablelm\n    conv = conv_templates[conv_mode].copy()\n    roles = conv.roles\n    image_tensor = image_processor.preprocess(Image.open(image).convert('RGB'), return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)\n\n    print(f\"{roles[1]}: {inp}\")\n    inp = DEFAULT_IMAGE_TOKEN + '\\n' + inp\n    conv.append_message(conv.roles[0], inp)\n    conv.append_message(conv.roles[1], None)\n    prompt = conv.get_prompt()\n    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()\n    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2\n    keywords = [stop_str]\n    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)\n\n    with torch.inference_mode():\n        output_ids = model.generate(\n            input_ids,\n            images=image_tensor,\n            do_sample=True,\n            temperature=0.2,\n            max_new_tokens=1024,\n            
use_cache=True,\n            stopping_criteria=[stopping_criteria])\n\n    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()\n    print(outputs)\n\nif __name__ == '__main__':\n    main()\n```\n\n## 🙌 Related Projects\n* [Video-LLaVA](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA) This framework empowers the model to efficiently utilize the united visual tokens.\n* [LanguageBind](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FLanguageBind) An open source five modalities language-based retrieval framework.\n\n## 👍 Acknowledgement\n* [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) The codebase we built upon and it is an efficient large language and vision assistant.\n\n## 🔒 License\n* The majority of this project is released under the Apache 2.0 license as found in the [LICENSE](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fblob\u002Fmain\u002FLICENSE) file.\n* The service is a research preview intended for non-commercial use only, subject to the model [License](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fllama\u002Fblob\u002Fmain\u002FMODEL_CARD.md) of LLaMA, [Terms of Use](https:\u002F\u002Fopenai.com\u002Fpolicies\u002Fterms-of-use) of the data generated by OpenAI, and [Privacy Practices](https:\u002F\u002Fchrome.google.com\u002Fwebstore\u002Fdetail\u002Fsharegpt-share-your-chatg\u002Fdaiacboceoaocpibfodeljbdfacokfjb) of ShareGPT. Please contact us if you find any potential violation.\n\n\n\n## ✏️ Citation\nIf you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil:.\n\n```BibTeX\n@article{lin2024moe,\n  title={MoE-LLaVA: Mixture of Experts for Large Vision-Language Models},\n  author={Lin, Bin and Tang, Zhenyu and Ye, Yang and Cui, Jiaxi and Zhu, Bin and Jin, Peng and Zhang, Junwu and Ning, Munan and Yuan, Li},\n  journal={arXiv preprint arXiv:2401.15947},\n  year={2024}\n}\n```\n\n```BibTeX\n@article{lin2023video,\n  title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},\n  author={Lin, Bin and Zhu, Bin and Ye, Yang and Ning, Munan and Jin, Peng and Yuan, Li},\n  journal={arXiv preprint arXiv:2311.10122},\n  year={2023}\n}\n```\n\n\n\n## ✨ Star History\n[![Star History](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_d0fd5c820d1d.png)](https:\u002F\u002Fstar-history.com\u002F#PKU-YuanGroup\u002FMoE-LLaVA&Date)\n\n\n## 🤝 Contributors\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_130ad8f89176.png\" \u002F>\n\u003C\u002Fa>\n","\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_2b0bdf10f458.png\" width=\"250\" style=\"margin-bottom: 0.2;\"\u002F>\n\u003Cp>\n\u003Ch2 align=\"center\"> \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.15947\">MoE-LLaVA：面向大型视觉-语言模型的专家混合模型\u003C\u002Fa>\u003C\u002Fh2>\n\u003Ch5 align=\"center\"> 如果您喜欢我们的项目，请在 GitHub 上为我们点亮星标 ⭐，以获取最新更新。  \u003C\u002Fh2>\n\n\u003Ch5 align=\"center\">\n    \n\n\n[![hf_space](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Open%20In%20Spaces-blue.svg)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FMoE-LLaVA)\n[![Replicate 演示及云端 
API](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_7dacf1cc5d87.png)](https:\u002F\u002Freplicate.com\u002Fcamenduru\u002Fmoe-llava)\n[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fcamenduru\u002FMoE-LLaVA-jupyter\u002Fblob\u002Fmain\u002FMoE_LLaVA_jupyter.ipynb)\n[![hf_space](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-论文在 HF 上-red.svg)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2401.15947)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2401.15947-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.15947) \n[![YouTube](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-YouTube-000000?logo=youtube&logoColor=FF0000)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=uYb38g-weEY)\n[![机器之心](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-微信@机器之心-000000?logo=wechat&logoColor=07C160)](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FICylR6n2LhqQRS0CAHFI1A)\n[![许可证](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F许可证-Apache%202.0-yellow)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fblob\u002Fmain\u002FLICENSE) \n[![GitHub 问题](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002FPKU-YuanGroup\u002FMoE-LLaVA?color=critical&label=Issues)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fissues?q=is%3Aopen+is%3Aissue)\n[![GitHub 已关闭的问题](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-closed\u002FPKU-YuanGroup\u002FMoE-LLaVA?color=success&label=Issues)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fissues?q=is%3Aissue+is%3Aclosed)  \u003Cbr>\n\u003C\u002Fh5>\n\n\u003Cdetails open>\u003Csummary>💡 我还有其他可能让您感兴趣的视觉-语言项目 ✨。 \u003C\u002Fsummary>\u003Cp>\n\u003C!--  may -->\n\n> [**Open-Sora 计划：开源大型视频生成模型**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00131) \u003Cbr>\n> 林斌、葛云阳、程新华、李宗健、朱彬、王绍东、何先义、叶洋、袁圣海、陈刘汉、贾唐辉、张俊武、唐振宇、庞雅婷、佘彬、严岑、胡志恒、董晓艺、陈林、潘章、周兴、董绍玲、田永宏、袁力 \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOpen-Sora-Plan)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FOpen-Sora-Plan.svg?style=social)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOpen-Sora-Plan) [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2412.00131-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00131) \u003Cbr>\n\n> [**Video-LLaVA：通过对齐后再投影学习统一的视觉表征**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10122) \u003Cbr>\n> 林斌、叶洋、朱彬、崔家熙、宁慕楠、金鹏、袁力 \u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FVideo-LLaVA.svg?style=social)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA) [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2311.10122-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10122) \u003Cbr>\n\n> [**LanguageBind：通过基于语言的语义对齐将视频-语言预训练扩展至 N 模态**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01852) \u003Cbr>\n> 朱彬、林斌、宁慕楠、杨燕、崔家熙、王洪法、庞雅婷、蒋文浩、张俊武、李宗伟、张万才、李志峰、刘伟、袁力 
\u003Cbr>\n[![github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Github-black?logo=github)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FLanguageBind)  [![github](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FLanguageBind.svg?style=social)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FLanguageBind)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2310.01852-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01852) \u003Cbr>\n\n\u003C\u002Fp>\u003C\u002Fdetails>\n\n\n## 📣 新闻\n* **[2025.07.15]**  🔥🔥🔥 我们的 MoE-LLaVA 已被 IEEE 多媒体汇刊（TMM）接收！\n* **[2024.03.16]**  🎉 我们发布了所有 stage2 模型，请查看我们的 [模型库](#-model-zoo)。\n* **[2024.02.03]**  🎉 我们发布了一个更强大的 [MoE-LLaVA-StableLM](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-1.8B-4e-384)。该模型仅使用 **2.0B** 的稀疏激活参数，平均性能就接近 LLaVA-1.5-7B，请查看我们的 [模型库](#-model-zoo)。\n* **[2024.02.02]**  🤝 您可以体验 [![Replicate 演示及云端 API](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_7dacf1cc5d87.png)](https:\u002F\u002Freplicate.com\u002Fcamenduru\u002Fmoe-llava) 和 [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fcamenduru\u002FMoE-LLaVA-jupyter\u002Fblob\u002Fmain\u002FMoE_LLaVA_jupyter.ipynb)，这些内容由 [@camenduru](https:\u002F\u002Fgithub.com\u002Fcamenduru) 制作，他慷慨地支持我们的研究！\n* **[2024.02.01]**  🔥 对于无法访问 Hugging Face 的用户，现在可以通过 \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_375e99f5914e.png\" width=\"20px\" style=\"max-width: 100%;\"> ModelScope 下载模型，请查看我们的 [模型库](#-model-zoo)。\n* **[2024.01.30]**  🔥 我们发布了一个更强大的 [MoE-LLaVA-Phi2](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e-384)。该模型仅使用 **3.6B** 的稀疏激活参数，平均性能就 **超越了 LLaVA-1.5-7B**，请查看我们的 [模型库](#-model-zoo)。\n* **[2024.01.27]**  🤗 [Hugging Face 演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FMoE-LLaVA)以及 **所有代码和数据集** 现已可用！欢迎 **关注** 👀 此仓库，以获取最新动态。\n\n## 😮 亮点\n\nMoE-LLaVA 在多模态学习方面表现出色。\n\n### 🔥 高性能，但参数量更少\n- 仅需 **3B 的稀疏激活参数**，MoE-LLaVA 就能在多种视觉理解数据集上达到与 LLaVA-1.5-7B 相当的性能，甚至在物体幻觉基准测试中超越了 LLaVA-1.5-13B。\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_f8a956258d35.jpg\" width=55%>\n\u003C\u002Fp>\n\n### 🚀 简单的基线，通过稀疏路径学习多模态交互。\n- 通过增加 **一个简单的 MoE 调优阶段**，我们可以在 **8 张 A100 GPU** 上用不到 1 天的时间完成 MoE-LLaVA 的训练。\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_b1757404cf1b.jpg\" width=65%>\n\u003C\u002Fp>\n\n## 🤗 演示\n\n### Gradio Web UI  \u003Ca href='https:\u002F\u002Fgithub.com\u002Fgradio-app\u002Fgradio'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgradio-app\u002Fgradio'>\u003C\u002Fa> \n\n强烈推荐您通过以下命令试用我们的网页演示，它包含了 MoE-LLaVA 目前支持的所有功能。我们还在 Huggingface Spaces 中提供了 [在线演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FMoE-LLaVA)。\n```bash\n# 使用 phi2\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fgradio_web_server.py --model-path \"LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e\"\n\n# 使用Qwen\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fgradio_web_server.py --model-path \"LanguageBind\u002FMoE-LLaVA-Qwen-1.8B-4e\" \n# 使用StableLM\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fgradio_web_server.py --model-path 
\"LanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e\" \n```\n\n\n\nhttps:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fassets\u002F62638829\u002F8541aac6-9ef6-4fde-aa94-80d0375b9bdb\n\n\n\n### 命令行推理\n\n```bash\n# 使用Phi2\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fcli.py --model-path \"LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e\"  --image-file \"image.jpg\"\n# 使用Qwen\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fcli.py --model-path \"LanguageBind\u002FMoE-LLaVA-Qwen-1.8B-4e\"  --image-file \"image.jpg\"\n# 使用StableLM\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fcli.py --model-path \"LanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e\"  --image-file \"image.jpg\"\n```\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_9ad64fafcad4.gif\" \u002F>\n\n## 🐳 模型库\n\n| 模型 | 激活参数 | Transformers(HF) | ModelScope(HF) | 平均 | VQAv2 | GQA | VizWiz | SQA-IMG | T-VQA | POPE | MME | MM-Bench | MM-Vet |\n|----------|-----------|-----------|---|---|---|---|---|---|---|---|---|---|---|\n| MoE-LLaVA-1.6B×4-Top2 | 2.0B | [🤗LanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e) | [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_375e99f5914e.png\" width=\"20px\" style=\"max-width: 100%;\">PKU-YuanLab\u002FMoE-LLaVA-StableLM-1.6B-4e](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FPKU-YuanLab\u002FMoE-LLaVA-StableLM-1.6B-4e) | 57.3 | 76.7 | 60.3 | 36.2 | 62.6 | 50.1 | 85.7 | 1318.1 | 60.2 | 26.9 |\n| MoE-LLaVA-1.8B×4-Top2 | 2.2B | [🤗LanguageBind\u002FMoE-LLaVA-Qwen-1.8B-4e](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Qwen-1.8B-4e) | [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_375e99f5914e.png\" width=\"20px\" style=\"max-width: 100%;\">PKU-YuanLab\u002FMoE-LLaVA-Qwen-1.8B-4e](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FPKU-YuanLab\u002FMoE-LLaVA-Qwen-1.8B-4e) | 56.7 | 76.2 | 61.5 | 32.6 | 63.1 | 48.0 | 87.0 | 1291.6 | 59.6 | 25.3 |\n| MoE-LLaVA-2.7B×4-Top2 | 3.6B | [🤗LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e) | [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_375e99f5914e.png\" width=\"20px\" style=\"max-width: 100%;\">PKU-YuanLab\u002FMoE-LLaVA-Phi2-2.7B-4e](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FPKU-YuanLab\u002FMoE-LLaVA-Phi2-2.7B-4e) | 61.1 | 77.6 | 61.4 | 43.9 | 68.5 | 51.4 | 86.3 | 1423.0 | 65.2 | 34.3 |\n| MoE-LLaVA-1.6B×4-Top2-384 | 2.0B | [🤗LanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e-384](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e-384) | [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_375e99f5914e.png\" width=\"20px\" style=\"max-width: 100%;\">PKU-YuanLab\u002FMoE-LLaVA-StableLM-1.6B-4e-384](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FPKU-YuanLab\u002FMoE-LLaVA-StableLM-1.6B-4e-384) | 60.0 | 78.6 | 61.5 | 40.5 | 63.9 | 54.3 | 85.9 | 1335.7 | 63.3 | 32.3 |\n| MoE-LLaVA-2.7B×4-Top2-384 | 3.6B | [🤗LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e-384](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e-384) | [\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_375e99f5914e.png\" width=\"20px\" 
style=\"max-width: 100%;\">PKU-YuanLab\u002FMoE-LLaVA-Phi2-2.7B-4e-384](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FPKU-YuanLab\u002FMoE-LLaVA-Phi2-2.7B-4e-384) | **62.9** | 79.9 | 62.6 | 43.7 | 70.3 | 57.0 | 85.7 | 1431.3 | 68.0 | 35.9 |\n| LLaVA-1.5 | 7B | [🤗liuhaotian\u002Fllava-v1.5-7b](https:\u002F\u002Fhuggingface.co\u002Fliuhaotian\u002Fllava-v1.5-7b) | - | 62.0 | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 30.5 |\n\n\u003C!--\n| LLaVA-1.5 | 13B | [liuhaotian\u002Fllava-v1.5-13b](https:\u002F\u002Fhuggingface.co\u002Fliuhaotian\u002Fllava-v1.5-13b) | 64.9 | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 35.4 |\n-->\n\n\u003Cdetails>\n\n\n🚨 **请注意 https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fissues\u002F27。**\n\n\n\u003Csummary>Stage2模型\u003C\u002Fsummary>\n\n\n    \n| 模型  | 检查点 |\n|----------|-----------|\n| MoE-LLaVA-1.6B×4-Top2 | [LanguageBind\u002FMoE-LLaVA-StableLM-Stage2](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-Stage2) |\n| MoE-LLaVA-1.6B×4-Top2-384 | [LanguageBind\u002FMoE-LLaVA-StableLM-Stage2-384](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-Stage2-384) |\n| MoE-LLaVA-1.8B×4-Top2 | [LanguageBind\u002FMoE-LLaVA-Qwen-Stage2](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Qwen-Stage2) |\n| MoE-LLaVA-2.7B×4-Top2 | [LanguageBind\u002FMoE-LLaVA-Phi2-Stage2](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-Stage2) |\n| MoE-LLaVA-2.7B×4-Top2-384 | [LanguageBind\u002FMoE-LLaVA-Phi2-Stage2-384](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-Stage2-384) |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>预训练模型\u003C\u002Fsummary>\n\n| 模型  | 检查点 |\n|----------|-----------|\n| MoE-LLaVA-1.6B×4-Top2 | [LanguageBind\u002FMoE-LLaVA-StableLM-Pretrain](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-Pretrain) |\n| MoE-LLaVA-1.6B×4-Top2-384 | [LanguageBind\u002FMoE-LLaVA-StableLM-384-Pretrain](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-StableLM-384-Pretrain) |\n| MoE-LLaVA-1.8B×4-Top2 | [LanguageBind\u002FMoE-LLaVA-Qwen-Pretrain](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Qwen-Pretrain) |\n| MoE-LLaVA-2.7B×4-Top2 | [LanguageBind\u002FMoE-LLaVA-Phi2-Pretrain](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-Pretrain) |\n| MoE-LLaVA-2.7B×4-Top2-384 | [LanguageBind\u002FMoE-LLaVA-Phi2-384-Pretrain](https:\u002F\u002Fhuggingface.co\u002FLanguageBind\u002FMoE-LLaVA-Phi2-384-Pretrain) |\n\n\n\u003C\u002Fdetails>\n\n## ⚙️ 要求与安装\n我们推荐以下要求：\n* Python == 3.10\n* Pytorch == 2.0.1\n* CUDA版本 >= 11.7\n* **Transformers == 4.37.0**\n* **Tokenizers==0.15.1**\n* 安装所需包：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\ncd MoE-LLaVA\nconda create -n moellava python=3.10 -y\nconda activate moellava\npip install --upgrade pip  # 启用PEP 660支持\npip install -e .\npip install -e \".[train]\"\npip install flash-attn --no-build-isolation\n\n# 以下是可选的。对于Qwen模型。\ngit clone https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\ncd flash-attention && pip install .\n# 以下是可选的。安装它们可能会比较慢。\n# pip install csrc\u002Flayer_norm\n# 如果flash-attn版本高于2.1.1，则不需要以下操作。\n# pip install csrc\u002Frotary\n```\n\n> [!警告]\n> \u003Cdiv align=\"left\">\n> \u003Cb>\n> 🚨 我们发现使用flash attention2会导致性能下降。\n> \u003C\u002Fb>\n> \u003C\u002Fdiv>\n\n## 🗝️ 训练与验证\n训练与验证说明请参阅 [TRAIN.md](docs\u002FTRAIN.md) 和 
[EVAL.md](docs\u002FEVAL.md)。\n\n## 💡 自定义你的MoE-LLaVA\n自定义说明请参阅 [CUSTOM.md](docs\u002FCUSTOM.md)。\n\n## 😍 可视化\n可视化说明请参阅 [VISUALIZATION.md](docs\u002FVISUALIZATION.md)。\n\n## 🤖 API\n**我们已将所有代码开源。** 如果您想在本地加载该模型（例如 ```LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e```），可以使用以下代码片段。\n\n**使用以下命令运行代码。**\n\n```bash\ndeepspeed --include localhost:0 predict.py\n```\n\n```python\nimport torch\nfrom PIL import Image\nfrom moellava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN\nfrom moellava.conversation import conv_templates, SeparatorStyle\nfrom moellava.model.builder import load_pretrained_model\nfrom moellava.utils import disable_torch_init\nfrom moellava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria\n\ndef main():\n    disable_torch_init()\n    image = 'moellava\u002Fserve\u002Fexamples\u002Fextreme_ironing.jpg'\n    inp = 'What is unusual about this image?'\n    model_path = 'LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e'  # LanguageBind\u002FMoE-LLaVA-Qwen-1.8B-4e 或 LanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e\n    device = 'cuda'\n    load_4bit, load_8bit = False, False  # FIXME: Deepspeed 是否支持 4bit 或 8bit？\n    model_name = get_model_name_from_path(model_path)\n    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device)\n    image_processor = processor['image']\n    conv_mode = \"phi\"  # qwen 或 stablelm\n    conv = conv_templates[conv_mode].copy()\n    roles = conv.roles\n    image_tensor = image_processor.preprocess(Image.open(image).convert('RGB'), return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)\n\n    print(f\"{roles[1]}: {inp}\")\n    inp = DEFAULT_IMAGE_TOKEN + '\\n' + inp\n    conv.append_message(conv.roles[0], inp)\n    conv.append_message(conv.roles[1], None)\n    prompt = conv.get_prompt()\n    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()\n    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2\n    keywords = [stop_str]\n    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)\n\n    with torch.inference_mode():\n        output_ids = model.generate(\n            input_ids,\n            images=image_tensor,\n            do_sample=True,\n            temperature=0.2,\n            max_new_tokens=1024,\n            use_cache=True,\n            stopping_criteria=[stopping_criteria])\n\n    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()\n    print(outputs)\n\nif __name__ == '__main__':\n    main()\n```\n\n## 🙌 相关项目\n* [Video-LLaVA](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA) 该框架使模型能够高效利用统一的视觉标记。\n* [LanguageBind](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FLanguageBind) 一个开源的五模态语言检索框架。\n\n## 👍 致谢\n* [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) 我们在此基础上构建了代码库，它是一个高效的大规模语言和视觉助手。\n\n## 🔒 许可证\n* 本项目的大部分内容根据 [LICENSE](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fblob\u002Fmain\u002FLICENSE) 文件中的 Apache 2.0 许可证发布。\n* 该服务为研究预览版，仅供非商业用途，受 LLaMA 模型的 [许可证](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fllama\u002Fblob\u002Fmain\u002FMODEL_CARD.md)、由 OpenAI 生成的数据的 [使用条款](https:\u002F\u002Fopenai.com\u002Fpolicies\u002Fterms-of-use)以及 ShareGPT 的 [隐私政策](https:\u002F\u002Fchrome.google.com\u002Fwebstore\u002Fdetail\u002Fsharegpt-share-your-chatg\u002Fdaiacboceoaocpibfodeljbdfacokfjb)约束。如发现任何潜在违规行为，请与我们联系。\n\n\n\n## ✏️ 
引用\n如果您在研究中发现我们的论文和代码有用，请考虑给个赞 :star: 和引用 :pencil:。\n\n```BibTeX\n@article{lin2024moe,\n  title={MoE-LLaVA: Mixture of Experts for Large Vision-Language Models},\n  author={Lin, Bin and Tang, Zhenyu and Ye, Yang and Cui, Jiaxi and Zhu, Bin and Jin, Peng and Zhang, Junwu and Ning, Munan and Yuan, Li},\n  journal={arXiv preprint arXiv:2401.15947},\n  year={2024}\n}\n```\n\n```BibTeX\n@article{lin2023video,\n  title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},\n  author={Lin, Bin and Zhu, Bin and Ye, Yang and Ning, Munan and Jin, Peng and Yuan, Li},\n  journal={arXiv preprint arXiv:2311.10122},\n  year={2023}\n}\n```\n\n\n\n## ✨ 星标历史\n[![星标历史](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_d0fd5c820d1d.png)](https:\u002F\u002Fstar-history.com\u002F#PKU-YuanGroup\u002FMoE-LLaVA&Date)\n\n\n## 🤝 贡献者\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_readme_130ad8f89176.png\" \u002F>\n\u003C\u002Fa>","# MoE-LLaVA 快速上手指南\n\nMoE-LLaVA 是一个基于混合专家（Mixture of Experts, MoE）架构的大型视觉 - 语言模型。它在仅激活少量参数（如 3B）的情况下，即可在多项视觉理解任务中达到甚至超越 LLaVA-1.5-7B\u002F13B 的性能。\n\n## 1. 环境准备\n\n### 系统要求\n- **GPU**: 推荐 NVIDIA A100 或同等算力显卡（推理可根据模型大小调整，训练建议 8x A100）。\n- **CUDA**: 11.8 或更高版本。\n- **Python**: 3.10 或更高版本。\n\n### 前置依赖\n确保已安装 `git`、`conda` 或 `venv`。本项目深度依赖 `deepspeed` 进行推理和训练。\n\n## 2. 安装步骤\n\n### 步骤一：克隆项目\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA.git\ncd MoE-LLaVA\n```\n\n### 步骤二：创建虚拟环境并安装依赖\n推荐使用 Conda 管理环境：\n\n```bash\nconda create -n moellava python=3.10 -y\nconda activate moellava\n\n# 安装 PyTorch (根据实际 CUDA 版本调整，此处以 11.8 为例)\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n\n# 安装项目核心依赖\npip install -e .\n```\n\n> **国内加速提示**：如果下载缓慢，可使用清华或阿里镜像源：\n> ```bash\n> pip install -e . -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 3. 
基本使用\n\nMoE-LLaVA 提供了多种预训练模型（基于 Phi2, Qwen, StableLM 等）。以下示例以性能较强的 **Phi2** 版本为例。\n\n### 方式一：Gradio Web UI（推荐）\n启动本地网页界面，支持上传图片并进行多轮对话。\n\n```bash\n# 使用 Phi2 底座模型 (激活参数约 3.6B)\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fgradio_web_server.py --model-path \"LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e\"\n```\n*启动后在浏览器访问显示的本地地址（通常为 `http:\u002F\u002F127.0.0.1:7860`）即可使用。*\n\n其他可用模型路径：\n- Qwen 底座: `\"LanguageBind\u002FMoE-LLaVA-Qwen-1.8B-4e\"`\n- StableLM 底座: `\"LanguageBind\u002FMoE-LLaVA-StableLM-1.6B-4e\"`\n\n### 方式二：命令行推理 (CLI)\n直接在终端对单张图片进行问答。\n\n```bash\ndeepspeed --include localhost:0 moellava\u002Fserve\u002Fcli.py --model-path \"LanguageBind\u002FMoE-LLaVA-Phi2-2.7B-4e\" --image-file \"image.jpg\"\n```\n*请将 `image.jpg` 替换为你本地的图片路径。运行后输入问题即可获取回答。*\n\n### 模型下载提示\n- **Hugging Face**: 模型会自动从 HF 下载。\n- **国内镜像**: 如果无法访问 Hugging Face，项目支持通过 **ModelScope (魔搭)** 下载。\n  - 例如 Phi2 模型地址：`PKU-YuanLab\u002FMoE-LLaVA-Phi2-2.7B-4e`\n  - 下载后可将 `--model-path` 指向本地文件夹路径。","某电商平台的智能客服团队正试图升级系统，使其能直接理解用户上传的商品实拍图并回答复杂的细节咨询。\n\n### 没有 MoE-LLaVA 时\n- **响应延迟高**：传统大视觉语言模型参数量巨大，每次处理图片都需要激活全部计算资源，导致用户提问后需等待数秒才能收到回复，严重影响购物体验。\n- **硬件成本昂贵**：为了维持高并发下的流畅度，公司不得不部署大量高性能 GPU 服务器，运维预算居高不下。\n- **细节识别模糊**：面对用户询问“这件衣服领口的刺绣图案是什么”等细粒度问题时，模型往往只能给出泛泛的描述，无法精准定位图像局部特征。\n- **场景适应性差**：模型难以在“通用闲聊”与“专业商品分析”之间灵活切换，要么过于啰嗦，要么缺乏专业深度。\n\n### 使用 MoE-LLaVA 后\n- **推理速度显著提升**：借助混合专家（MoE）架构，MoE-LLaVA 仅激活部分专家网络处理特定任务，将单次响应时间从秒级降低至毫秒级，实现近乎实时的交互。\n- **算力资源大幅节约**：在保持同等甚至更高智能水平的前提下，显著减少了活跃参数量，使原有硬件集群能支撑更高的并发流量，降低了部署门槛。\n- **细粒度理解更精准**：针对商品细节的提问，MoE-LLaVA 能调动专门的视觉专家模块，准确识别并描述领口刺绣、材质纹理等微小特征，回答专业度大幅提升。\n- **动态任务适配**：模型能根据用户问题类型自动路由到最合适的“专家”子网络，无论是轻松闲聊还是严谨的参数对比，都能给出恰到好处的回复。\n\nMoE-LLaVA 通过“按需调用”专家网络的机制，完美解决了大型视觉语言模型在落地应用中成本高、速度慢与精度难以兼得的核心矛盾。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPKU-YuanGroup_MoE-LLaVA_2b0bdf10.png","PKU-YuanGroup","PKU-YUAN-Lab (袁粒课题组-北大深研院)","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FPKU-YuanGroup_d1788368.jpg","Open codes from YUAN Lab at PKU",null,"postmaster@pku-yuan.com","pku-yuan.com","https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup",[85,89,93,97,101],{"name":86,"color":87,"percentage":88},"Python","#3572A5",94,{"name":90,"color":91,"percentage":92},"Shell","#89e051",4.3,{"name":94,"color":95,"percentage":96},"JavaScript","#f1e05a",0.9,{"name":98,"color":99,"percentage":100},"HTML","#e34c26",0.7,{"name":102,"color":103,"percentage":104},"CSS","#663399",0.2,2312,142,"2026-04-18T18:02:40","Apache-2.0",4,"Linux","必需 NVIDIA GPU。训练需求：8x A100 GPU。推理需求：未明确具体显存大小，但模型参数量在 2.0B-3.6B (稀疏激活) 之间，建议使用支持 CUDA 的中高端显卡。","未说明",{"notes":114,"python":112,"dependencies":115},"该工具主要使用 DeepSpeed 进行分布式训练和推理（命令行示例均包含 deepspeed 指令）。官方提到可在 8 张 A100 GPU 上于 1 天内完成训练。提供多种预训练模型变体（基于 Phi2, Qwen, StableLM），参数量从 1.6B 到 2.7B 不等。支持通过 Hugging Face Spaces、Colab 和 Replicate 在线体验。国内用户可通过 ModelScope（魔搭）下载模型。",[116,117,118,119],"deepspeed","gradio","transformers","torch",[15],[122,123,124,125],"large-vision-language-model","mixture-of-experts","moe","multi-modal","2026-03-27T02:49:30.150509","2026-04-19T09:39:18.230138",[129,134,139,144,149,153],{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},42314,"在评估 Qwen1.5 模型时遇到 'RuntimeError: The size of tensor a must match the size of tensor b' 错误，如何解决？","该错误通常是因为使用了 Qwen1.5 的 Chat 版本（chat version），其不支持右侧填充（padding='right'），而 Flash Attention 需要特定的填充方式。解决方案是改用非 Chat 版本的模型，例如使用 `Qwen\u002FQwen1.5-1.8B` 而不是对话版。可以通过在运行命令中添加 `--save_steps 5` 
来快速检查生成的文件是否符合预期。","https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fissues\u002F39",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},42315,"复现 Stage 2 微调后的模型时，VQAv2 或 TextVQA 的准确率低于论文报告值，可能的原因是什么？","准确率差异可能源于训练时的梯度累积（gradient accumulation）设置不同。维护者建议在复现时增加梯度累积步数。此外，确保完全按照提供的脚本和数据处理流程操作，并可以在 TextVQA 数据集上进行离线评估以验证模型性能。如果可能，对比官方提供的检查点（checkpoint）的评估指标。","https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fissues\u002F27",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},42316,"运行 MoE 微调脚本（finetune_moe.sh）时报错：'AssertionError: The model has moe layers, but None of the param groups are marked as MoE'，如何解决？","此错误表明优化器创建时未正确标记 MoE 参数组。请确保使用项目最新更新的运行命令脚本（如 `scripts\u002Fv1\u002Fqwen\u002Ffinetune_moe.sh`）。在命令行参数中，必须显式启用 MoE 相关配置，例如包含 `--moe_enable True`、`--num_experts`、`--top_k_experts` 以及 `--train_modules` 等参数。如果问题依旧，可能是自定义的优化器函数需要明确将 'moe' 键设置为 True。","https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fissues\u002F17",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},42317,"当前模型对中文 OCR（光学字符识别）的支持能力如何？是否有提升计划？","目前的 LLaVA v1.6 架构对中文字符的 OCR 能力较弱，因为训练数据主要针对英文。为了提升中文能力，建议基于支持中文更好的大语言模型（如 Qwen-7B）重新训练，并遵循 LLaVA-1.6 的数据集格式。社区推荐参考 `InstructDoc` 等项目获取适合文档理解的数据集。开发团队正在尝试替换 LLM 为 Qwen 以增强中文支持。","https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fissues\u002F26",{"id":150,"question_zh":151,"answer_zh":152,"source_url":138},42318,"如何在本地评估微调后的模型在 TextVQA 数据集上的表现？","可以使用项目提供的评估脚本在本地进行离线评估。确保加载正确的微调后检查点（checkpoint），并在运行评估命令时指定 TextVQA 数据集路径。如果在复现过程中发现结果与论文不符，请检查是否增加了梯度累积（More gradient accumulation），这可能会影响最终的收敛效果和准确率。",{"id":154,"question_zh":155,"answer_zh":156,"source_url":133},42319,"集成 Qwen1.5 模型时，生成配置（GenerationConfig）应如何设置以避免解码错误？","在集成代码中（如 `builder.py`），需要手动设置生成配置以适应 Qwen1.5。关键设置包括：将 `eos_token_id` 设为 tokenizer 的 `eos_token_id`；从预训练模型加载 `GenerationConfig` 并设置 `pad_token_id`；建议将 `do_sample` 设为 `False` 以使用贪婪解码（greedy decoding），并将 `repetition_penalty` 设为 `1.0` 以禁用重复惩罚，从而避免潜在的张量尺寸不匹配问题。",[158],{"id":159,"version":160,"summary_zh":161,"released_at":162},334424,"v1.0.0","- 支持更高分辨率的输入，使用 `google\u002Fsiglip-so400m-patch14-384` 作为视觉编码器，以实现更细致的视觉理解。\n- 将 `capacity_factor` 调整为 1.5，以支持更强的 MoE-LLaVA 模型。\n- 新增了 [MME](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FEvaluation) 基准测试结果及 [评估流程](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA\u002Fblob\u002Fmain\u002Fdocs\u002FEVAL.md#mme)。\n- 完善了文档。\n- 修复了拼写错误。\n\n\n_我们希望社区研究者能够关注到：大型视觉-语言模型同样可以进行稀疏化处理，甚至在性能上有所提升。_","2024-02-04T08:53:22"]