[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-salesforce--LAVIS":3,"tool-salesforce--LAVIS":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 
多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":79,"owner_website":81,"owner_url":82,"languages":83,"stars":96,"forks":97,"last_commit_at":98,"license":99,"difficulty_score":10,"env_os":100,"env_gpu":101,"env_ram":100,"env_deps":102,"category_tags":110,"github_topics":111,"view_count":122,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":123,"updated_at":124,"faqs":125,"releases":155},2566,"salesforce\u002FLAVIS","LAVIS","LAVIS - A One-stop Library for Language-Vision Intelligence","LAVIS 是由 Salesforce 开源的一站式语言 - 视觉智能库，旨在为开发者提供统一、高效的工具集，以构建和部署多模态 AI 应用。它主要解决了当前视觉与语言模型种类繁多、接口不一、复现困难的问题，让用户无需从零开始搭建复杂架构，即可轻松调用业界领先的预训练模型。\n\n无论是从事多模态研究的研究人员，还是希望快速集成图像理解、文生图或视频分析功能的开发者，LAVIS 都能提供极大的便利。其核心亮点在于集成了 BLIP-2、InstructBLIP、BLIP-Diffusion 及最新的 X-InstructBLIP 等前沿模型。这些模型不仅支持高质量的零样本图像描述、视觉问答和指令跟随生成，还能在冻结大型语言模型（LLM）的基础上，高效融合图像、视频、音频甚至 3D 数据，大幅降低了跨模态任务的开发门槛。此外，LAVIS 提供了完善的文档、基准测试对比以及丰富的 Jupyter Notebook 示例，帮助用户快速上手并验证想法。如果你正在探索人工智能在“看”与“说”结合领域的潜力，LAVIS 将是一个值得信赖的起点。","\u003Cp align=\"center\">\n    \u003Cbr>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsalesforce_LAVIS_readme_0857ac3abf12.png\" width=\"400\"\u002F>\n    \u003Cbr>\n\u003Cp>\n\n\u003Cdiv align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Freleases\">\u003Cimg alt=\"Latest Release\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Fsalesforce\u002FLAVIS.svg\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002Findex.html\">\n  \u003Cimg alt=\"docs\" src=\"https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Factions\u002Fworkflows\u002Fdocs.yaml\u002Fbadge.svg\"\u002F>\n  \u003Ca href=\"https:\u002F\u002Fopensource.org\u002Flicenses\u002FBSD-3-Clause\">\n  \u003Cimg alt=\"license\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-BSD_3--Clause-blue.svg\"\u002F>\n  \u003C\u002Fa> \n  \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fproject\u002Fsalesforce-lavis\">\n  \u003Cimg alt=\"Downloads\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsalesforce_LAVIS_readme_f617c87ed07f.png\">\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Fbenchmark.html\">Benchmark\u003C\u002Fa>,\n\u003Ca 
href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.09019\">Technical Report\u003C\u002Fa>,\n\u003Ca href=\"https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Findex.html\">Documentation\u003C\u002Fa>,\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fexamples\">Jupyter Notebook Examples\u003C\u002Fa>,\n\u003Ca href=\"https:\u002F\u002Fblog.salesforceairesearch.com\u002Flavis-language-vision-library\u002F\">Blog\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n# LAVIS - A Library for Language-Vision Intelligence\n\n## What's New: 🎉 \n  * [Model Release] November 2023, released implementation of **X-InstructBLIP** \u003Cbr>\n  [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18799.pdf), [Project Page](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fxinstructblip), [Website](https:\u002F\u002Fartemisp.github.io\u002FX-InstructBLIP-page\u002F), [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsalesforce\u002FLAVIS\u002Fblob\u002Fmain\u002Fprojects\u002Fxinstructblip\u002Fdemo\u002Frun_demo.ipynb)\n  > A simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities (image, video, audio, 3D) without extensive modality-specific customization.\n  * [Model Release] July 2023, released implementation of **BLIP-Diffusion** \u003Cbr>\n  [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06500), [Project Page](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fblip-diffusion), [Website](https:\u002F\u002Fdxli94.github.io\u002FBLIP-Diffusion-website\u002F)\n  > A text-to-image generation model that trains 20x than DreamBooth. Also facilitates zero-shot subject-driven generation and editing.\n  * [Model Release] May 2023, released implementation of **InstructBLIP** \u003Cbr>\n  [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06500), [Project Page](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Finstructblip)    \n  > A new vision-language instruction-tuning framework using BLIP-2 models, achieving state-of-the-art zero-shot generalization performance on a wide range of vision-language tasks.\n  * [Model Release] Jan 2023, released implementation of **BLIP-2** \u003Cbr>\n  [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.12597), [Project Page](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fblip2), [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsalesforce\u002FLAVIS\u002Fblob\u002Fmain\u002Fexamples\u002Fblip2_instructed_generation.ipynb)\n  > A generic and efficient pre-training strategy that easily harvests development of pretrained vision models and large language models (LLMs) for vision-language pretraining. BLIP-2 beats Flamingo on zero-shot VQAv2 (**65.0** vs **56.3**), establishing new state-of-the-art on zero-shot captioning (on NoCaps **121.6** CIDEr score vs previous best **113.2**). In addition, equipped with powerful LLMs (e.g. 
OPT, FlanT5), BLIP-2 also unlocks the new **zero-shot instructed vision-to-language generation** capabilities for various interesting applications!\n  * Jan 2023, LAVIS is now available on [PyPI](https:\u002F\u002Fpypi.org\u002Fproject\u002Fsalesforce-lavis\u002F) for installation!\n  * [Model Release] Dec 2022, released implementation of **Img2LLM-VQA** (**CVPR 2023**, _\"From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models\"_, by Jiaxian Guo et al) \u003Cbr>\n  [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10846.pdf), [Project Page](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fimg2llm-vqa), [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsalesforce\u002FLAVIS\u002Fblob\u002Fmain\u002Fprojects\u002Fimg2llm-vqa\u002Fimg2llm_vqa.ipynb)\n  > A plug-and-play module that enables off-the-shelf use of Large Language Models (LLMs) for visual question answering (VQA). Img2LLM-VQA surpasses Flamingo on zero-shot VQA on VQAv2 (61.9 vs 56.3), while in contrast requiring no end-to-end training! \n  * [Model Release] Oct 2022, released implementation of **PNP-VQA** (**EMNLP Findings 2022**, _\"Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training\"_, by Anthony T.M.H. et al), \u003Cbr> \n  [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.08773), [Project Page](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fpnp-vqa), [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsalesforce\u002FLAVIS\u002Fblob\u002Fmain\u002Fprojects\u002Fpnp-vqa\u002Fpnp_vqa.ipynb))\n  >  A modular zero-shot VQA framework that requires no PLMs training, achieving SoTA zero-shot VQA performance. \n\n## Technical Report and Citing LAVIS\nYou can find more details in our [technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.09019).\n\n**If you're using LAVIS in your research or applications, please cite it using this BibTeX**:\n```bibtex\n@inproceedings{li-etal-2023-lavis,\n    title = \"{LAVIS}: A One-stop Library for Language-Vision Intelligence\",\n    author = \"Li, Dongxu  and\n      Li, Junnan  and\n      Le, Hung  and\n      Wang, Guangsen  and\n      Savarese, Silvio  and\n      Hoi, Steven C.H.\",\n    booktitle = \"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)\",\n    month = jul,\n    year = \"2023\",\n    address = \"Toronto, Canada\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Faclanthology.org\u002F2023.acl-demo.3\",\n    pages = \"31--41\",\n    abstract = \"We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications. LAVIS aims to serve as a one-stop comprehensive library that brings recent advancements in the language-vision field accessible for researchers and practitioners, as well as fertilizing future research and development. It features a unified interface to easily access state-of-the-art image-language, video-language models and common datasets. LAVIS supports training, evaluation and benchmarking on a rich variety of tasks, including multimodal classification, retrieval, captioning, visual question answering, dialogue and pre-training. 
In the meantime, the library is also highly extensible and configurable, facilitating future development and customization. In this technical report, we describe design principles, key components and functionalities of the library, and also present benchmarking results across common language-vision tasks.\",\n}\n```\n\n\n## Table of Contents\n  - [Introduction](#introduction)\n  - [Installation](#installation)\n  - [Getting Started](#getting-started)\n    - [Model Zoo](#model-zoo)\n    - [Image Captioning](#image-captioning)\n    - [Visual question answering (VQA)](#visual-question-answering-vqa)\n    - [Unified Feature Extraction Interface](#unified-feature-extraction-interface)\n    - [Load Datasets](#load-datasets)\n  - [Jupyter Notebook Examples](#jupyter-notebook-examples)\n  - [Resources and Tools](#resources-and-tools)\n  - [Documentations](#documentations)\n  - [Ethical and Responsible Use](#ethical-and-responsible-use)\n  - [Technical Report and Citing LAVIS](#technical-report-and-citing-lavis)\n  - [License](#license)\n\n## Introduction\nLAVIS is a Python deep learning library for LAnguage-and-VISion intelligence research and applications. This library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific multimodal scenarios, and benchmark them across standard and customized datasets.\nIt features a unified interface design to access\n- **10+** tasks\n(retrieval, captioning, visual question answering, multimodal classification etc.);\n- **20+** datasets (COCO, Flickr, Nocaps, Conceptual\nCommons, SBU, etc.);\n- **30+** pretrained weights of state-of-the-art foundation language-vision models and their task-specific adaptations, including [ALBEF](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.07651.pdf),\n[BLIP](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12086.pdf), [ALPRO](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09583.pdf), [CLIP](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.00020.pdf).\n\u003Cp align=\"center\">\n    \u003Cbr>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsalesforce_LAVIS_readme_606114eff116.png\"\u002F>\n    \u003Cbr>\n\u003Cp>\n\nKey features of LAVIS include:\n\n- **Unified and Modular Interface**: facilitating to easily leverage and repurpose existing modules (datasets, models, preprocessors), also to add new modules.\n\n- **Easy Off-the-shelf Inference and Feature Extraction**: readily available pre-trained models let you take advantage of state-of-the-art multimodal understanding and generation capabilities on your own data.\n\n- **Reproducible Model Zoo and Training Recipes**: easily replicate and extend state-of-the-art models on existing and new tasks.\n\n- **Dataset Zoo and Automatic Downloading Tools**: it can be a hassle to prepare the many language-vision datasets. LAVIS provides automatic downloading scripts to help prepare a large variety of datasets and their annotations.\n\n\nThe following table shows the supported tasks, datasets and models in our library. 
This is a continuing effort and we are working on further growing the list.\n\n|                  Tasks                   |     Supported Models     |             Supported Datasets             |\n| :--------------------------------------: | :----------------------: | :----------------------------------------: |\n|         Image-text Pre-training          |       ALBEF, BLIP        | COCO, VisualGenome, SBU ConceptualCaptions |\n|           Image-text Retrieval           |    ALBEF, BLIP, CLIP     |              COCO, Flickr30k               |\n|           Text-image Retrieval           |    ALBEF, BLIP, CLIP     |              COCO, Flickr30k               |\n|        Visual Question Answering         |       ALBEF, BLIP        |           VQAv2, OKVQA, A-OKVQA            |\n|             Image Captioning             |           BLIP           |                COCO, NoCaps                |\n|           Image Classification           |           CLIP           |                  ImageNet                  |\n| Natural Language Visual Reasoning (NLVR) |       ALBEF, BLIP        |                   NLVR2                    |\n|          Visual Entailment (VE)          |          ALBEF           |                  SNLI-VE                   |\n|             Visual Dialogue              |           BLIP           |                  VisDial                   |\n|           Video-text Retrieval           |       BLIP, ALPRO        |               MSRVTT, DiDeMo               |\n|           Text-video Retrieval           |       BLIP, ALPRO        |               MSRVTT, DiDeMo               |\n|    Video Question Answering (VideoQA)    |       BLIP, ALPRO        |                MSRVTT, MSVD                |\n|              Video Dialogue              |         VGD-GPT          |                    AVSD                    |\n|      Multimodal Feature Extraction       | ALBEF, CLIP, BLIP, ALPRO |                 customized                 |\n|         Text-to-image Generation         |      [COMING SOON]       |                                            |\n\n## Installation\n\n1. (Optional) Creating conda environment\n\n```bash\nconda create -n lavis python=3.8\nconda activate lavis\n```\n\n2. install from [PyPI](https:\u002F\u002Fpypi.org\u002Fproject\u002Fsalesforce-lavis\u002F)\n```bash\npip install salesforce-lavis\n```\n    \n3. 
Or, for development, you may build from source\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS.git\ncd LAVIS\npip install -e .\n```\n\n## Getting Started\n### Model Zoo\nModel zoo summarizes supported models in LAVIS, to view:\n```python\nfrom lavis.models import model_zoo\nprint(model_zoo)\n# ==================================================\n# Architectures                  Types\n# ==================================================\n# albef_classification           ve\n# albef_feature_extractor        base\n# albef_nlvr                     nlvr\n# albef_pretrain                 base\n# albef_retrieval                coco, flickr\n# albef_vqa                      vqav2\n# alpro_qa                       msrvtt, msvd\n# alpro_retrieval                msrvtt, didemo\n# blip_caption                   base_coco, large_coco\n# blip_classification            base\n# blip_feature_extractor         base\n# blip_nlvr                      nlvr\n# blip_pretrain                  base\n# blip_retrieval                 coco, flickr\n# blip_vqa                       vqav2, okvqa, aokvqa\n# clip_feature_extractor         ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50\n# clip                           ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50\n# gpt_dialogue                   base\n```\n\nLet’s see how to use models in LAVIS to perform inference on example data. We first load a sample image from local.\n\n```python\nimport torch\nfrom PIL import Image\n# setup device to use\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n# load sample image\nraw_image = Image.open(\"docs\u002F_static\u002Fmerlion.png\").convert(\"RGB\")\n```\n\nThis example image shows [Merlion park](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMerlion) ([source](https:\u002F\u002Ftheculturetrip.com\u002Fasia\u002Fsingapore\u002Farticles\u002Fwhat-exactly-is-singapores-merlion-anyway\u002F)), a landmark in Singapore.\n\n\n### Image Captioning\nIn this example, we use the BLIP model to generate a caption for the image. 
To make inference even easier, we also associate each\npre-trained model with its preprocessors (transforms), accessed via ``load_model_and_preprocess()``.\n\n```python\nimport torch\nfrom lavis.models import load_model_and_preprocess\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n# loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.\n# this also loads the associated image processors\nmodel, vis_processors, _ = load_model_and_preprocess(name=\"blip_caption\", model_type=\"base_coco\", is_eval=True, device=device)\n# preprocess the image\n# vis_processors stores image transforms for \"train\" and \"eval\" (validation \u002F testing \u002F inference)\nimage = vis_processors[\"eval\"](raw_image).unsqueeze(0).to(device)\n# generate caption\nmodel.generate({\"image\": image})\n# ['a large fountain spewing water into the air']\n```\n\n### Visual question answering (VQA)\nBLIP model is able to answer free-form questions about images in natural language.\nTo access the VQA model, simply replace the ``name`` and ``model_type`` arguments\npassed to ``load_model_and_preprocess()``.\n\n```python\nfrom lavis.models import load_model_and_preprocess\nmodel, vis_processors, txt_processors = load_model_and_preprocess(name=\"blip_vqa\", model_type=\"vqav2\", is_eval=True, device=device)\n# ask a random question.\nquestion = \"Which city is this photo taken?\"\nimage = vis_processors[\"eval\"](raw_image).unsqueeze(0).to(device)\nquestion = txt_processors[\"eval\"](question)\nmodel.predict_answers(samples={\"image\": image, \"text_input\": question}, inference_method=\"generate\")\n# ['singapore']\n```\n\n### Unified Feature Extraction Interface\n\nLAVIS provides a unified interface to extract features from each architecture. 
\nTo extract features, we load the feature extractor variants of each model.\nThe multimodal feature can be used for multimodal classification.\nThe low-dimensional unimodal features can be used to compute cross-modal similarity.\n\n\n```python\nfrom lavis.models import load_model_and_preprocess\nmodel, vis_processors, txt_processors = load_model_and_preprocess(name=\"blip_feature_extractor\", model_type=\"base\", is_eval=True, device=device)\ncaption = \"a large fountain spewing water into the air\"\nimage = vis_processors[\"eval\"](raw_image).unsqueeze(0).to(device)\ntext_input = txt_processors[\"eval\"](caption)\nsample = {\"image\": image, \"text_input\": [text_input]}\n\nfeatures_multimodal = model.extract_features(sample)\nprint(features_multimodal.multimodal_embeds.shape)\n# torch.Size([1, 12, 768]), use features_multimodal[:,0,:] for multimodal classification tasks\n\nfeatures_image = model.extract_features(sample, mode=\"image\")\nfeatures_text = model.extract_features(sample, mode=\"text\")\nprint(features_image.image_embeds.shape)\n# torch.Size([1, 197, 768])\nprint(features_text.text_embeds.shape)\n# torch.Size([1, 12, 768])\n\n# low-dimensional projected features\nprint(features_image.image_embeds_proj.shape)\n# torch.Size([1, 197, 256])\nprint(features_text.text_embeds_proj.shape)\n# torch.Size([1, 12, 256])\nsimilarity = features_image.image_embeds_proj[:,0,:] @ features_text.text_embeds_proj[:,0,:].t()\nprint(similarity)\n# tensor([[0.2622]])\n```\n\n### Load Datasets\nLAVIS inherently supports a wide variety of common language-vision datasets by providing [automatic download tools](https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Fbenchmark) to help download and organize these datasets. After downloading, to load the datasets, use the following code:\n\n```python\nfrom lavis.datasets.builders import dataset_zoo\ndataset_names = dataset_zoo.get_names()\nprint(dataset_names)\n# ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',\n#  'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',\n#  'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',\n#  'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']\n```\nAfter downloading the images, we can use ``load_dataset()`` to obtain the dataset.\n```python\nfrom lavis.datasets.builders import load_dataset\ncoco_dataset = load_dataset(\"coco_caption\")\nprint(coco_dataset.keys())\n# dict_keys(['train', 'val', 'test'])\nprint(len(coco_dataset[\"train\"]))\n# 566747\nprint(coco_dataset[\"train\"][0])\n# {'image': \u003CPIL.Image.Image image mode=RGB size=640x480>,\n#  'text_input': 'A woman wearing a net on her head cutting a cake. ',\n#  'image_id': 0}\n```\n\nIf you already host a local copy of the dataset, you can pass in the ``vis_path`` argument to change the default location to load images.\n\n```python\ncoco_dataset = load_dataset(\"coco_caption\", vis_path=YOUR_LOCAL_PATH)\n```\n\n## Jupyter Notebook Examples\nSee [examples](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fexamples) for more inference examples, e.g. 
captioning, feature extraction, VQA, GradCam, zeros-shot classification.\n\n## Resources and Tools\n- **Benchmarks**: see [Benchmark](https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Fbenchmark) for instructions to evaluate and train supported models.\n- **Dataset Download and Browsing**: see [Dataset Download](https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Fbenchmark) for instructions and automatic tools on download common language-vision datasets.\n- **GUI Demo**: to run the demo locally, run ```bash run_scripts\u002Frun_demo.sh``` and then follow the instruction on the prompts to view in browser. A web demo is coming soon.\n\n\n## Documentations\nFor more details and advanced usages, please refer to\n[documentation](https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Findex.html#).\n\n## Ethical and Responsible Use\nWe note that models in LAVIS provide no guarantees on their multimodal abilities; incorrect or biased predictions may be observed. In particular, the datasets and pretrained models utilized in LAVIS may contain socioeconomic biases which could result in misclassification and other unwanted behaviors such as offensive or inappropriate speech. We strongly recommend that users review the pre-trained models and overall system in LAVIS before practical adoption. We plan to improve the library by investigating and mitigating these potential biases and\ninappropriate behaviors in the future.\n\n\n## Contact us\nIf you have any questions, comments or suggestions, please do not hesitate to contact us at lavis@salesforce.com.\n\n## License\n[BSD 3-Clause License](LICENSE.txt)\n","\u003Cp align=\"center\">\n    \u003Cbr>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsalesforce_LAVIS_readme_0857ac3abf12.png\" width=\"400\"\u002F>\n    \u003Cbr>\n\u003Cp>\n\n\u003Cdiv align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Freleases\">\u003Cimg alt=\"最新版本\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Fsalesforce\u002FLAVIS.svg\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002Findex.html\">\n  \u003Cimg alt=\"文档\" src=\"https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Factions\u002Fworkflows\u002Fdocs.yaml\u002Fbadge.svg\"\u002F>\n  \u003Ca href=\"https:\u002F\u002Fopensource.org\u002Flicenses\u002FBSD-3-Clause\">\n  \u003Cimg alt=\"许可证\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-BSD_3--Clause-blue.svg\"\u002F>\n  \u003C\u002Fa> \n  \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fproject\u002Fsalesforce-lavis\">\n  \u003Cimg alt=\"下载量\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsalesforce_LAVIS_readme_f617c87ed07f.png\">\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Fbenchmark.html\">基准测试\u003C\u002Fa>,\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.09019\">技术报告\u003C\u002Fa>,\n\u003Ca href=\"https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Findex.html\">文档\u003C\u002Fa>,\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fexamples\">Jupyter Notebook 示例\u003C\u002Fa>,\n\u003Ca href=\"https:\u002F\u002Fblog.salesforceairesearch.com\u002Flavis-language-vision-library\u002F\">博客\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n# 
LAVIS - 语言-视觉智能库\n\n## 最新动态: 🎉 \n  * [模型发布] 2023年11月，发布了 **X-InstructBLIP** 的实现 \u003Cbr>\n  [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18799.pdf), [项目页面](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fxinstructblip), [官网](https:\u002F\u002Fartemisp.github.io\u002FX-InstructBLIP-page\u002F), [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsalesforce\u002FLAVIS\u002Fblob\u002Fmain\u002Fprojects\u002Fxinstructblip\u002Fdemo\u002Frun_demo.ipynb)\n  > 这是一个简单而高效的跨模态框架，基于冻结的大语言模型构建，能够在无需大量特定模态定制的情况下，整合多种模态（图像、视频、音频、3D）。\n  * [模型发布] 2023年7月，发布了 **BLIP-Diffusion** 的实现 \u003Cbr>\n  [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06500), [项目页面](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fblip-diffusion), [官网](https:\u002F\u002Fdxli94.github.io\u002FBLIP-Diffusion-website\u002F)\n  > 这是一种文本到图像生成模型，其训练效率是 DreamBooth 的 20 倍。同时支持零样本的主体驱动生成与编辑。\n  * [模型发布] 2023年5月，发布了 **InstructBLIP** 的实现 \u003Cbr>\n  [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06500), [项目页面](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Finstructblip)    \n  > 这是一个基于 BLIP-2 模型的新型视觉-语言指令微调框架，在广泛的视觉-语言任务中实现了最先进的零样本泛化性能。\n  * [模型发布] 2023年1月，发布了 **BLIP-2** 的实现 \u003Cbr>\n  [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.12597), [项目页面](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fblip2), [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsalesforce\u002FLAVIS\u002Fblob\u002Fmain\u002Fexamples\u002Fblip2_instructed_generation.ipynb)\n  > 这是一种通用且高效的预训练策略，能够轻松利用预训练的视觉模型和大型语言模型（LLMs）进行视觉-语言预训练。BLIP-2 在 VQAv2 零样本任务上超越了 Flamingo（65.0 对 56.3），并在 NoCaps 数据集上的零样本字幕生成任务中创下了新的 SOTA 记录（CIDEr 分数为 121.6，而此前最佳为 113.2）。此外，结合强大的 LLMs（如 OPT、FlanT5），BLIP-2 还解锁了全新的 **零样本指令式视觉到语言生成** 能力，适用于各种有趣的应用场景！\n  * 2023年1月，LAVIS 现已在 [PyPI](https:\u002F\u002Fpypi.org\u002Fproject\u002Fsalesforce-lavis\u002F) 上架，可直接安装！\n  * [模型发布] 2022年12月，发布了 **Img2LLM-VQA** 的实现（CVPR 2023，“从图像到文本提示：使用冻结大型语言模型的零样本 VQA”，作者：郭家贤等） \u003Cbr>\n  [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10846.pdf), [项目页面](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fimg2llm-vqa), [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsalesforce\u002FLAVIS\u002Fblob\u002Fmain\u002Fprojects\u002Fimg2llm-vqa\u002Fimg2llm_vqa.ipynb)\n  > 这是一个即插即用模块，使大型语言模型（LLMs）能够直接用于视觉问答（VQA）。Img2LLM-VQA 在 VQAv2 数据集的零样本 VQA 任务上超越了 Flamingo（61.9 对 56.3），而且完全不需要端到端的训练！\n  * [模型发布] 2022年10月，发布了 **PNP-VQA** 的实现（EMNLP Findings 2022，“即插即用 VQA：通过组合大型预训练模型实现零训练的零样本 VQA”，作者：Anthony T.M.H. 
等）， \u003Cbr> \n  [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.08773), [项目页面](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fpnp-vqa), [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsalesforce\u002FLAVIS\u002Fblob\u002Fmain\u002Fprojects\u002Fpnp-vqa\u002Fpnp_vqa.ipynb))\n  > 这是一个模块化的零样本 VQA 框架，无需对 PLMs 进行训练，即可达到 SOTA 的零样本 VQA 性能。\n\n## 技术报告与引用 LAVIS\n您可以在我们的[技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.09019)中找到更多详细信息。\n\n**如果您在研究或应用中使用了 LAVIS，请使用以下 BibTeX 格式引用它**：\n```bibtex\n@inproceedings{li-etal-2023-lavis,\n    title = \"{LAVIS}: 一站式语言-视觉智能库\",\n    author = \"Li, Dongxu  and\n      Li, Junnan  and\n      Le, Hung  and\n      Wang, Guangsen  and\n      Savarese, Silvio  and\n      Hoi, Steven C.H.\",\n    booktitle = \"第61届计算语言学协会年会论文集（第3卷：系统演示）\",\n    month = jul,\n    year = \"2023\",\n    address = \"多伦多，加拿大\",\n    publisher = \"计算语言学协会\",\n    url = \"https:\u002F\u002Faclanthology.org\u002F2023.acl-demo.3\",\n    pages = \"31--41\",\n    abstract = \"我们介绍了 LAVIS，一个用于语言-视觉研究和应用的开源深度学习库。LAVIS 致力于成为一个一站式的综合库，使语言-视觉领域的最新进展更容易被研究人员和从业者所使用，并为未来的研究与发展提供助力。该库提供统一的接口，方便用户访问最先进的图像-语言、视频-语言模型以及常用数据集。LAVIS 支持在多种任务上的训练、评估和基准测试，包括多模态分类、检索、图像字幕生成、视觉问答、对话以及预训练等。同时，该库还具有高度可扩展性和可配置性，便于未来的开发与定制。在本技术报告中，我们描述了该库的设计原则、关键组件和功能，并展示了在常见语言-视觉任务上的基准测试结果。\",\n}\n```\n\n\n## 目录\n  - [简介](#introduction)\n  - [安装](#installation)\n  - [快速入门](#getting-started)\n    - [模型库](#model-zoo)\n    - [图像字幕生成](#image-captioning)\n    - [视觉问答 (VQA)](#visual-question-answering-vqa)\n    - [统一特征提取接口](#unified-feature-extraction-interface)\n    - [加载数据集](#load-datasets)\n  - [Jupyter Notebook 示例](#jupyter-notebook-examples)\n  - [资源与工具](#resources-and-tools)\n  - [文档](#documentations)\n  - [伦理与负责任使用](#ethical-and-responsible-use)\n  - [技术报告与引用 LAVIS](#technical-report-and-citing-lavis)\n  - [许可证](#license)\n\n## 简介\nLAVIS 是一个用于语言-视觉智能研究和应用的 Python 深度学习库。该库旨在为工程师和研究人员提供一站式的解决方案，以快速开发适用于其特定多模态场景的模型，并在标准及自定义数据集上进行基准测试。\n它采用统一的接口设计，支持访问：\n- **10+** 种任务\n（检索、字幕生成、视觉问答、多模态分类等）；\n- **20+** 种数据集（COCO、Flickr、Nocaps、Conceptual Commons、SBU 等）；\n- **30+** 套最先进的基础语言-视觉模型及其任务特定适配的预训练权重，包括 [ALBEF](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.07651.pdf),\n[BLIP](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12086.pdf), [ALPRO](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.09583.pdf), [CLIP](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.00020.pdf)。\n\u003Cp align=\"center\">\n    \u003Cbr>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsalesforce_LAVIS_readme_606114eff116.png\"\u002F>\n    \u003Cbr>\n\u003Cp>\n\nLAVIS 的主要特点包括：\n\n- **统一且模块化的接口**：便于轻松利用和复用现有模块（数据集、模型、预处理工具），同时也支持添加新模块。\n\n- **开箱即用的推理与特征提取**：现成的预训练模型让您能够直接在自己的数据上使用最先进的多模态理解与生成能力。\n\n- **可复现的模型库与训练配方**：您可以轻松复制并扩展现有及新任务上的最先进模型。\n\n- **数据集库与自动下载工具**：准备众多语言-视觉数据集往往非常繁琐。LAVIS 提供自动下载脚本，帮助您快速准备各种数据集及其标注。\n\n\n下表展示了我们库中支持的任务、数据集和模型。这是一项持续的工作，我们正在不断扩充列表。\n\n|                  任务                   |     支持的模型     |             支持的数据集             |\n| :--------------------------------------: | :----------------------: | :----------------------------------------: |\n|         图像-文本预训练          |       ALBEF, BLIP        | COCO, VisualGenome, SBU ConceptualCaptions |\n|           图像-文本检索           |    ALBEF, BLIP, CLIP     |              COCO, Flickr30k               |\n|           文本-图像检索           |    ALBEF, BLIP, CLIP     |              COCO, Flickr30k               |\n|        
视觉问答         |       ALBEF, BLIP        |           VQAv2, OKVQA, A-OKVQA            |\n|             图像字幕生成             |           BLIP           |                COCO, NoCaps                |\n|           图像分类           |           CLIP           |                  ImageNet                  |\n| 自然语言视觉推理 (NLVR) |       ALBEF, BLIP        |                   NLVR2                    |\n|          视觉蕴含 (VE)          |          ALBEF           |                  SNLI-VE                   |\n|             视觉对话              |           BLIP           |                  VisDial                   |\n|           视频-文本检索           |       BLIP, ALPRO        |               MSRVTT, DiDeMo               |\n|           文本-视频检索           |       BLIP, ALPRO        |               MSRVTT, DiDeMo               |\n|    视频问答 (VideoQA)    |       BLIP, ALPRO        |                MSRVTT, MSVD                |\n|              视频对话              |         VGD-GPT          |                    AVSD                    |\n|      多模态特征提取       | ALBEF, CLIP, BLIP, ALPRO |                 定制数据集                 |\n|         文本-图像生成         |      [即将推出]       |                                            |\n\n## 安装\n\n1. （可选）创建 conda 环境\n\n```bash\nconda create -n lavis python=3.8\nconda activate lavis\n```\n\n2. 从 [PyPI](https:\u002F\u002Fpypi.org\u002Fproject\u002Fsalesforce-lavis\u002F) 安装\n\n```bash\npip install salesforce-lavis\n```\n    \n3. 或者，如果您需要进行开发，可以从源代码构建\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS.git\ncd LAVIS\npip install -e .\n```\n\n## 快速入门\n### 模型库\n模型库汇总了 LAVIS 中支持的所有模型，查看方法如下：\n```python\nfrom lavis.models import model_zoo\nprint(model_zoo)\n# ==================================================\n\n# 架构                  类型\n# ==================================================\n# albef_classification           ve\n# albef_feature_extractor        base\n# albef_nlvr                     nlvr\n# albef_pretrain                 base\n# albef_retrieval                coco, flickr\n# albef_vqa                      vqav2\n# alpro_qa                       msrvtt, msvd\n# alpro_retrieval                msrvtt, didemo\n# blip_caption                   base_coco, large_coco\n# blip_classification            base\n# blip_feature_extractor         base\n# blip_nlvr                      nlvr\n# blip_pretrain                  base\n# blip_retrieval                 coco, flickr\n# blip_vqa                       vqav2, okvqa, aokvqa\n# clip_feature_extractor         ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50\n# clip                           ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50\n# gpt_dialogue                   base\n让我们看看如何在 LAVIS 中使用模型对示例数据进行推理。首先，我们从本地加载一张示例图像。\n\n```python\nimport torch\nfrom PIL import Image\n# 设置使用的设备\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n# 加载示例图像\nraw_image = Image.open(\"docs\u002F_static\u002Fmerlion.png\").convert(\"RGB\")\n```\n\n这张示例图像展示了[鱼尾狮公园](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMerlion)（[来源](https:\u002F\u002Ftheculturetrip.com\u002Fasia\u002Fsingapore\u002Farticles\u002Fwhat-exactly-is-singapores-merlion-anyway\u002F))，这是新加坡的一个地标性建筑。\n\n\n### 图像字幕生成\n在这个例子中，我们使用 BLIP 模型为该图像生成字幕。为了使推理更加简便，我们还通过 ``load_model_and_preprocess()`` 将每个预训练模型与其预处理工具（变换）关联起来。\n\n```python\nimport torch\nfrom lavis.models import load_model_and_preprocess\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n# 加载 BLIP 字幕基础模型，并在其上进行了 MSCOCO 字幕数据集的微调。\n# 同时也加载了相关的图像处理器\nmodel, 
vis_processors, _ = load_model_and_preprocess(name=\"blip_caption\", model_type=\"base_coco\", is_eval=True, device=device)\n# 对图像进行预处理\n# vis_processors 存储了用于“训练”和“评估”（验证\u002F测试\u002F推理）的图像变换\nimage = vis_processors[\"eval\"](raw_image).unsqueeze(0).to(device)\n# 生成字幕\nmodel.generate({\"image\": image})\n# ['a large fountain spewing water into the air']\n```\n\n### 视觉问答（VQA）\nBLIP 模型能够用自然语言回答关于图像的自由式问题。要访问 VQA 模型，只需替换传递给 ``load_model_and_preprocess()`` 的 ``name`` 和 ``model_type`` 参数即可。\n\n```python\nfrom lavis.models import load_model_and_preprocess\nmodel, vis_processors, txt_processors = load_model_and_preprocess(name=\"blip_vqa\", model_type=\"vqav2\", is_eval=True, device=device)\n# 提出一个随机问题。\nquestion = \"Which city is this photo taken?\"\nimage = vis_processors[\"eval\"](raw_image).unsqueeze(0).to(device)\nquestion = txt_processors[\"eval\"](question)\nmodel.predict_answers(samples={\"image\": image, \"text_input\": question}, inference_method=\"generate\")\n# ['singapore']\n```\n\n### 统一特征提取接口\n\nLAVIS 提供了一个统一的接口来提取每种架构的特征。为了提取特征，我们加载每个模型的特征提取变体。多模态特征可用于多模态分类。低维单模态特征则可用于计算跨模态相似度。\n\n\n```python\nfrom lavis.models import load_model_and_preprocess\nmodel, vis_processors, txt_processors = load_model_and_preprocess(name=\"blip_feature_extractor\", model_type=\"base\", is_eval=True, device=device)\ncaption = \"a large fountain spewing water into the air\"\nimage = vis_processors[\"eval\"](raw_image).unsqueeze(0).to(device)\ntext_input = txt_processors[\"eval\"](caption)\nsample = {\"image\": image, \"text_input\": [text_input]}\n\nfeatures_multimodal = model.extract_features(sample)\nprint(features_multimodal.multimodal_embeds.shape)\n# torch.Size([1, 12, 768]), use features_multimodal[:,0,:] for multimodal classification tasks\n\nfeatures_image = model.extract_features(sample, mode=\"image\")\nfeatures_text = model.extract_features(sample, mode=\"text\")\nprint(features_image.image_embeds.shape)\n# torch.Size([1, 197, 768])\nprint(features_text.text_embeds.shape)\n# torch.Size([1, 12, 768])\n\n# low-dimensional projected features\nprint(features_image.image_embeds_proj.shape)\n# torch.Size([1, 197, 256])\nprint(features_text.text_embeds_proj.shape)\n# torch.Size([1, 12, 256])\nsimilarity = features_image.image_embeds_proj[:,0,:] @ features_text.text_embeds_proj[:,0,:].t()\nprint(similarity)\n# tensor([[0.2622]])\n```\n\n### 加载数据集\nLAVIS 内置支持多种常见的语言-视觉数据集，提供了[自动下载工具](https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Fbenchmark)，帮助下载和整理这些数据集。下载完成后，可以使用以下代码加载数据集：\n\n```python\nfrom lavis.datasets.builders import dataset_zoo\ndataset_names = dataset_zoo.get_names()\nprint(dataset_names)\n# ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',\n#  'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',\n#  'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',\n#  'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']\n```\n下载完图像后，我们可以使用 ``load_dataset()`` 来获取数据集。\n```python\nfrom lavis.datasets.builders import load_dataset\ncoco_dataset = load_dataset(\"coco_caption\")\nprint(coco_dataset.keys())\n# dict_keys(['train', 'val', 'test'])\nprint(len(coco_dataset[\"train\"]))\n# 566747\nprint(coco_dataset[\"train\"][0])\n# {'image': \u003CPIL.Image.Image image mode=RGB size=640x480>,\n#  'text_input': 'A woman wearing a net on her head cutting a cake. 
',\n#  'image_id': 0}\n```\n\n如果你已经托管了数据集的本地副本，可以通过传递 ``vis_path`` 参数来更改默认的图像加载路径。\n\n```python\ncoco_dataset = load_dataset(\"coco_caption\", vis_path=YOUR_LOCAL_PATH)\n```\n\n## Jupyter Notebook 示例\n更多推理示例，例如字幕生成、特征提取、VQA、GradCam、零样本分类等，请参阅[示例](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fexamples)。\n\n## 资源与工具\n- **基准测试**：请参阅[基准测试](https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Fbenchmark)以获取评估和训练支持模型的说明。\n- **数据集下载与浏览**：请参阅[数据集下载](https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Fbenchmark)以获取常见语言-视觉数据集的下载说明及自动化工具。\n- **GUI 演示**：要在本地运行演示，请执行 ```bash run_scripts\u002Frun_demo.sh```，然后按照提示在浏览器中查看。网页版演示即将推出。\n\n## 文档\n更多详细信息和高级用法，请参阅\n[文档](https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Findex.html#)。\n\n## 伦理与负责任的使用\n我们注意到，LAVIS 中的模型对其多模态能力不提供任何保证；可能会出现错误或有偏见的预测。特别是，LAVIS 中使用的数据集和预训练模型可能包含社会经济偏见，从而导致分类错误以及其他不良行为，例如冒犯性或不恰当的言论。我们强烈建议用户在实际应用之前，仔细审查 LAVIS 中的预训练模型和整个系统。我们计划在未来通过研究和缓解这些潜在的偏见及不当行为来改进该库。\n\n\n## 联系我们\n如果您有任何问题、意见或建议，请随时通过 lavis@salesforce.com 与我们联系。\n\n## 许可证\n[BSD 3-Clause 许可证](LICENSE.txt)","# LAVIS 快速上手指南\n\nLAVIS (Language-Vision Intelligence) 是由 Salesforce 开源的一站式语言 - 视觉智能库。它提供了统一的接口，支持图像描述、视觉问答 (VQA)、图文检索等 10+ 种任务，集成了 BLIP-2、InstructBLIP、CLIP 等 30+ 个预训练模型，并支持自动下载 20+ 个常用数据集。\n\n## 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **操作系统**: Linux 或 macOS (Windows 用户建议使用 WSL2 或 Docker)\n*   **Python 版本**: 推荐 Python 3.8 或更高版本\n*   **依赖管理**: 推荐使用 `conda` 进行环境隔离\n*   **GPU**: 可选，但运行大模型推理或训练时强烈建议配备 NVIDIA GPU 及对应的 CUDA 环境\n\n## 安装步骤\n\n### 方法一：通过 PyPI 安装（推荐）\n\n这是最快捷的安装方式，适合直接使用预训练模型进行推理的用户。\n\n```bash\n# 1. (可选) 创建并激活 conda 环境\nconda create -n lavis python=3.8\nconda activate lavis\n\n# 2. 安装 salesforce-lavis\npip install salesforce-lavis\n```\n\n> **国内加速提示**：如果下载速度较慢，可使用清华或阿里镜像源：\n> `pip install salesforce-lavis -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n\n### 方法二：从源码安装（开发模式）\n\n如果您需要修改源码或运行最新的实验性功能，请选择此方式。\n\n```bash\n# 1. 克隆仓库\ngit clone https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS.git\ncd LAVIS\n\n# 2. (可选) 创建并激活 conda 环境\nconda create -n lavis python=3.8\nconda activate lavis\n\n# 3. 以可编辑模式安装\npip install -e .\n```\n\n## 基本使用\n\nLAVIS 的核心优势在于其统一的接口，只需几行代码即可加载预训练模型并进行推理。以下以**图像描述生成 (Image Captioning)** 为例。\n\n### 1. 最简单的使用示例\n\n确保已安装 `Pillow` 用于图像处理 (`pip install pillow`)，然后运行以下 Python 代码：\n\n```python\nimport torch\nfrom PIL import Image\nfrom lavis.models import load_model_and_preprocess\n\n# 设置设备\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n\n# 加载模型和预处理工具\n# 这里以 blip_caption 为例，也可替换为 blip2_feature_extractor, instruct_blip_vicuna7b 等\nmodel, vis_processors, _ = load_model_and_preprocess(\n    name=\"blip_caption\", \n    model_type=\"base_coco\", \n    is_eval=True, \n    device=device\n)\n\n# 加载图片\nraw_image = Image.open(\"your_image.jpg\").convert(\"RGB\")\n\n# 预处理图片\nimage = vis_processors[\"eval\"](raw_image).unsqueeze(0).to(device)\n\n# 生成描述\ncaption = model.generate({\"image\": image})\nprint(caption[0])\n```\n\n### 2. 
视觉问答 (VQA) 示例\n\n```python\nfrom lavis.models import load_model_and_preprocess\nfrom PIL import Image\nimport torch\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n\n# 加载 VQA 模型\nmodel, vis_processors, txt_processors = load_model_and_preprocess(\n    name=\"blip_vqa\", \n    model_type=\"vqav2\", \n    is_eval=True, \n    device=device\n)\n\n# 准备数据\nraw_image = Image.open(\"your_image.jpg\").convert(\"RGB\")\nquestion = \"How many people are in the image?\"\n\n# 预处理\nimage = vis_processors[\"eval\"](raw_image).unsqueeze(0).to(device)\ntext_input = txt_processors[\"eval\"](question)\n\n# 推理\nanswer = model.predict_answers(samples={\"image\": image, \"text_input\": [text_input]})\nprint(answer[0])\n```\n\n### 3. 可用模型速查\n\n您可以通过 `load_model_and_preprocess` 函数轻松切换不同的模型架构。常用模型名称包括：\n\n*   **图像描述**: `blip_caption`, `instruct_blip_flant5xl`\n*   **视觉问答**: `blip_vqa`, `instruct_blip_vicuna7b`\n*   **特征提取**: `blip2_feature_extractor`, `clip_feature_extractor`\n*   **图文检索**: `blip_retrieval`, `albeft_retrieval`\n\n更多模型详情及参数配置，请参考官方文档或运行 `lavis\u002Fmodels\u002F__init__.py` 查看支持的模型列表。","某电商平台的运营团队需要快速处理海量商品图片，自动生成包含细节描述和营销卖点的高质量文案，以应对大促期间的上新需求。\n\n### 没有 LAVIS 时\n- **开发门槛极高**：团队需分别寻找并整合独立的视觉编码器与大语言模型，编写复杂的对齐代码，耗时数周才能跑通原型。\n- **泛化能力不足**：传统模型只能输出刻板的“这是一双鞋”，无法理解“适合雨天穿着”或“复古风格”等深层语义指令。\n- **多任务维护困难**：针对图片描述、视觉问答、内容编辑等不同需求，需训练和维护多套独立模型，服务器资源消耗巨大。\n- **冷启动成本高**：面对新类目的商品，缺乏零样本（Zero-shot）能力，必须收集大量标注数据重新微调模型才能使用。\n\n### 使用 LAVIS 后\n- **一站式快速集成**：直接调用 LAVIS 中预训练的 BLIP-2 或 InstructBLIP 模型，几行代码即可在一天内部署具备图文理解能力的服务。\n- **指令跟随能力强**：利用指令微调框架，模型能精准响应“写一段突出透气性的小红书风格文案”等复杂自然语言指令。\n- **统一架构降本增效**：通过一个库支持图像描述、视觉问答及主体驱动生成（BLIP-Diffusion）等多种任务，显著降低算力与维护成本。\n- **强大的零样本泛化**：无需额外训练，模型即可直接理解从未见过的新奇商品特征，立即投入生产环境使用。\n\nLAVIS 将原本繁琐的多模态算法研发转化为标准化的 API 调用，让业务团队能专注于创意策略而非底层模型构建。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsalesforce_LAVIS_0857ac3a.png","salesforce","Salesforce","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fsalesforce_6ff2d82a.png","Premier Open Source projects released and supported by Salesforce.",null,"osscore@salesforce.com","https:\u002F\u002Fopensource.salesforce.com","https:\u002F\u002Fgithub.com\u002Fsalesforce",[84,88,92],{"name":85,"color":86,"percentage":87},"Jupyter Notebook","#DA5B0B",74.3,{"name":89,"color":90,"percentage":91},"Python","#3572A5",25.6,{"name":93,"color":94,"percentage":95},"Shell","#89e051",0.1,11194,1102,"2026-04-03T04:44:03","BSD-3-Clause","未说明","未说明（但涉及深度学习模型训练与推理，通常建议配备 NVIDIA GPU）",{"notes":103,"python":104,"dependencies":105},"README 中未明确列出详细的硬件配置（如显存大小、CUDA 版本）和具体依赖库版本号。建议使用 conda 创建 Python 3.8 环境进行安装。该库支持多种多模态任务（如图像描述、视觉问答等），运行不同模型（如 BLIP-2, InstructBLIP）时对显存的需求差异较大，大模型推理通常需要较高显存。","3.8+",[106,107,108,109],"torch","transformers","accelerate","salesforce-lavis",[26,54,14,51,13],[112,113,114,75,115,116,117,118,119,120,121],"deep-learning","deep-learning-library","image-captioning","vision-and-language","vision-framework","vision-language-pretraining","vision-language-transformer","visual-question-anwsering","multimodal-datasets","multimodal-deep-learning",7,"2026-03-27T02:49:30.150509","2026-04-06T08:17:38.429460",[126,131,136,141,145,150],{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},11864,"如何解决 'cannot import name _expand_mask' 的导入错误？","该问题通常由 transformers 库版本不兼容引起。解决方案有两种：\n1. 将 transformers 版本降级或升级到兼容版本，如 4.31.0 或 4.33.0（命令：pip install transformers==4.33.0）。\n2. 
手动复制缺失的函数定义到代码中：\n```python\n# Copied from transformers.models.bart.modeling_bart._expand_mask\ndef _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):\n    bsz, src_len = mask.size()\n    tgt_len = tgt_len if tgt_len is not None else src_len\n    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)\n    inverted_mask = 1.0 - expanded_mask\n    return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)\n```\n然后重新安装项目：python setup.py install。","https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Fissues\u002F571",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},11865,"如何在 Mac (M1\u002FM2) 上解决 Decord 依赖安装失败的问题？","Mac 用户（特别是 Python 3.9+）常因缺少预编译二进制文件而无法安装 decord。解决方法如下：\n1. 使用 Homebrew 安装特定版本的 ffmpeg：brew install ffmpeg@4（注意：ffmpeg@5 不兼容）。\n2. 如果已安装 ffmpeg@5，需覆盖链接：brew link --overwrite ffmpeg@4。\n3. 从源码安装 decord：pip install decord --no-binary decord 或克隆仓库后编译。\n4. 从源码安装 LAVIS：pip install -e .\n注意：可能会遇到 open3d 与 Python 版本的冲突，需仔细调整环境版本。","https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Fissues\u002F15",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},11866,"BLIP-2 第一阶段预训练完成后，如何进行评估？","官方脚本中可能未直接包含第一阶段的评估脚本，但可以通过以下方式操作：\n1. 检查数据集配置文件（如 lavis\u002Fconfigs\u002Fdatasets\u002Fcoco\u002Fdefaults_cap.yaml），确认其中包含了 val 或 test 分割的标注和图像路径。\n2. 使用现有的评估流程，指定对应的数据集配置和模型检查点。\n3. 对于自定义修改的模型架构，可能需要参考 defaults_cap.yaml 中的结构自行编写评估脚本，加载验证集数据进行推理并计算指标。","https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Fissues\u002F774",{"id":142,"question_zh":143,"answer_zh":144,"source_url":140},11867,"在哪里可以找到 BLIP-2 预训练的 model_name 和 model_type 配置，以及如何加载训练好的检查点？","model_name 和 model_type 分别对应模型的架构和类型，详细信息可在 BLIP-2 Model Zoo 页面查看。\n若要加载第二阶段训练完成的检查点（例如 Pretrain_stage2\u002F...\u002Fcheckpoint_276.pth），需要修改对应的 YAML 配置文件。\n例如，如果使用 pretrain_opt6.7b 模型，配置文件位于：lavis\u002Fconfigs\u002Fmodels\u002Fblip2\u002Fblip2_pretrain_opt6.7b.yaml。\n在该文件中指定 checkpoint 路径即可加载权重进行推理或微调。",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},11868,"为什么我在 GQA 数据集上的 VQA 结果远低于论文报告的值（如 33.55 vs 44.7）？","性能差异通常由生成参数设置不当引起。请确保使用以下关键设置：\n1. Prompt 格式：\"Question: {} Short answer:\"\n2. Beam search width 设置为 5。\n3. Length penalty 设置为 -1。\n4. 最大生成长度（max_len）应设置为 10（默认可能是 30，这会显著影响准确率）。\n此外，确保使用了正确的评估脚本和数据处理流程。官方已在后续提交中修复了 GQA 评估的相关问题，建议拉取最新代码。","https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Fissues\u002F99",{"id":151,"question_zh":152,"answer_zh":153,"source_url":154},11869,"从头预训练 BLIP-2 时损失难以下降，有什么建议？","损失波动在短期窗口内是正常的，但如果长期不下降，可能与语言模型规模有关。\n经验表明，使用过小的语言模型（如 1B 参数）很难在不进行微调的情况下获得良好结果。\n建议将语言模型替换为更大规模的模型（如 6B 参数或以上），这能显著提升预训练效果和收敛速度。\n此外，若要在中文数据集上预训练，需将英文语言模型替换为支持中文的预训练语言模型。","https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Fissues\u002F149",[156,161,166,171,176],{"id":157,"version":158,"summary_zh":159,"released_at":160},62309,"v1.0.2","修复 BLIP-2 在 CPU 上的混合精度问题。\n\n已在 PyPI 上发布：https:\u002F\u002Fpypi.org\u002Fproject\u002Fsalesforce-lavis\u002F","2023-03-06T12:19:44",{"id":162,"version":163,"summary_zh":164,"released_at":165},62310,"v1.0.1","已修复 BLIP2 中的已知问题。感谢社区的反馈。","2023-03-03T09:03:21",{"id":167,"version":168,"summary_zh":169,"released_at":170},62311,"v1.0.0","* 添加了 BLIP-2 模型；\n* 在 PyPI 上发布；\n* 修复了已知问题。","2023-01-30T17:06:54",{"id":172,"version":173,"summary_zh":174,"released_at":175},62312,"v0.1.1","- 将 PnP-VQA 集成到 LAVIS 中。\n- 修复已知问题。","2022-10-25T06:28:39",{"id":177,"version":178,"summary_zh":179,"released_at":180},62313,"v0.1.0","LAVIS库的第一个版本。","2022-09-19T17:33:35"]
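
The GQA troubleshooting entry in the FAQ data above gives its recommended generation settings only in prose: prompt template "Question: {} Short answer:", beam search width 5, length penalty -1, and a maximum generation length (`max_len`) of 10. Below is a minimal, hedged sketch of how those settings could be passed to `predict_answers` on a BLIP-2 checkpoint. Only `max_len` is named explicitly in the FAQ; the `blip2_opt` / `pretrain_opt2.7b` model choice, the keyword names `num_beams`, `length_penalty`, `prompt`, and the list-wrapped `text_input` are assumptions patterned on the LAVIS examples elsewhere on this page and should be verified against the installed version.

```python
# Hedged sketch: wiring the GQA FAQ's generation settings into a LAVIS call.
# The model name/type and some keyword names are assumptions; check them
# against your installed salesforce-lavis version before relying on this.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumed BLIP-2 OPT variant from the LAVIS model zoo.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_opt",
    model_type="pretrain_opt2.7b",
    is_eval=True,
    device=device,
)

raw_image = Image.open("your_image.jpg").convert("RGB")
question = "What color is the car?"

# Same preprocessing pattern as the README/quickstart VQA examples.
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"](question)

answers = model.predict_answers(
    samples={"image": image, "text_input": [question]},
    inference_method="generate",
    prompt="Question: {} Short answer:",  # prompt template from the FAQ
    num_beams=5,                          # beam width recommended in the FAQ
    length_penalty=-1,                    # length penalty recommended in the FAQ
    max_len=10,                           # FAQ: the default (30) hurts accuracy
)
print(answers[0])
```

As with the other examples on this page, swapping `name`/`model_type` selects a different architecture; the generation keywords shown here only matter for free-form answer generation, not for the ranking-based inference used by some smaller VQA models.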