[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-YingqingHe--Awesome-LLMs-meet-Multimodal-Generation":3,"tool-YingqingHe--Awesome-LLMs-meet-Multimodal-Generation":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",158594,2,"2026-04-16T23:34:05",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 
操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":78,"owner_email":78,"owner_twitter":78,"owner_website":78,"owner_url":79,"languages":80,"stars":85,"forks":86,"last_commit_at":87,"license":78,"difficulty_score":88,"env_os":89,"env_gpu":90,"env_ram":90,"env_deps":91,"category_tags":94,"github_topics":98,"view_count":32,"oss_zip_url":78,"oss_zip_packed_at":78,"status":17,"created_at":116,"updated_at":117,"faqs":118,"releases":149},8347,"YingqingHe\u002FAwesome-LLMs-meet-Multimodal-Generation","Awesome-LLMs-meet-Multimodal-Generation","🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).","Awesome-LLMs-meet-Multimodal-Generation 是一个专注于“大语言模型（LLM）与多模态生成”领域的精选论文资源库。它系统性地整理了利用大语言模型进行图像、视频、3D 内容以及音频（含语音、音乐）生成与编辑的前沿学术成果。\n\n在 AI 技术快速迭代的当下，研究人员往往难以从海量文献中精准定位到结合 LLM 推理能力与多模态生成的关键论文。Awesome-LLMs-meet-Multimodal-Generation 正是为了解决这一痛点而生，它将分散的研究成果按模态（如图像、视频）、任务类型（生成、编辑、智能体）以及技术路线（基于 LLM 或非 LLM）进行了细致分类，并提供了关于安全性、数据集及基准测试的全面综述。\n\n该资源库特别适合人工智能领域的研究人员、算法工程师及技术开发者使用。无论是希望追踪最新学术动态的学者，还是正在寻找技术灵感以构建多模态应用的专业人士，都能从中获益。其独特亮点在于不仅收录了纯 LLM 驱动的方法，还对比了传统方案，并支持通过作者名、标签（如“交互式”、“人体运动生成”）等多种方式快速检索论文，极大地提升了文献调研的效率。","Awesome-LLMs-meet-Multimodal-Generation 是一个专注于“大语言模型（LLM）与多模态生成”领域的精选论文资源库。它系统性地整理了利用大语言模型进行图像、视频、3D 内容以及音频（含语音、音乐）生成与编辑的前沿学术成果。\n\n在 AI 技术快速迭代的当下，研究人员往往难以从海量文献中精准定位到结合 LLM 推理能力与多模态生成的关键论文。Awesome-LLMs-meet-Multimodal-Generation 正是为了解决这一痛点而生，它将分散的研究成果按模态（如图像、视频）、任务类型（生成、编辑、智能体）以及技术路线（基于 LLM 或非 LLM）进行了细致分类，并提供了关于安全性、数据集及基准测试的全面综述。\n\n该资源库特别适合人工智能领域的研究人员、算法工程师及技术开发者使用。无论是希望追踪最新学术动态的学者，还是正在寻找技术灵感以构建多模态应用的专业人士，都能从中获益。其独特亮点在于不仅收录了纯 LLM 驱动的方法，还对比了传统方案，并支持通过作者名、标签（如“交互式”、“人体运动生成”）等多种方式快速检索论文，极大地提升了文献调研的效率。作为一个开放社区项目，它也欢迎全球开发者共同贡献，是探索大模型如何赋能视觉与听觉内容创作的宝贵指南。","\n\u003Cdiv align=\"center\">\n\u003Ch2> LLMs Meet Multimodal Generation and Editing: A Survey \u003C\u002Fh2> \n\u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.19334'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArXiv-2405.19334-red'>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n# 🤗 Introduction\n- This repository contains a curated list of **LLMs meet multimodal generation**. Modalities consist of visual (including image, video and 3D) and audio (including sound, speech and music). 
\n  \u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FYingqingHe_Awesome-LLMs-meet-Multimodal-Generation_readme_73831b1439a4.jpg\" width=\"300\">\n\u003C\u002Fp>\n\n- We welcome any contributions and suggestions to our repository, as well as the addition of your own work. Feel free to make a pull request or leave your comments!!\n\n\n# 📋 Contents\n- [🤗 Introduction](#-introduction)\n- [📋 Contents](#-contents)\n- [💘 Tips](#-tips)\n- [📍 Multimodal Generation](#-multimodal-generation)\n  - [Image Generation](#image-generation)\n    - [🔅 LLM-based](#-llm-based)\n    - [Non-LLM-based (Clip\u002FT5)](#non-llm-based-clipt5)\n    - [Datasets](#datasets)\n  - [Video Generation](#video-generation)\n    - [🔅 LLM-based](#-llm-based-1)\n    - [Non-LLM-based](#non-llm-based)\n    - [Video VAE\u002FTokenizers](#video-vaetokenizers)\n    - [Audio-Video](#audio-video)\n    - [Benchmarks](#benchmarks)\n    - [Datasets](#datasets-1)\n  - [3D Generation](#3d-generation)\n    - [🔅 LLM-based](#-llm-based-2)\n    - [Non-LLM-based (Clip\u002FT5)](#non-llm-based-clipt5-1)\n    - [Datasets](#datasets-2)\n  - [Audio Generation](#audio-generation)\n    - [🔅 LLM-based](#-llm-based-3)\n    - [Non-LLM-based](#non-llm-based-1)\n    - [Datasets](#datasets-3)\n  - [Generation with Multiple Modalities](#generation-with-multiple-modalities)\n    - [🔅 LLM-based](#-llm-based-4)\n    - [Non-LLM-based](#non-llm-based-2)\n- [📍 Multimodal Editing](#-multimodal-editing)\n  - [Image Editing](#image-editing)\n    - [🔅 LLM-based](#-llm-based-5)\n    - [Non-LLM-based (Clip\u002FT5)](#non-llm-based-clipt5-2)\n  - [Video Editing](#video-editing)\n    - [🔅 LLM-based](#-llm-based-6)\n    - [Non-LLM-based (Clip\u002FT5)](#non-llm-based-clipt5-3)\n  - [3D Editing](#3d-editing)\n    - [🔅 LLM-based](#-llm-based-7)\n    - [Non-LLM-based (Clip\u002FT5)](#non-llm-based-clipt5-4)\n  - [Audio Editing](#audio-editing)\n    - [🔅 LLM-based](#-llm-based-8)\n    - [Non-LLM-based (Clip\u002FT5)](#non-llm-based-clipt5-5)\n- [📍 Multimodal Agents](#-multimodal-agents)\n- [📍 Multimodal Understanding with LLMs](#-multimodal-understanding-with-llms)\n  - [Multiple modalities](#multiple-modalities)\n  - [Image Understanding](#image-understanding)\n  - [Video Understanding](#video-understanding)\n  - [3D Understanding](#3d-understanding)\n  - [Audio Understanding](#audio-understanding)\n- [📍 Multimodal LLM Safety](#-multimodal-llm-safety)\n  - [Attack](#attack)\n  - [Defense and Detect](#defense-and-detect)\n  - [Alignment](#alignment)\n  - [Datasets](#datasets-4)\n  - [3D, Video and Audio Safety](#3d-video-and-audio-safety)\n- [📍 Related Surveys](#-related-surveys)\n  - [LLM](#llm)\n  - [Vision](#vision)\n- [👨‍💻 Team](#-team)\n- [😉 Citation](#-citation)\n- [⭐️ Star History](#️-star-history)\n\n\n\n# 💘 Tips\n- **✅ Paper searching via catalogue**: directly click the content of the catalogue to select the area of your research and browse related papers.\n- **✅ Paper searching via author name**: Feel free to search papers of a specific author via `ctrl + F` and then type the author name. The dropdown list of authors will automatically expand when searching.\n- **✅ Paper searching via tag**: You can also search the related papers via the following tags: `customization`, `interactive`, `human motion generation`, `tokenizer`. 
(More tags are ongoing)\n\n\n# 📍 Multimodal Generation\n\n## Image Generation\n\n### 🔅 LLM-based\n\n\n+ **I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models** (12 Feb 2025)\u003Cdetails>\u003Csummary>Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, et al.\u003C\u002Fsummary>Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10458)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiZhenxing\u002FThinkDiff.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMiZhenxing\u002FThinkDiff?tab=readme-ov-file)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmizhenxing.github.io\u002FThinkDiff\u002F)\n\n+ **MetaMorph: Multimodal Understanding and Generation via Instruction Tuning** (18 Dec 2024)\u003Cdetails>\u003Csummary>Shengbang Tong, David Fan, Jiachen Zhu, et al.\u003C\u002Fsummary>Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14164v1)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ftsb0601.github.io\u002Fmetamorph\u002F)\n\n+ **X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models** (2 Dec 2024)\u003Cdetails>\u003Csummary>Zeyi Sun, Ziyang Chu, Pan Zhang, et al.\u003C\u002Fsummary>Zeyi Sun, Ziyang Chu, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuanjun Xiong, Dahua Lin, Jiaqi Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.01824)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSunzeY\u002FX-Prompt.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FSunzeY\u002FX-Prompt)\n\n\n+ **Cosmos Tokenizer: A suite of image and video neural tokenizers** (06 Nov 2024)\u003Cdetails>\u003Csummary>Fitsum Reda, Jinwei Gu, Xian Liu et al.\u003C\u002Fsummary>Fitsum Reda, Jinwei Gu, Xian Liu, Songwei Ge, Ting-Chun Wang, Haoxiang Wang, Ming-Yu Liu\u003C\u002Fdetails>\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FCosmos-Tokenizer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FCosmos-Tokenizer?tab=readme-ov-file)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fdir\u002Fcosmos-tokenizer\u002F)`tokenizer`\n\n\n\n+ **[ICLR 2025 Spotlight] Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance** (29 Oct 2024)\u003Cdetails>\u003Csummary>Dongmin Park, Sebin Kim, Taehong Moon et al.\u003C\u002Fsummary>Dongmin Park, Sebin Kim, Taehong Moon, Minkyu Kim, Kangwook Lee, Jaewoong 
Cho\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22376)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkrafton-ai\u002FRare-to-Frequent.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fkrafton-ai\u002FRare-to-Frequent)\n\n\n+ **ElasticTok: Adaptive Tokenization for Image and Video** (10 Oct 2024)\u003Cdetails>\u003Csummary>Wilson Yan, Matei Zaharia, Volodymyr Mnih et al.\u003C\u002Fsummary>Wilson Yan, Matei Zaharia, Volodymyr Mnih, Pieter Abbeel, Aleksandra Faust, Hao Liu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08368)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLargeWorldModel\u002FElasticTok.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLargeWorldModel\u002FElasticTok)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flargeworldmodel.github.io\u002Felastictok\u002F)`tokenizer`\n\n+ **DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation** (10 Oct 2024)\u003Cdetails>\u003Csummary>Jiatao Gu, Yuyang Wang, Yizhe Zhang et al.\u003C\u002Fsummary>Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Josh Susskind, Shuangfei Zhai\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08159)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=61bfd967a6ba21e50276e52f353fa74dd68990a6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FDART%3A-Denoising-Autoregressive-Transformer-for-Gu-Wang\u002F61bfd967a6ba21e50276e52f353fa74dd68990a6)\n\n\n\n+ **VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation** (6 Sep 2024)\u003Cdetails>\u003Csummary>Yecheng Wu, Zhuoyang Zhang, Junyu Chen et al.\u003C\u002Fsummary>Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.04429)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fvila-u.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fvila-u)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fhanlab.mit.edu\u002Fprojects\u002Fvila-u)\n\n\n+ **OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation** (13 Jun 2024)\u003Cdetails>\u003Csummary>Junke Wang, Yi Jiang, Zehuan Yuan et al.\u003C\u002Fsummary>Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, Yu-Gang 
Jiang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09399)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=8613c1081a6ab34e2f980e35c06a1af461d7314e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FOmniTokenizer%3A-A-Joint-Image-Video-Tokenizer-for-Wang-Jiang\u002F8613c1081a6ab34e2f980e35c06a1af461d7314e)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFoundationVision\u002FOmniTokenizer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FFoundationVision\u002FOmniTokenizer)\n`tokenizer`\n\n+ **InstantUnify: Integrates Multimodal LLM into Diffusion Models** (Aug 2024)\u003Cdetails>\u003Csummary>Qixun Wang, Xu Bai, Rui Wang et al.\u003C\u002Fsummary>Qixun Wang, Xu Bai, Rui Wang, Haofan Wang\u003C\u002Fdetails>\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinstantX-research\u002FInstantUnify.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FinstantX-research\u002FInstantUnify)\n\n\n+ **Show-o: One Single Transformer to Unify Multimodal Understanding and Generation** (22 Aug 2024)\u003Cdetails>\u003Csummary>Jinheng Xie, Weijia Mao, Zechen Bai, et al.\u003C\u002Fsummary>Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.12528)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-49-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FShow-o%3A-One-Single-Transformer-to-Unify-Multimodal-Xie-Mao\u002F9dc337bd18bc0201839942931b12aff8ec1c93f5)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002FShow-o.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FShow-o)\n\n+ **Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions** (11 Jun 2024)\u003Cdetails>\u003Csummary>Renjie Pi, Jianshu Zhang, Jipeng Zhang et al.\u003C\u002Fsummary> Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, Tong Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07502)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F91b4f447bb06d081a7947b42df57491a04fa46f9)\n\n\n\n+ **T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text** (11 Jun 2024)\u003Cdetails>\u003Csummary>[ACL 2024] Aoxiong Yin, Haoyuan Li, Kai Shen et al.\u003C\u002Fsummary> Aoxiong Yin, Haoyuan Li, Kai Shen, Siliang Tang, Yueting Zhuang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07119)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F186910d697bf7eb605aa055aee78fd91ce3ce9fe)\n\n\n+ **Open-World Human-Object 
Interaction Detection via Multi-modal Prompts** (11 Jun 2024)\u003Cdetails>\u003Csummary>Jie Yang, Bingliang Li, Ailing Zeng et al.\u003C\u002Fsummary>Jie Yang, Bingliang Li, Ailing Zeng, Lei Zhang, Ruimao Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07119)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F186910d697bf7eb605aa055aee78fd91ce3ce9fe)\n\n\n+ **Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?** (11 Jun 2024)\u003Cdetails>\u003Csummary>Xingyu Fu, Muyu He, Yujie Lu et al.\u003C\u002Fsummary>Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, Dan Roth\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07221v1)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff0acf2a2293d963c3786e83bb198c75612adc446)\n\n\n\n+ **An Image is Worth 32 Tokens for Reconstruction and Generation** (11 Jun 2024)\u003Cdetails>\u003Csummary>Qihang Yu, Mark Weber, Xueqing Deng et al.\u003C\u002Fsummary> Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07550)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1e31f4a4ccfc0d1e461be05361d77b5e045f4d37)\n\n\n+ **TRINS: Towards Multimodal Language Models that Can Read** (10 Jun 2024)\u003Cdetails>\u003Csummary>[CVPR 2024] Ruiyi Zhang, Yanzhe Zhang, Jian Chen et al.\u003C\u002Fsummary> Ruiyi Zhang, Yanzhe Zhang, Jian Chen, Yufan Zhou, Jiuxiang Gu, Changyou Chen, Tong Sun\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.06730)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F784260e953b2ce34df52086804681e35d57e5843)\n\n+ **[LlamaGen] Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation** (10 Jun 2024)\u003Cdetails>\u003Csummary>Peize Sun, Yi Jiang, Shoufa Chen et al.\u003C\u002Fsummary>Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan 
Yuan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.06525)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-42-blue.svg?paper=b15e6e2b1d81bc110f8fc98c3caf2e25e2512539)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FAutoregressive-Model-Beats-Diffusion%3A-Llama-for-Sun-Jiang\u002Fb15e6e2b1d81bc110f8fc98c3caf2e25e2512539)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFoundationVision\u002FLlamaGen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FFoundationVision\u002FLlamaGen)\n\n\n\n\u003C!-- + **Genie: Generative Interactive Environments** (26 Feb 2024)\u003Cdetails>\u003Csummary>Jake Bruce, Michael Dennis, Ashley Edwards, et al.\u003C\u002Fsummary> Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, Tim Rocktäschel\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.15391v1) -->\n\n+ **Chameleon: Mixed-Modal Early-Fusion Foundation Models** (16 May 2024)\u003Cdetails>\u003Csummary>Chameleon Team\u003C\u002Fsummary>\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.09818)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=32112b798f70faab00e14806f51d46058cf5e597)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FChameleon%3A-Mixed-Modal-Early-Fusion-Foundation-Team\u002F32112b798f70faab00e14806f51d46058cf5e597?utm_source=direct_link)\n\n+ **SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation** (22 Apr 2024)\u003Cdetails>\u003Csummary>Yuying Ge, Sijie Zhao, Jinguo Zhu, et al.\u003C\u002Fsummary>Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14396)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-44-blue.svg?paper=9259b476c31ba52b7e9ed059e5fbce2125092738)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FSEED-X%3A-Multimodal-Models-with-Unified-and-Ge-Zhao\u002Fef03c907ee24e2e6c0f2d24639551c82862d1080)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FSEED-X.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FSEED-X)\n\n\n+ **Graphic Design with Large Multimodal Model** (22 Apr 2024)\u003Cdetails>\u003Csummary>Yutao Cheng, Zhao Zhang, Maoke Yang, et al.\u003C\u002Fsummary> Yutao Cheng, Zhao Zhang, Maoke Yang, Hui Nie, Chunyuan Li, Xinglong Wu, and Jie 
Shao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14368)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=9259b476c31ba52b7e9ed059e5fbce2125092738)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9259b476c31ba52b7e9ed059e5fbce2125092738)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgraphic-design-ai\u002Fgraphist.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgraphic-design-ai\u002Fgraphist)\n\n\n+ **PMG : Personalized Multimodal Generation with Large Language Models** (7 Apr 2024)\u003Cdetails>\u003Csummary>Xiaoteng Shen, Rui Zhang, Xiaoyan Zhao, et al.\u003C\u002Fsummary>Xiaoteng Shen, Rui Zhang, Xiaoyan Zhao, Jieming Zhu, Xi Xiao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.08677)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=cfb9eba1b5c55bb0052df41eaaff8716f9c420bd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FPMG-%3A-Personalized-Multimodal-Generation-with-Large-Shen-Zhang\u002Fcfb9eba1b5c55bb0052df41eaaff8716f9c420bd)\n\n\n+ **MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control** (19 Mar 2024)\u003Cdetails>\u003Csummary>Enshen Zhou, Yiran Qin, Zhenfei Yin, et al.\u003C\u002Fsummary>Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.12037)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=ae06df762adcc4221162e83a737ea63cff47e65d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fae06df762adcc4221162e83a737ea63cff47e65d)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZhoues\u002FMineDreamer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FZhoues\u002FMineDreamer)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsites.google.com\u002Fview\u002Fminedreamer\u002Fmain)\n\n\n\n+ **ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment** (8 Mar 2024)\u003Cdetails>\u003Csummary>Xiwei Hu, Rui Wang, Yixiao Fang, et al.\u003C\u002Fsummary> Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, Gang Yu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05135)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fda0d382c7fa981ba185ca633868442b75cb76de6)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FELLA-Diffusion\u002FELLA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FELLA-Diffusion\u002FELLA)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fella-diffusion.github.io\u002F)\n\n\n+ **StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis** (30 Jan 2024)\u003Cdetails>\u003Csummary>Zecheng Tang, Chenfei Wu, Zekai Zhang, et al.\u003C\u002Fsummary>Zecheng Tang, Chenfei Wu, Zekai Zhang, Mingheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, 
Lijuan Wang, Zicheng Liu, Juntao Li, Nan Duan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.17093)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=b2f6830afe63eb477294f17f0d3a6923135950f9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FStrokeNUWA%3A-Tokenizing-Strokes-for-Vector-Graphic-Tang-Wu\u002Fb2f6830afe63eb477294f17f0d3a6923135950f9)\n`tokenizer`\n\n+ **DiffusionGPT: LLM-Driven Text-to-Image Generation System** (18 Jan 2024)\u003Cdetails>\u003Csummary>Jie Qin, Jie Wu, Weifeng Chen, et al.\u003C\u002Fsummary> Jie Qin, Jie Wu, Weifeng Chen, Yuxi Ren, Huixia Li, Hefeng Wu, Xuefeng Xiao, Rui Wang, Shilei Wen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10061)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=d4b1a1c62a03ccffcf24983eb4fe22335cbb89b6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd4b1a1c62a03ccffcf24983eb4fe22335cbb89b6)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDiffusionGPT\u002FDiffusionGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FDiffusionGPT\u002FDiffusionGPT)\n\n\n+ **StarVector: Generating Scalable Vector Graphics Code from Images** (17 Dec 2023)\u003Cdetails>\u003Csummary>Juan A. Rodriguez, Shubham Agarwal, Issam H. Laradji, et al.\u003C\u002Fsummary> Juan A. Rodriguez, Shubham Agarwal, Issam H. Laradji, Pau Rodriguez, David Vazquez, Christopher Pal, Marco Pedersoli\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.11556)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=60d3ade5c0085f5de1f5ab944cc058c78706ac66)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F60d3ade5c0085f5de1f5ab944cc058c78706ac66)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjoanrod\u002Fstar-vector.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fjoanrod\u002Fstar-vector)\n\n\n+ **VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation** (14 Dec 2023)\u003Cdetails>\u003Csummary>Jinguo Zhu, Xiaohan Ding, Yixiao Ge, et al.\u003C\u002Fsummary> Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.09251)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=ea6982a936a2b263bbf46ff6eb27fc0b63fddaf7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fea6982a936a2b263bbf46ff6eb27fc0b63fddaf7)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FVL-GPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FVL-GPT)\n\n\n+ **StoryGPT-V: Large Language Models as Consistent Story Visualizers** (13 Dec 2023)\u003Cdetails>\u003Csummary>Xiaoqian Shen, Mohamed Elhoseiny\u003C\u002Fsummary> Xiaoqian Shen, Mohamed 
Elhoseiny\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.02252)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=e49cb2ab3a7990e3d05042197ae8b3fd934453de)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe49cb2ab3a7990e3d05042197ae8b3fd934453de)\n\n\n\n+ **GENIXER: Empowering Multimodal Large Language Models as a Powerful Data Generator** (11 Dec 2023)\u003Cdetails>\u003Csummary>Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou\u003C\u002Fsummary> Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou\u003C\u002Fdetails> \n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06731)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=cb2295766b2f8f35524f6a9f93ae39d948d50bd4)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcb2295766b2f8f35524f6a9f93ae39d948d50bd4)\n\n\n+ **Customization Assistant for Text-to-image Generation** (5 Dec 2023)\u003Cdetails>\u003Csummary>Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, et al.\u003C\u002Fsummary> Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Tong Sun\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03045)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=f30bb09dbd95845d792bdac217a9a652635ee8a5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff30bb09dbd95845d792bdac217a9a652635ee8a5)`customization`\n\n\n\n+ **ChatIllusion: Efficient-Aligning Interleaved Generation ability with Visual Instruction Model** (29 Nov 2023) \u003Cdetails>\u003Csummary>Xiaowei Chi, Yijiang Liu, Zhengkai Jiang, et al.\u003C\u002Fsummary> Xiaowei Chi, Yijiang Liu, Zhengkai Jiang, Rongyu Zhang, Ziyi Lin, Renrui Zhang, Peng Gao, Chaoyou Fu, Shanghang Zhang, Qifeng Liu, Yike Guo\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17963)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=22d55c52f43f59634586ab95fefbb7dba8c8b190)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F22d55c52f43f59634586ab95fefbb7dba8c8b190)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flitwellchi\u002FChatIllusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Flitwellchi\u002FChatIllusion)\n\n\n+ **DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback** (29 Nov 2023) \u003Cdetails>\u003Csummary>Jiao Sun, Deqing Fu, Yushi Hu, et al.\u003C\u002Fsummary>Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, Cyrus Rashtchian\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17946)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=d16f72b7be526dee5eb49e5afffeea2bddba5e66)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd16f72b7be526dee5eb49e5afffeea2bddba5e66)\n\n\n\n\n+ **COLE: A Hierarchical Generation Framework for Graphic Design** (28 Nov 2023) \u003Cdetails>\u003Csummary>Peidong Jia, Chenxuan Li, Zeyu Liu, et al.\u003C\u002Fsummary>Peidong Jia, Chenxuan Li, Zeyu Liu, Yichao Shen, Xingru Chen, 
Yuhui Yuan, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, Shanghang Zhang, Baining Guo\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16974)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=8441c30ad4abdca9ee380aa6f22ffd731b10231b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8441c30ad4abdca9ee380aa6f22ffd731b10231b)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgraphic-design-generation.github.io\u002F)\n\n\n\n+ **TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering** (28 Nov 2023) \u003Cdetails>\u003Csummary>Jingye Chen, Yupan Huang, Tengchao Lv, et al.\u003C\u002Fsummary>Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16465)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=1c6e2a4da1ead685a95c079751bf4d7a727d8180)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1c6e2a4da1ead685a95c079751bf4d7a727d8180)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjingyechen.github.io\u002Ftextdiffuser2\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Ftextdiffuser-2)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FJingyeChen22\u002FTextDiffuser-2)\n\n+ **LLMGA: Multimodal Large Language Model based Generation Assistant** (27 Nov 2023)\u003Cdetails>\u003Csummary>Bin Xia, Shiyin Wang, Yingfan Tao, et al.\u003C\u002Fsummary> Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, Jiaya Jia\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16500)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=769a924d0af014acec326f50c15c5d70d258a969)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F769a924d0af014acec326f50c15c5d70d258a969)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLMGA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLMGA)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllmga.github.io\u002F)\n\n\n\n\n\n+ **Self-correcting LLM-controlled Diffusion Models** (27 Nov 2023)\u003Cdetails>\u003Csummary>Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, et al.\u003C\u002Fsummary> Tsung-Han Wu, Long Lian, Joseph E. 
Gonzalez, Boyi Li, Trevor Darrell\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16090)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=42c4315b5d2e33d7d9a0afdf84e6a47ccd7a700e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F42c4315b5d2e33d7d9a0afdf84e6a47ccd7a700e)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftsunghan-wu\u002FSLD.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ftsunghan-wu\u002FSLD)\n\n\n+ **[ParaDiffusion] Paragraph-to-Image Generation with Information-Enriched Diffusion Model** (29 Nov 2023)\u003Cdetails>\u003Csummary>Weijia Wu, Zhuang Li, Yefei He, et al.\u003C\u002Fsummary>Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.14284)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-19-blue.svg?paper=ed4f1b0f6c09f59d07e817e532a25f4d25e94dbc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fed4f1b0f6c09f59d07e817e532a25f4d25e94dbc)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fweijiawu\u002FParaDiffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fweijiawu\u002FParaDiffusion)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fweijiawu.github.io\u002FParaDiffusionPage\u002F)\n\n\n+ **Tokenize and Embed ALL for Multi-modal Large Language Models** (8 Nov 2023)\u003Cdetails>\u003Csummary>Zhen Yang, Yingxue Zhang, Fandong Meng, et al.\u003C\u002Fsummary> Zhen Yang, Yingxue Zhang, Fandong Meng, Jie Zhou\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.04589)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=59d716b442ab760a78f58de6748c0fa1d507bfc1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F59d716b442ab760a78f58de6748c0fa1d507bfc1)\n`tokenizer`\n\n\n+ **WordArt Designer: User-Driven Artistic Typography Synthesis using Large Language Models** (20 Oct 2023)\u003Cdetails>\u003Csummary>Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, et al.\u003C\u002Fsummary> Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, Jingdong Sun, Wangmeng Xiang, Xianhui Lin, Xiaoyang Kang, Zengke Jin, Yusen Hu, Bin Luo, Yifeng Geng, Xuansong Xie, Jingren Zhou\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.18332)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=58b77dc0603eb52559d98a383bf9649fd31d0bc5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F58b77dc0603eb52559d98a383bf9649fd31d0bc5)\n\n\n+ **LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts** (16 Oct 2023)\u003Cdetails>\u003Csummary>[ICLR 2024] Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, et al.\u003C\u002Fsummary>Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter 
Wonka\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.10640)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=4cb2c262ce34f41974f1b1623fc5a6e32956ded3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4cb2c262ce34f41974f1b1623fc5a6e32956ded3)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhananshafi\u002Fllmblueprint.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhananshafi\u002Fllmblueprint)\n\n\n\n+ **Making Multimodal Generation Easier: When Diffusion Models Meet LLMs** (13 Oct 2023)\u003Cdetails>\u003Csummary>Xiangyu Zhao, Bo Liu, Qijiong Liu, et al.\u003C\u002Fsummary>Xiangyu Zhao, Bo Liu, Qijiong Liu, Guangyuan Shi, Xiao-Ming Wu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08949v1)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=833cdd713c27ab5899bb912a1d511c10af61cefb)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F833cdd713c27ab5899bb912a1d511c10af61cefb)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzxy556677\u002FEasyGen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzxy556677\u002FEasyGen)\n\n\n+ **Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation** (12 Oct 2023)\u003Cdetails>\u003Csummary>Zhengyuan Yang, Jianfeng Wang, Linjie Li, et al.\u003C\u002Fsummary>Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08541)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=1d14a708622917da4b9820ada6d32af24fc1651a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1d14a708622917da4b9820ada6d32af24fc1651a)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fidea2img.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzyang-ur\u002FIdea2Img.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzyang-ur\u002FIdea2Img)\n\n\n+ **OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation** (11 Oct 2023)\u003Cdetails>\u003Csummary>Jie An, Zhengyuan Yang, Linjie Li, et al.\u003C\u002Fsummary>Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, Jiebo Luo\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07749)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=7f1ba5630c3baa09b11cc665b3f71cdb117e5ffb)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7f1ba5630c3baa09b11cc665b3f71cdb117e5ffb)\n\n\n\n+ **Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models** (11 Oct 2023)\u003Cdetails>\u003Csummary>Zeqiang Lai, Xizhou Zhu, Jifeng Dai, et al.\u003C\u002Fsummary>Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai 
Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07653)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=f669d7a6fab0147253178a6fc854e05e3d92fb3f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff669d7a6fab0147253178a6fc854e05e3d92fb3f)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fminidalle3.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZeqiang-Lai\u002FMini-DALLE3.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FZeqiang-Lai\u002FMini-DALLE3)\n\n\n+ **[DALL-E 3] Improving Image Generation with Better Captions** \u003Cdetails>\u003Csummary>James Betker, Gabriel Goh, Li Jing, et al.\u003C\u002Fsummary>James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, Aditya Ramesh\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fdall-e-3.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-146-blue.svg?paper=cfee1826dd4743eab44c6e27a0cc5970effa4d80)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcfee1826dd4743eab44c6e27a0cc5970effa4d80)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fopenai.com\u002Fdall-e-3)\n\n\n+ **MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens** (3 Oct 2023)\\\nKaizhi Zheng, Xuehai He, Xin Eric Wang.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02239)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-26-blue.svg?paper=e7d09b6f2bc878cf2c993acf675f409d0b55f35a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe7d09b6f2bc878cf2c993acf675f409d0b55f35a)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Feric-ai-lab.github.io\u002Fminigpt-5.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Feric-ai-lab\u002FMiniGPT-5.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Feric-ai-lab\u002FMiniGPT-5)\n\n\n+ **Making LLaMA SEE and Draw with SEED Tokenizer** (2 Oct 2023)\u003Cdetails>\u003Csummary>Yuying Ge, Sijie Zhao, Ziyun Zeng, et al.\u003C\u002Fsummary>Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying 
Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01218)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-19-blue.svg?paper=5ba1525dc6d382ee0a4a1ca3c64fc5907ca64c67)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5ba1525dc6d382ee0a4a1ca3c64fc5907ca64c67)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002Fseed\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FSEED.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FSEED)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fdad1ed9a9fb76fe83b.gradio.live\u002F)\n`tokenizer`\n\n\n+ **InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists** (30 Sep 2023)\u003Cdetails>\u003Csummary>Yulu Gan, Sungwoo Park, Alexander Schubert, et al.\u003C\u002Fsummary>Yulu Gan, Sungwoo Park, Alexander Schubert, Anthony Philippakis, Ahmed M. Alaa\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00390)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=819f477065088220a6f706cd9ef76dbcb4b4c134)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F819f477065088220a6f706cd9ef76dbcb4b4c134)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlaaLab\u002FInstructCV.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAlaaLab\u002FInstructCV)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Falaa-lab\u002FInstructCV)\n\n+ **InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition** (26 Sep 2023)\u003Cdetails>\u003Csummary>Pan Zhang, Xiaoyi Dong, Bin Wang, et al.\u003C\u002Fsummary> Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15112)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-48-blue.svg?paper=c1e450284e7d6cac1855330a1197df8537df653f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc1e450284e7d6cac1855330a1197df8537df653f)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer)\n\n\n\n\n+ **Text-to-Image Generation for Abstract Concepts** (26 Sep 2023) \u003Cdetails>\u003Csummary>Jiayi Liao, Xu Chen, Qiang Fu, et al.\u003C\u002Fsummary>Jiayi Liao, Xu Chen, Qiang Fu, Lun Du, Xiangnan He, Xiang Wang, Shi Han, Dongmei 
Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.14623)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=0d38f1edac66b4645cf5fa05abaf9d92cba5d5d3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0d38f1edac66b4645cf5fa05abaf9d92cba5d5d3)\n\n\n\n\n+ **DreamLLM: Synergistic Multimodal Comprehension and Creation** (20 Sep 2023)\u003Cdetails>\u003Csummary>[ICLR 2024] Runpei Dong, Chunrui Han, Yuang Peng, et al.\u003C\u002Fsummary>Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, Li Yi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.11499)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-38-blue.svg?paper=7b689adb8c156d6158660f90d1c86888ee281f63)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7b689adb8c156d6158660f90d1c86888ee281f63)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdreamllm.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRunpeiDong\u002FDreamLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FRunpeiDong\u002FDreamLLM)\n\n\n+ **SwitchGPT: Adapting Large Language Models for Non-Text Outputs** (14 Sep 2023)\\\nWang, Xinyu, Bohan Zhuang, and Qi Wu.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.07623)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=366564d210768814bc880e391b909cfbd95f8964)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F366564d210768814bc880e391b909cfbd95f8964)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxinke-wang\u002FSwitchGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fxinke-wang\u002FSwitchGPT)\n\n+ **NExT-GPT: Any-to-Any Multimodal LLM** (11 Sep 2023)\u003Cdetails>\u003Csummary>Shengqiong Wu, Hao Fei, Leigang Qu, et al.\u003C\u002Fsummary>Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05519)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-94-blue.svg?paper=fa75a55760e6ea49b39b83cb85c99a22e1088254)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffa75a55760e6ea49b39b83cb85c99a22e1088254)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fnext-gpt.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNExT-GPT\u002FNExT-GPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNExT-GPT\u002FNExT-GPT)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002F9704af1b453125102e.gradio.live\u002F)\n\n+ **LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation** (9 Aug 2023)\u003Cdetails>\u003Csummary>Leigang Qu, Shengqiong Wu, Hao Fei, et al. 
ACM MM 2023\u003C\u002Fsummary>Leigang Qu, Shengqiong Wu, Hao Fei, Liqiang Nie, Tat-Seng Chua\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.05095)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=7d78238a9bad60433d616abdd93c735087d99670)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7d78238a9bad60433d616abdd93c735087d99670)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flayoutllm-t2i.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLayoutLLM-T2I\u002FLayoutLLM-T2I.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLayoutLLM-T2I\u002FLayoutLLM-T2I)\n\n+ **Planting a SEED of Vision in Large Language Model** (16 Jul 2023)\u003Cdetails>\u003Csummary>Yuying Ge, Yixiao Ge, Ziyun Zeng, et al.\u003C\u002Fsummary>Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.08041)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-30-blue.svg?paper=40298b8d50109c52fc10763eddc64a07cf8acb31)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F40298b8d50109c52fc10763eddc64a07cf8acb31)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002Fseed\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FSEED.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FSEED)\n\n+ **Generative Pretraining in Multimodality** (11 Jul 2023)\u003Cdetails>\u003Csummary>Quan Sun, Qiying Yu, Yufeng Cui, et al.\u003C\u002Fsummary>Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.05222)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-43-blue.svg?paper=94053805cd59f2e9a47fe3f080c7e7afefb337cc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F94053805cd59f2e9a47fe3f080c7e7afefb337cc)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](http:\u002F\u002F218.91.113.230:9002)\n\n+ **SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs** (30 Jun 2023) \u003Cdetails>\u003Csummary>[NeurIPS 2023 Spotlight] Lijun Yu, Yong Cheng, Zhiruo Wang, et al.\u003C\u002Fsummary>Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. 
Hauptmann, Lu Jiang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.17842)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=376f494126d1ea4f571ea0263c43ac2b6331800a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F376f494126d1ea4f571ea0263c43ac2b6331800a)\n\n+ **Controllable Text-to-Image Generation with GPT-4** (29 May 2023) \u003Cdetails>\u003Csummary>Tianjun Zhang, Yi Zhang, Vibhav Vineet, et al.\u003C\u002Fsummary>Tianjun Zhang, Yi Zhang, Vibhav Vineet, Neel Joshi, Xin Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18583)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-20-blue.svg?paper=3a79545719fb193a6b4042ef7d1d87cfd267be06)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3a79545719fb193a6b4042ef7d1d87cfd267be06)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002Ftianjunz\u002FControl-GPT) \n\n+ **Generating Images with Multimodal Language Models** (26 May 2023)\\\n[NeurIPS 2023] Koh, Jing Yu, Daniel Fried, and Ruslan Salakhutdinov. \\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17216)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-73-blue.svg?paper=6fb5c0eff3696ef252aca9638e10176ecce7cecb)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6fb5c0eff3696ef252aca9638e10176ecce7cecb)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjykoh.com\u002Fgill)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkohjingyu\u002Fgill.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fkohjingyu\u002Fgill)\n\n+ **LayoutGPT: Compositional Visual Planning and Generation with Large Language Models** (24 May 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Weixi Feng, Wanrong Zhu, Tsu-jui Fu, et al.\u003C\u002Fsummary>Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.15393)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-31-blue.svg?paper=66d755730f5d08a6f4fcc5e81f24982ba389dca9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F66d755730f5d08a6f4fcc5e81f24982ba389dca9)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flayoutgpt.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fweixi-feng\u002FLayoutGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fweixi-feng\u002FLayoutGPT)\n\n+ **Visual Programming for Text-to-Image Generation and Evaluation** (24 May 2023)\\\n[NeurIPS 2023] Jaemin Cho, Abhay Zala, Mohit 
Bansal.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.15328)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-24-blue.svg?paper=9837349417e36ef5be06da0fd6c74042148bdaa2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9837349417e36ef5be06da0fd6c74042148bdaa2)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvp-t2i.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fj-min\u002FVPGen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fj-min\u002FVPGen)\n\n+ **LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models** (23 May 2023) \u003Cdetails>\u003Csummary>Long Lian, Boyi Li, Adam Yala, et al.\u003C\u002Fsummary>Long Lian, Boyi Li, Adam Yala, Trevor Darrell\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13655)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-39-blue.svg?paper=e9ae0c76a71b8f302eb17b1c4462b9cc97d87cd0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe9ae0c76a71b8f302eb17b1c4462b9cc97d87cd0)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllm-grounded-diffusion.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTonyLianLong\u002FLLM-groundedDiffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTonyLianLong\u002FLLM-groundedDiffusion)\n\n+ **Interactive Data Synthesis for Systematic Vision Adaptation via LLMs-AIGCs Collaboration** (22 May 2023)\u003Cdetails>\u003Csummary>Qifan Yu, Juncheng Li, Wentao Ye, et al.\u003C\u002Fsummary>Qifan Yu, Juncheng Li, Wentao Ye, Siliang Tang, Yueting Zhuang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.12799)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=43a55dbd95c9d5cd82de8db276f41adeec4a937d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F43a55dbd95c9d5cd82de8db276f41adeec4a937d)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuqifan1117\u002FLabal-Anything-Pipeline.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYuqifan1117\u002FLabal-Anything-Pipeline)\n\n+ **LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation** (18 May 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Yujie Lu, Xianjun Yang, Xiujun Li, et al.\u003C\u002Fsummary>Yujie Lu, Xianjun Yang, Xiujun Li, Xin Eric Wang, William Yang Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11116)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=972501b057e2b84d6ce6506f70bcac697bab7872)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F972501b057e2b84d6ce6506f70bcac697bab7872)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYujieLu10\u002FLLMScore.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYujieLu10\u002FLLMScore)\n\n+ **SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models** (9 May 
2023)\u003Cdetails>\u003Csummary>[ACM MM 2023] Shanshan Zhong, Zhongzhan Huang, Wushao Wen, et al.\u003C\u002Fsummary>Shanshan Zhong, Zhongzhan Huang, Wushao Wen, Jinghui Qin, Liang Lin\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.05189)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQrange-group\u002FSUR-adapter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FQrange-group\u002FSUR-adapter)\n\n+ **Grounding Language Models to Images for Multimodal Inputs and Outputs** (31 Jan 2023)\\\n[ICML 2023] Koh, Jing Yu, Ruslan Salakhutdinov, and Daniel Fried.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.13823)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-35-blue.svg?paper=6173520a1eb2814d067e8c5fd16212b7cbf6ee78)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6173520a1eb2814d067e8c5fd16212b7cbf6ee78)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjykoh.com\u002Ffromage)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkohjingyu\u002Ffromage.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fkohjingyu\u002Ffromage)\n\n+ **[RPG-DiffusionMaster] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs** (22 Jan 2024) \u003Cdetails>\u003Csummary>[ICML 2024] Ling Yang, Zhaochen Yu, Chenlin Meng, et al.\u003C\u002Fsummary>Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.11708)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-23-blue.svg?paper=140cfda71bfff852c3e205b7ad61854b78c76982)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F140cfda71bfff852c3e205b7ad61854b78c76982)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYangLing0818\u002FRPG-DiffusionMaster.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYangLing0818\u002FRPG-DiffusionMaster)\n\n+ **RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models** (20 Feb 2024)\u003Cdetails>\u003Csummary>Xinchen Zhang, Ling Yang, Yaqi Cai, et al.\u003C\u002Fsummary>Xinchen Zhang, Ling Yang, Yaqi Cai, Zhaochen Yu, Kai-Ni Wang, Jiake Xie, Ye Tian, Minkai Xu, Yong Tang, Yujiu Yang, Bin Cui\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12908)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=9c2ba04c376f127da506b63c566887fca2861b25)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9c2ba04c376f127da506b63c566887fca2861b25)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcominclip.github.io\u002FRealCompo_Page\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYangLing0818\u002FRealCompo.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYangLing0818\u002FRealCompo)\n\n### Non-LLM-based (Clip\u002FT5)\n+ **Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models** (11 Nov 2024)\u003Cdetails>\u003Csummary>NVIDIA: Yuval Atzmon, Maciej Bala, 
Yogesh Balaji, et al.\u003C\u002Fsummary>NVIDIA: Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, Pooya Jannaty, Tero Karras, Grace Lam, J. P. Lewis, Aaron Licata, Yen-Chen Lin, Ming-Yu Liu, Qianli Ma, Arun Mallya, Ashlee Martino-Tarr, Doug Mendez, Seungjun Nah, Chris Pruett, Fitsum Reda, Jiaming Song, Ting-Chun Wang, Fangyin Wei, Xiaohui Zeng, Yu Zeng, Qinsheng Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.06959)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fdir\u002Fedify-image\u002F)\n\n\n+ **InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation** (3 Apr 2024)\u003Cdetails>\u003Csummary>Haofan Wang, Matteo Spinelli, Qixun Wang, et al.\u003C\u002Fsummary>Haofan Wang, Matteo Spinelli, Qixun Wang, Xu Bai, Zekui Qin, Anthony Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.02733)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=6b5fc164c4f21e4a4f151df60bfd5e32b061a903)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FInstantStyle%3A-Free-Lunch-towards-Style-Preserving-Wang-Spinelli\u002F6b5fc164c4f21e4a4f151df60bfd5e32b061a903)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Finstantstyle.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinstantX-research\u002FInstantStyle.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FinstantX-research\u002FInstantStyle)\n\n+ **InstantID: Zero-shot Identity-Preserving Generation in Seconds** (15 Jan 2024)\u003Cdetails>\u003Csummary>Qixun Wang, Xu Bai, Haofan Wang, et al.\u003C\u002Fsummary>Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, Yao Hu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.07519)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=0f9b66c9208b11369e9d94d85b7dc23bcc5115e9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FInstantID%3A-Zero-shot-Identity-Preserving-Generation-Wang-Bai\u002F0f9b66c9208b11369e9d94d85b7dc23bcc5115e9)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Finstantid.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinstantX-research\u002FInstantID.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FinstantX-research\u002FInstantID)\n\n+ **PIXART-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis** (30 Sep 2023)\u003Cdetails>\u003Csummary>[ICLR 2024] Junsong Chen, Jincheng Yu, Chongjian Ge, et al.\u003C\u002Fsummary>Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo 
Li\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00426)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=7dfe1c9f1d7120102499c7e561efc2326e7a0358)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7dfe1c9f1d7120102499c7e561efc2326e7a0358)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fpixart-alpha.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPixArt-alpha\u002FPixArt-alpha.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPixArt-alpha\u002FPixArt-alpha)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FPixArt-alpha\u002FPixArt-alpha)\n\n+ **TextDiffuser: Diffusion Models as Text Painters** (18 May 2023) \u003Cdetails>\u003Csummary>[NeurIPS 2023] Jingye Chen, Yupan Huang, Tengchao Lv, et al.\u003C\u002Fsummary>Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10855)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=e779781f1bea273573fc9d3f1a5e874bcff2cd2b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe779781f1bea273573fc9d3f1a5e874bcff2cd2b)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjingyechen.github.io\u002Ftextdiffuser\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Ftextdiffuser)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FJingyeChen22\u002FTextDiffuser)\n\n+ **TiGAN: Text-Based Interactive Image Generation and Manipulation** (Dec 2022)\u003Cdetails>\u003Csummary>[AAAI 2022] Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, et al.\u003C\u002Fsummary>Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Chris Tensmeyer, Tong Yu, Changyou Chen, Jinhui Xu, Tong Sun\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fojs.aaai.org\u002Findex.php\u002FAAAI\u002Farticle\u002Fview\u002F20270)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=839dc73c1adae268144d9cfb9d70985b2001304f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F839dc73c1adae268144d9cfb9d70985b2001304f)\nTags: `interactive`\n\n+ **Multi-Concept Customization of Text-to-Image Diffusion** (8 Dec 2022)\u003Cdetails>\u003Csummary>[CVPR 2023] Nupur Kumari, Bingliang Zhang, Richard Zhang, et al.\u003C\u002Fsummary>Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, Jun-Yan 
Zhu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.04488)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-307-blue.svg?paper=144eca44e250cc462f6fc3a172abb865978f66f5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F144eca44e250cc462f6fc3a172abb865978f66f5)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.cs.cmu.edu\u002F~custom-diffusion\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fadobe-research\u002Fcustom-diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fadobe-research\u002Fcustom-diffusion)\\\nTags: `customization`\n\n+ **DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation** (25 Aug 2022)\u003Cdetails>\u003Csummary>[CVPR 2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, et al.\u003C\u002Fsummary>Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.12242)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1090-blue.svg?paper=5b19bf6c3f4b25cac96362c98b930cf4b37f6744)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5b19bf6c3f4b25cac96362c98b930cf4b37f6744)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdreambooth.github.io\u002F)\\\nTags: `customization`\n\n+ **An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion** (2 Aug 2022)\u003Cdetails>\u003Csummary>Rinon Gal, Yuval Alaluf, Yuval Atzmon, et al. \u003C\u002Fsummary>Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. 
Bermano, Gal Chechik, Daniel Cohen-Or\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.01618)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ftextual-inversion.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frinongal\u002Ftextual_inversion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Frinongal\u002Ftextual_inversion)\\\nTags: `customization`\n\n+ **Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding** (23 May 2022)\\\n[NeurIPS 2022] \u003Cdetails>\u003Csummary>Chitwan Saharia, William Chan, Saurabh Saxena, et al.\u003C\u002Fsummary>Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al.\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.11487)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2844-blue.svg?paper=9695824d7a01fad57ba9c01d7d76a519d78d65e7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9695824d7a01fad57ba9c01d7d76a519d78d65e7)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fimagen.research.google\u002F) \n\n+ **High-Resolution Image Synthesis with Latent Diffusion Models** (20 Dec 2021)\\\n[CVPR 2022 (Oral)] \u003Cdetails>\u003Csummary>Robin Rombach, Andreas Blattmann, Dominik Lorenz, et al. 
\u003C\u002Fsummary>Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.10752)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5641-blue.svg?paper=c10075b3746a9f3dd5811970e93c8ca3ad39b39d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc10075b3746a9f3dd5811970e93c8ca3ad39b39d)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fommer-lab.com\u002Fresearch\u002Flatent-diffusion-models\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCompVis\u002Fstable-diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FCompVis\u002Fstable-diffusion)\n\n### Datasets\n\n\n+ **MIMIC-IT: Multi-Modal In-Context Instruction Tuning** (8 Jun 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Bo Li, Yuanhan Zhang, Liangyu Chen, et al.\u003C\u002Fsummary>Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05425)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-78-blue.svg?paper=d47524cd5c3c4b57af2e5a29f6f91c420310f236)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd47524cd5c3c4b57af2e5a29f6f91c420310f236)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLuodian\u002Fotter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLuodian\u002Fotter)\n\n\n\n+ **[LAION-Glyph] GlyphControl: Glyph Conditional Control for Visual Text Generation** (29 May 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Yukang Yang, Dongnan Gui, Yuhui Yuan, et al.\u003C\u002Fsummary>Yukang Yang, Dongnan Gui, Yuhui Yuan, Weicong Liang, Haisong Ding, Han Hu, Kai Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18259)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=5fbe4c92791fbecb179c1ab79bba9a59b2e155ba)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5fbe4c92791fbecb179c1ab79bba9a59b2e155ba)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIGText\u002FGlyphControl-release.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAIGText\u002FGlyphControl-release)\n\n\n\n\n\n+ **[MARIO-10M] TextDiffuser: Diffusion Models as Text Painters** (18 May 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Jingye Chen, Yupan Huang, Tengchao Lv, et al.\u003C\u002Fsummary>Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu 
Wei\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10855)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=e779781f1bea273573fc9d3f1a5e874bcff2cd2b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe779781f1bea273573fc9d3f1a5e874bcff2cd2b)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjingyechen.github.io\u002Ftextdiffuser\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm)\n\n\n\n+ **DataComp: In search of the next generation of multimodal datasets** (27 Apr 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, et al.\u003C\u002Fsummary>Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, Ludwig Schmidt\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.14108)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-103-blue.svg?paper=f9570989919338079088270a9cf1a7afc8db8093)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff9570989919338079088270a9cf1a7afc8db8093)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.datacomp.ai\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmlfoundations\u002Fdatacomp.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fdatacomp)\n\n+ **[LLaVA-Instruct] Visual Instruction Tuning** (17 Apr 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Haotian Liu, Chunyuan Li, Qingyang Wu, et al.\u003C\u002Fsummary>Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08485)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-820-blue.svg?paper=a5036f31f0e629dc661f120b8c3b1f374d479ab8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fa5036f31f0e629dc661f120b8c3b1f374d479ab8)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllava-vl.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaotian-liu\u002FLLaVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA)\n\n\n\n+ **Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text** (14 Apr 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Wanrong Zhu, Jack Hessel, Anas Awadalla, et al.\u003C\u002Fsummary>Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, Yejin 
Choi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.06939)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-64-blue.svg?paper=df958800014d310b6df34ad83d771314d68fbb2d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdf958800014d310b6df34ad83d771314d68fbb2d)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Fmmc4.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fmmc4)\n\n\n\n\n\n+ **Language Is Not All You Need: Aligning Perception with Language Models** (27 Feb 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Shaohan Huang, Li Dong, Wenhui Wang, et al.\u003C\u002Fsummary>Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.14045)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-261-blue.svg?paper=fbfef4723d8c8467d7bd523e1d0b703cce0e0f9c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffbfef4723d8c8467d7bd523e1d0b703cce0e0f9c)\n\n\n\n+ **COYO-700M: Image-Text Pair Dataset** (31 Aug 2022)\\\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkakaobrain\u002Fcoyo-dataset.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fcoyo-dataset)\n\n\n+ **LAION-5B: An open large-scale dataset for training next generation image-text models** (16 Oct 2022)\u003Cdetails>\u003Csummary>[NeurIPS 2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, et al. 
\u003C\u002Fsummary>Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, Jenia Jitsev\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.08402)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1288-blue.svg?paper=e5c8960eb2ec034ffbd353ef39fd1cb541d3c7c9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe5c8960eb2ec034ffbd353ef39fd1cb541d3c7c9)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flaion.ai\u002Fblog\u002Flaion-5b\u002F)\n\n\n\n\n+ **LAION COCO: 600M SYNTHETIC CAPTIONS FROM LAION2B-EN** (15 Sep 2022)\u003Cdetails>\u003Csummary>Christoph Schuhmann, Andreas Köpf, Theo Coombes, et al.\u003C\u002Fsummary>Christoph Schuhmann, Andreas Köpf, Theo Coombes, Richard Vencu, Benjamin Trom, Romain Beaumont\u003C\u002Fdetails>\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flaion.ai\u002Fblog\u002Flaion-coco\u002F)\n\n\n+ **[M3W] Flamingo: a Visual Language Model for Few-Shot Learning** (29 Apr 2022)\u003Cdetails>\u003Csummary>[NeurIPS 2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, et al.\u003C\u002Fsummary>Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.14198)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1483-blue.svg?paper=26218bdcc3945c7edae7aa2adbfba4cd820a2df3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F26218bdcc3945c7edae7aa2adbfba4cd820a2df3)\n\n\n\n\n+ **[LAION-FACE] General Facial Representation Learning in a Visual-Linguistic Manner** (6 Dec 2021)\u003Cdetails>\u003Csummary>[NeurIPS 2021] Yinglin Zheng, Hao Yang, Ting Zhang, et al.\u003C\u002Fsummary>Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, Fang Wen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.03109)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-70-blue.svg?paper=037bab9d26ef7da11ee32d7682836604d2cc8a72)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F037bab9d26ef7da11ee32d7682836604d2cc8a72)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFacePerceiver\u002FFaRL.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FFacePerceiver\u002FFaRL)\n\n\n\n\n+ **[LAION-400M] Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs** (3 Nov 2021)\u003Cdetails>\u003Csummary>[NeurIPS 2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, et al. 
\u003C\u002Fsummary>Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, Aran Komatsuzaki\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.02114)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-756-blue.svg?paper=b668ce936cff0b0ca8b635cd5f25a62eaf4eb3df)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb668ce936cff0b0ca8b635cd5f25a62eaf4eb3df)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flaion.ai\u002Flaion-400-open-dataset\u002F)\n\n\n\n\n+ **WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning** (2 Mar 2021)\u003Cdetails>\u003Csummary>[SIGIR 2021] Krishna Srinivasan, Karthik Raman, Jiecao Chen, et al.\u003C\u002Fsummary>Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, Marc Najork\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.01913)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-186-blue.svg?paper=98e565fa06f6c7bf7c46833b5106b26dc45130c4)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F98e565fa06f6c7bf7c46833b5106b26dc45130c4)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002Fwit)\n\n+ **Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts** (17 Feb 2021)\u003Cdetails>\u003Csummary>[CVPR 2021] Soravit Changpinyo, Piyush Sharma, Nan Ding, et al.\u003C\u002Fsummary>Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.08981)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-617-blue.svg?paper=394be105b87e9bfe72c20efe6338de10604e1a11)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F394be105b87e9bfe72c20efe6338de10604e1a11)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002Fconceptual-12m)\n\n\n+ **[ALIGN] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision** (11 Feb 2021)\u003Cdetails>\u003Csummary>[ICML 2021] Chao Jia, Yinfei Yang, Ye Xia, et al. \u003C\u002Fsummary>Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.05918)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2120-blue.svg?paper=141a5033d9994242b18bb3b217e79582f1ee9306)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F141a5033d9994242b18bb3b217e79582f1ee9306)\n\n\n\n+ **[MS COCO] Microsoft COCO: Common Objects in Context** (1 May 2014)\u003Cdetails>\u003Csummary>[ECCV 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, et al. \u003C\u002Fsummary>Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. 
Lawrence Zitnick, Piotr Dollár\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1405.0312)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-33630-blue.svg?paper=71b7178df5d2b112d07e45038cb5637208659ff7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F71b7178df5d2b112d07e45038cb5637208659ff7)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcocodataset.org\u002F#home)\n\n\n\n+ **[Im2Text] Describing Images Using 1 Million Captioned Photographs** (12 Dec 2011)\\\n[NeurIPS 2011] Vicente Ordonez, Girish Kulkarni, Tamara Berg\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fpapers.nips.cc\u002Fpaper_files\u002Fpaper\u002F2011\u002Fhash\u002F5dd9db5e033da9c6fb5ba83c7a7ebea9-Abstract.html)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1192-blue.svg?paper=8e080b98efbe65c02a116439205ca2344b9f7cd4)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8e080b98efbe65c02a116439205ca2344b9f7cd4)\n\n\n\n## Video Generation\n\n### 🔅 LLM-based\n\n\n+ **Loong: Generating Minute-level Long Videos with Autoregressive Language Models** (3 Oct 2024)\u003Cdetails>\u003Csummary>Yuqing Wang, Tianwei Xiong, Daquan Zhou, et al.\u003C\u002Fsummary>Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, Xihui Liu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.02757)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=1ac7fc5a55ce5843fb8a19d9f62b623e822bb7de)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FLoong%3A-Generating-Minute-level-Long-Videos-with-Wang-Xiong\u002F1ac7fc5a55ce5843fb8a19d9f62b623e822bb7de)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fepiphqny.github.io\u002FLoong-video\u002F)\n\n\n+ **Compositional 3D-aware Video Generation with LLM Director** (31 Aug 2024)\u003Cdetails>\u003Csummary>Hanxin Zhu, Tianyu He, Anni Tang, et al.\u003C\u002Fsummary>Hanxin Zhu, Tianyu He, Anni Tang, Junliang Guo, Zhibo Chen, Jiang Bian\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.00558)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fproject\u002Fcompositional-3d-aware-video-generation\u002F)\n\n\n+ **Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation** (19 Aug 2024)\u003Cdetails>\u003Csummary>[SIGGRAPH Asia 2024] Yunxin Li, Haoyuan Shi, Baotian Hu, et al.\u003C\u002Fsummary>Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.09787)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHITsz-TMG\u002FAnim-Director.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FHITsz-TMG\u002FAnim-Director)\n\n+ **[BSQ-ViT] Image and Video Tokenization with Binary Spherical 
Quantization** (11 Jun 2024)\\\n[Tech Report] Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07548v1)\\\nTags: `tokenizer`\n\n\n+ **DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation** (11 Mar 2024)\u003Cdetails>\u003Csummary>Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, et al.\u003C\u002Fsummary>Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, Xingang Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.06845)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=b34fb645165da381e27077282d69ff224dd2d5f5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FDriveDreamer-2%3A-LLM-Enhanced-World-Models-for-Video-Zhao-Wang\u002Fb34fb645165da381e27077282d69ff224dd2d5f5)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdrivedreamer2.github.io\u002F)\n\n\n+ **[Sora] Video generation models as world simulators** (15 Feb 2024)\u003Cdetails>\u003Csummary>Tim Brooks, Bill Peebles, Connor Holmes, et al.\u003C\u002Fsummary>Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, Aditya Ramesh\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fopenai.com\u002Fresearch\u002Fvideo-generation-models-as-world-simulators)\n\n\n+ **[LWM] World Model on Million-Length Video And Language With Blockwise RingAttention** (13 Feb 2024)\u003Cdetails>\u003Csummary>Hao Liu, Wilson Yan, Matei Zaharia, et al.\u003C\u002Fsummary>Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.08268)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-38-blue.svg?paper=9259b476c31ba52b7e9ed059e5fbce2125092738)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FWorld-Model-on-Million-Length-Video-And-Language-Liu-Yan\u002Fdb7498f569be9852a04b2bb5bd68bd2885820bea)\n\n\n+ **[LGVI] Towards Language-Driven Video Inpainting via Multimodal Large Language Models** (18 Jan 2024)\u003Cdetails>\u003Csummary>Jianzong Wu, Xiangtai Li, Chenyang Si, et al.\u003C\u002Fsummary>Jianzong Wu, Xiangtai Li, Chenyang Si, Shangchen Zhou, Jingkang Yang, Jiangning Zhang, Yining Li, Kai Chen, Yunhai Tong, Ziwei Liu, Chen Change Loy\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10226)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=02d96eb0da4a282831f14923d1a65976952b7177)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FTowards-Language-Driven-Video-Inpainting-via-Large-Wu-Li\u002F02d96eb0da4a282831f14923d1a65976952b7177)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjianzongwu.github.io\u002Fprojects\u002Frovi\u002F)\n\n+ **Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization** (5 Feb 2024)\u003Cdetails>\u003Csummary>Yang Jin, Zhicheng 
Sun, Kun Xu, et al.\u003C\u002Fsummary>Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.03161)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=c1b5195bc09a2232ec2b69e5a2a6bd39b3162c62)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc1b5195bc09a2232ec2b69e5a2a6bd39b3162c62)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvideo-lavit.github.io\u002F)\\\nTags: `tokenizer`\n\n+ **VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM** (2 Jan 2024)\u003Cdetails>\u003Csummary>Fuchen Long, Zhaofan Qiu, Ting Yao, et al.\u003C\u002Fsummary>Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.01256)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=fc84fcf269a37ed7ddcb1b0f2d7d1a00f677eaea)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffc84fcf269a37ed7ddcb1b0f2d7d1a00f677eaea)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvideodrafter.github.io\u002F)\n\n+ **[PRO-Motion] Plan, Posture and Go: Towards Open-World Text-to-Motion Generation** (22 Dec 2023)\u003Cdetails>\u003Csummary>Jinpeng Liu, Wenxun Dai, Chunyu Wang, et al.\u003C\u002Fsummary>Jinpeng Liu, Wenxun Dai, Chunyu Wang, Yiji Cheng, Yansong Tang, Xin Tong\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14828)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=4599d5af850da482f591a02a3b17d56e0d358771)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4599d5af850da482f591a02a3b17d56e0d358771)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmoonsliu.github.io\u002FPro-Motion\u002F)\n\n+ **VideoPoet: A Large Language Model for Zero-Shot Video Generation** (21 Dec 2023)\u003Cdetails>\u003Csummary>Dan Kondratyuk, Lijun Yu, Xiuye Gu, et al.\u003C\u002Fsummary>Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, Lu Jiang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14125)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-18-blue.svg?paper=0c4f46e4dcae5527018e6432fb60cfe8c3354e97)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0c4f46e4dcae5527018e6432fb60cfe8c3354e97)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsites.research.google\u002Fvideopoet\u002F)\n\n+ **FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax** (27 Nov 2023)\u003Cdetails>\u003Csummary>[arXiv 
2023] Yu Lu, Linchao Zhu, Hehe Fan, et al.\u003C\u002Fsummary>Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15813)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=8feb33300c04fffa050e0dca59c3fdcafc920a3b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8feb33300c04fffa050e0dca59c3fdcafc920a3b)\n\n+ **InterControl: Generate Human Motion Interactions by Controlling Every Joint** (27 Nov 2023)\u003Cdetails>\u003Csummary>Zhenzhi Wang, Jingbo Wang, Dahua Lin, et al.\u003C\u002Fsummary>Zhenzhi Wang, Jingbo Wang, Dahua Lin, Bo Dai\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15864)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=9cdb7e415a96795dc6705e66f3b798238b4dec2c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9cdb7e415a96795dc6705e66f3b798238b4dec2c)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhenzhiwang\u002Fintercontrol.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzhenzhiwang\u002Fintercontrol)\\\nTags: `human motion generation`\n\n+ **MotionLLM: Multimodal Motion-Language Learning with Large Language Models** (27 May 2024)\u003Cdetails>\u003Csummary>Qi Wu, Yubo Zhao, Yifan Wang, et al.\u003C\u002Fsummary>Qi Wu, Yubo Zhao, Yifan Wang, Yu-Wing Tai, Chi-Keung Tang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.17013)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=480da1ac2d39b5e036ce786af081366c23f08d1b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F480da1ac2d39b5e036ce786af081366c23f08d1b)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fknoxzhao.github.io\u002FMotionLLM\u002F)\\\nTags: `general human motion generation`\n\n+ **GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning** (21 Nov 2023)\u003Cdetails>\u003Csummary>Jiaxi Lv, Yi Huang, Mingfu Yan, et al.\u003C\u002Fsummary>Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, Shifeng Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.12631)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgpt4motion.github.io\u002F)\n\n+ **[LVD] LLM-grounded Video Diffusion Models** (29 Sep 2023)\u003Cdetails>\u003Csummary>Long Lian, Baifeng Shi, Adam Yala, et al.\u003C\u002Fsummary>Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi 
Li\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17444)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=87bf66eb6d22df17f70170a0e575b4f12c4813ef)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F87bf66eb6d22df17f70170a0e575b4f12c4813ef)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllm-grounded-video-diffusion.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTonyLianLong\u002FLLM-groundedVideoDiffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTonyLianLong\u002FLLM-groundedVideoDiffusion)\n\n+ **VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning** (26 Sep 2023)\u003Cdetails>\u003Csummary>[arXiv 2023] Han Lin, Abhay Zala, Jaemin Cho, et al.\u003C\u002Fsummary>Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15091)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=16753e0317730e8c1b297338300a8c6163dd06f2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F16753e0317730e8c1b297338300a8c6163dd06f2)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvideodirectorgpt.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHL-hanlin\u002FVideoDirectorGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FHL-hanlin\u002FVideoDirectorGPT)\n\n+ **Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator** (25 Sep 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Hanzhuo Huang, Yufan Feng, Cheng Shi, et al.\u003C\u002Fsummary>Hanzhuo Huang, Yufan Feng, Cheng Shi, Lan Xu, Jingyi Yu, Sibei Yang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.14494)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-24-blue.svg?paper=120aca3e415b6641a0b0cd20695ab85ed7789612)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F120aca3e415b6641a0b0cd20695ab85ed7789612)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSooLab\u002FFree-Bloom.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FSooLab\u002FFree-Bloom)\n\n+ **[Dysen-VDM] Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models** (26 Aug 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Hao Fei, Shengqiong Wu, Wei Ji, et al.\u003C\u002Fsummary>Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng 
Chua\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.13812)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=d0a7f7fe31e0e0c42b471b4c47a313bd8c8e5206)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd0a7f7fe31e0e0c42b471b4c47a313bd8c8e5206)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](http:\u002F\u002Fhaofei.vip\u002FDysen-VDM\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fscofield7419\u002FDysen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fscofield7419\u002FDysen)\n\n+ **[DirecT2V] Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation** (23 May 2023)\u003Cdetails>\u003Csummary>[arXiv 2023] Susung Hong, Junyoung Seo, Sunghwan Hong, et al.\u003C\u002Fsummary>Susung Hong, Junyoung Seo, Sunghwan Hong, Heeseong Shin, Seungryong Kim\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14330)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-15-blue.svg?paper=b1750d2a6e3480e690999916a86c8b3876577b39)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb1750d2a6e3480e690999916a86c8b3876577b39)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKU-CVLAB\u002FDirecT2V.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FKU-CVLAB\u002FDirecT2V)\n\n+ **Text2Motion: From Natural Language Instructions to Feasible Plans** (21 Mar 2023)\u003Cdetails>\u003Csummary>[Autonomous Robots 2023] Kevin Lin, Christopher Agia, Toki Migimatsu, et al.\u003C\u002Fsummary>Kevin Lin, Christopher Agia, Toki Migimatsu, Marco Pavone, Jeannette Bohg\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12153)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-110-blue.svg?paper=8f2d4758e6d525509ae36bb30224dc9259027e6b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8f2d4758e6d525509ae36bb30224dc9259027e6b)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsites.google.com\u002Fstanford.edu\u002Ftext2motion)\n\n### Non-LLM-based\n\n\n+ **OSV: One Step is Enough for High-Quality Image to Video Generation** (17 Sep 2024)\u003Cdetails>\u003Csummary>Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, et al.\u003C\u002Fsummary>Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, Wenbing Zhu, Jiangning Zhang, Hao Chen, Mingmin Chi, Yabiao Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.11367)\n\n+ **[PAB] Real-Time Video Generation with Pyramid Attention Broadcast** (26 Jun 2024)\u003Cdetails>\u003Csummary>Xuanlei Zhao, Xiaolong Jin, Kai Wang, et al.\u003C\u002Fsummary>Xuanlei Zhao, Xiaolong Jin, Kai Wang, Yang 
You\u003C\u002Fdetails>\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Foahzxl.github.io\u002FPAB\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNUS-HPC-AI-Lab\u002FOpenDiT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNUS-HPC-AI-Lab\u002FOpenDiT)\n\n\n\n+ **Video-Infinity: Distributed Long Video Generation** (24 Jun 2024)\u003Cdetails>\u003Csummary>Zhenxiong Tan, Xingyi Yang, Songhua Liu, et al.\u003C\u002Fsummary>Zhenxiong Tan, Xingyi Yang, Songhua Liu, Xinchao Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16260)\n\n\n\n\n+ **Pandora: Towards General World Model with Natural Language Actions and Video** (12 Jun 2024)\u003Cdetails>\u003Csummary>Jiannan Xiang, Guangyi Liu, Yi Gu, et al.\u003C\u002Fsummary>Jiannan Xiang, Guangyi Liu, Yi Gu, Qiyue Gao, Yuting Ning, Yuheng Zha, Zeyu Feng, Tianhua Tao, Shibo Hao, Yemin Shi, Zhengzhong Liu, Eric P. Xing, Zhiting Hu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09455)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fworld-model.maitrix.org\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmaitrix-org\u002FPandora.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmaitrix-org\u002FPandora)\n\n\n\n+ **Text-Animator: Controllable Visual Text Video Generation** (25 Jun 2024)\u003Cdetails>\u003Csummary>Lin Liu, Quande Liu, Shengju Qian, et al.\u003C\u002Fsummary>Lin Liu, Quande Liu, Shengju Qian, Yuan Zhou, Wengang Zhou, Houqiang Li, Lingxi Xie, Qi Tian\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09455)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flaulampaul.github.io\u002Ftext-animator.html)\n\n\n+ **MotionBooth: Motion-Aware Customized Text-to-Video Generation** (25 Jun 2024)\u003Cdetails>\u003Csummary>Jianzong Wu, Xiangtai Li, Yanhong Zeng, et al.\u003C\u002Fsummary>Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, Kai Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.17758v1)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjianzongwu.github.io\u002Fprojects\u002Fmotionbooth\u002F)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=7178bbc5e8d2d9b11c890c60486ba2cc2b79b784)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FMotionBooth%3A-Motion-Aware-Customized-Text-to-Video-Wu-Li\u002F7178bbc5e8d2d9b11c890c60486ba2cc2b79b784)\n\n+ **FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models** (24 Jun 2024)\u003Cdetails>\u003Csummary>Haonan Qiu, Zhaoxi Chen, Zhouxia Wang, et al.\u003C\u002Fsummary>Haonan Qiu, Zhaoxi Chen, Zhouxia Wang, Yingqing He, Menghan Xia, Ziwei 
Liu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16863)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](http:\u002F\u002Fhaonanqiu.com\u002Fprojects\u002FFreeTraj.html)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=1868d2c2f56a92044908a789049fdd44094fc8f2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FFreeTraj%3A-Tuning-Free-Trajectory-Control-in-Video-Qiu-Chen\u002F1868d2c2f56a92044908a789049fdd44094fc8f2)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Farthur-qiu\u002FFreeTraj.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Farthur-qiu\u002FFreeTraj)\n\n\n+ **Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model** (22 Jun 2024)\u003Cdetails>\u003Csummary>Min Zhao, Hongzhou Zhu, Chendong Xiang, et al.\u003C\u002Fsummary>Min Zhao, Hongzhou Zhu, Chendong Xiang, Kaiwen Zheng, Chongxuan Li, Jun Zhu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.15735v1)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcond-image-leak.github.io\u002F)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=ebf4f746d24d79d61c070f8c354b3371f461aafb)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FIdentifying-and-Solving-Conditional-Image-Leakage-Zhao-Zhu\u002Febf4f746d24d79d61c070f8c354b3371f461aafb)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002Fcond-image-leakage.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002Fcond-image-leakage\u002F)\n\n\n+ **Image Conductor: Precision Control for Interactive Video Synthesis** (21 Jun 2024)\u003Cdetails>\u003Csummary>Yaowei Li, Xintao Wang, Zhaoyang Zhang, et al.\u003C\u002Fsummary>Yaowei Li, Xintao Wang, Zhaoyang Zhang, Zhouxia Wang, Ziyang Yuan, Liangbin Xie, Yuexian Zou, Ying Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.15339)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=b0bd64273dc8075db530fd696ee7eecb179bb908)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FImage-Conductor%3A-Precision-Control-for-Interactive-Li-Wang\u002Fb0bd64273dc8075db530fd696ee7eecb179bb908)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliyaowei-stu\u002FImageConductor.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fliyaowei-stu\u002FImageConductor)\n\n\n+ **VIDEOSCORE: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation** (21 Jun 2024)\u003Cdetails>\u003Csummary>Xuan He, Dongfu Jiang, Ge Zhang, et al.\u003C\u002Fsummary>Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu 
Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.15252)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ftiger-ai-lab.github.io\u002FVideoScore\u002F)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=1680eedc706ef081c0b103457bb52c071ab924b8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVideoScore%3A-Building-Automatic-Metrics-to-Simulate-He-Jiang\u002F1680eedc706ef081c0b103457bb52c071ab924b8)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTIGER-AI-Lab\u002FVideoScore.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTIGER-AI-Lab\u002FVideoScore\u002F)\n\n\n+ **Dreamitate: Real-World Visuomotor Policy Learning via Video Generation** (24 Jun 2024)\u003Cdetails>\u003Csummary>Junbang Liang, Ruoshi Liu, Ege Ozguroglu, et al.\u003C\u002Fsummary>Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, Carl Vondrick\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16862)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdreamitate.cs.columbia.edu\u002F)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=b0ac4f62f55bcf0427008e18f1b4b5bf7ee43df2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FDreamitate%3A-Real-World-Visuomotor-Policy-Learning-Liang-Liu\u002Fb0ac4f62f55bcf0427008e18f1b4b5bf7ee43df2)\n\n\n\n\n+ **[MCM] Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation** (11 Jun 2024)\u003Cdetails>\u003Csummary>Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, et al.\u003C\u002Fsummary>Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, Lijuan Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.06890v1)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fyhzhai.github.io\u002Fmcm\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FyhZhai\u002Fmcm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FyhZhai\u002Fmcm)\n\n\n+ **Searching Priors Makes Text-to-Video Synthesis Better** (5 Jun 2024)\u003Cdetails>\u003Csummary>Haoran Cheng, Liang Peng, Linxuan Xia, et al.\u003C\u002Fsummary>Haoran Cheng, Liang Peng, Linxuan Xia, Yuepeng Hu, Hengjia Li, Qinglin Lu, Xiaofei He, Boxi Wu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.03215)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=0dd0e0bdff37973e102a042f82cd882b890476cc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FSearching-Priors-Makes-Text-to-Video-Synthesis-Cheng-Peng\u002F0dd0e0bdff37973e102a042f82cd882b890476cc)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fhrcheng98.github.io\u002FSearch_T2V\u002F)\n\n+ **ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation** (3 Jun 2024)\u003Cdetails>\u003Csummary>Shaoshu Yang, Yong 
Zhang, Xiaodong Cun, et al.\u003C\u002Fsummary>Shaoshu Yang, Yong Zhang, Xiaodong Cun, Ying Shan, Ran He\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.00908v1)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=2917706886df4e3bf57acd0b41bd4e396be77506)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FZeroSmooth%3A-Training-free-Diffuser-Adaptation-for-Yang-Zhang\u002F2917706886df4e3bf57acd0b41bd4e396be77506#cited-papers)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fssyang2020.github.io\u002Fzerosmooth.github.io\u002F)\n\n\n+ **EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture** (30 May 2024)\u003Cdetails>\u003Csummary>Jiaqi Xu, Xinyi Zou, Kunzhe Huang, et al.\u003C\u002Fsummary>Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, Mengli Cheng, Xing Shi, Jun Huang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.18991)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Faigc-apps\u002FEasyAnimate.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Faigc-apps\u002FEasyAnimate)\n\n+ **[MOFT] Video Diffusion Models are Training-free Motion Interpreter and Controller** (23 May 2024)\u003Cdetails>\u003Csummary>Zeqi Xiao, Yifan Zhou, Shuai Yang, et al.\u003C\u002Fsummary>Zeqi Xiao, Yifan Zhou, Shuai Yang, Xingang Pan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14864)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=4f3e62c0fea3dc43f345e775192c972760b9d113)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVideo-Diffusion-Models-are-Training-free-Motion-and-Xiao-Zhou\u002F4f3e62c0fea3dc43f345e775192c972760b9d113)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fxizaoqu.github.io\u002Fmoft\u002F)\n\n+ **StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text** (21 Mar 2024)\u003Cdetails>\u003Csummary>Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, et al.\u003C\u002Fsummary>Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, Humphrey 
Shi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.14773)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=21a77ed349c8621d0a0ef8407eb744e3de3b13c5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FStreamingT2V%3A-Consistent%2C-Dynamic%2C-and-Extendable-Henschel-Khachatryan\u002F21a77ed349c8621d0a0ef8407eb744e3de3b13c5)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPicsart-AI-Research\u002FStreamingT2V.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPicsart-AI-Research\u002FStreamingT2V)\n\n+ **Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis** (22 Feb 2024)\u003Cdetails>\u003Csummary>Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, et al.\u003C\u002Fsummary>Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, Sergey Tulyakov\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.14797)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=97cb2eb0d0517e34bf4202f0593600bb6fa043cd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FSnap-Video%3A-Scaled-Spatiotemporal-Transformers-for-Menapace-Siarohin\u002F97cb2eb0d0517e34bf4202f0593600bb6fa043cd)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsnap-research.github.io\u002Fsnapvideo\u002F)\n\n+ **VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models** (17 Jan 2024)\u003Cdetails>\u003Csummary>Haoxin Chen, Yong Zhang, Xiaodong Cun, et al.\u003C\u002Fsummary>Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09047)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=492bc8339d8aac442c4ec13f8c1d59e980a3af2f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVideoCrafter2%3A-Overcoming-Data-Limitations-for-Chen-Zhang\u002F492bc8339d8aac442c4ec13f8c1d59e980a3af2f)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002Fvideocrafter2\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FVideoCrafter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FVideoCrafter)\n\n\n+ **Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets** (25 Nov 2023)\u003Cdetails>\u003Csummary>Andreas Blattmann, Tim Dockhorn, Sumith Kulal, et al.\u003C\u002Fsummary>Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin 
Rombach\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15127)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-51-blue.svg?paper=1206b05eae5a06ba662ae79fb291b50e359c4f42)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1206b05eae5a06ba662ae79fb291b50e359c4f42)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fstability.ai\u002Fresearch\u002Fstable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FStability-AI\u002Fgenerative-models.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FStability-AI\u002Fgenerative-models)\n\n+ **VideoCrafter1: Open Diffusion Models for High-Quality Video Generation** (30 Oct 2023)\u003Cdetails>\u003Csummary>Haoxin Chen, Menghan Xia, Yingqing He, et al.\u003C\u002Fsummary>Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.19512)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-38-blue.svg?paper=1891c3756f870d902a0b793a1dcd5cc34c778612)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1891c3756f870d902a0b793a1dcd5cc34c778612)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002Fvideocrafter\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FVideoCrafter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FVideoCrafter)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVideoCrafter\u002FVideoCrafter)\n\n+ **DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors** (18 Oct 2023)\u003Cdetails>\u003Csummary>Jinbo Xing, Menghan Xia, Yong Zhang, et al.\u003C\u002Fsummary>Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Xintao Wang, Tien-Tsin Wong, Ying Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.12190)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-42-blue.svg?paper=083bab4a967c2221d9f4da9110fe37d8ca679078)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FDynamiCrafter%3A-Animating-Open-domain-Images-with-Xing-Xia\u002F083bab4a967c2221d9f4da9110fe37d8ca679078)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdoubiiu.github.io\u002Fprojects\u002FDynamiCrafter\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDoubiiu\u002FDynamiCrafter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FDoubiiu\u002FDynamiCrafter)\n\n+ **FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling** (23 Oct 2023)\u003Cdetails>\u003Csummary>Haonan Qiu, Menghan Xia, Yong Zhang, et al.\u003C\u002Fsummary>Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei 
Liu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.15169)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=d831988859f0c077b38094446d8585a8340af223)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd831988859f0c077b38094446d8585a8340af223)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](http:\u002F\u002Fhaonanqiu.com\u002Fprojects\u002FFreeNoise.html)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Farthur-qiu\u002FLongerCrafter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Farthur-qiu\u002FLongerCrafter)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FMoonQiu\u002FLongerCrafter)\n\n+ **Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation** (13 Jul 2023)\u003Cdetails>\u003Csummary>Yingqing He, Menghan Xia, Haoxin Chen, et al.\u003C\u002Fsummary>Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06940)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-18-blue.svg?paper=77040969110fab39a55699cb06f9edf68789445a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F77040969110fab39a55699cb06f9edf68789445a)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002FAnimate-A-Story\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FAnimate-A-Story.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FAnimate-A-Story)\n\n+ **Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance** (1 Jun 2023)\u003Cdetails>\u003Csummary>Jinbo Xing, Menghan Xia, Yuxin Liu, et al.\u003C\u002Fsummary>Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00943)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-31-blue.svg?paper=52b10ae66d025e99fbb602935e155f97f4f0696f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F52b10ae66d025e99fbb602935e155f97f4f0696f)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdoubiiu.github.io\u002Fprojects\u002FMake-Your-Video\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FMake-Your-Video.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FMake-Your-Video)\n\n+ **Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos** (3 Apr 2023)\u003Cdetails>\u003Csummary>Yue Ma, Yingqing He, Xiaodong Cun, et al.\u003C\u002Fsummary>Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Ying Shan, Xiu Li, Qifeng 
Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.01186)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-53-blue.svg?paper=ee73edebd42626d9c2d91e35fd2ed3cdb0fb26d0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fee73edebd42626d9c2d91e35fd2ed3cdb0fb26d0)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ffollow-your-pose.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmayuelala\u002FFollowYourPose.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmayuelala\u002FFollowYourPose)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FYueMafighting\u002FFollowYourPose)\n\n+ **Real-time Controllable Denoising for Image and Video** (29 Mar 2023)\u003Cdetails>\u003Csummary>[CVPR 2023] Zhaoyang Zhang, Yitong Jiang, Wenqi Shao, et al.\u003C\u002Fsummary>Zhaoyang Zhang, Yitong Jiang, Wenqi Shao, Xiaogang Wang, Ping Luo, Kaimo Lin, Jinwei Gu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.16425)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-108-blue.svg?paper=3f3746c3c64212e97c877bd3d862b578fa24632c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FReal-Time-Controllable-Denoising-for-Image-and-Zhang-Jiang\u002F3f3746c3c64212e97c877bd3d862b578fa24632c)\n\n+ **VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation** (15 Mar 2023)\u003Cdetails>\u003Csummary>Zhengxiong Luo, Dayou Chen, Yingya Zhang, et al.\u003C\u002Fsummary>Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.08320)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-108-blue.svg?paper=26c6090b7e7ba4513f82aa28d41360c60770c618)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F26c6090b7e7ba4513f82aa28d41360c60770c618)\n\n\n\n\n### Video VAE\u002FTokenizers\n\n\n+ **DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation** (17 Feb 2025)\u003Cdetails>\u003Csummary>Zhihang Yuan, Siyuan Wang, Rui Xie, et al.\u003C\u002Fsummary>Zhihang Yuan, Siyuan Wang, Rui Xie, Hanling Zhang, Tongcheng Fang, Yuzhang Shang, Shengen Yan, Guohao Dai, Yu Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11897)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-nics\u002FDLFR-VAE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-nics\u002FDLFR-VAE)\n\n+ **VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE** (23 Dec 2024)\u003Cdetails>\u003Csummary>Yazhou Xing, Yang Fei, Yingqing He, et al.\u003C\u002Fsummary>Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng 
Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17805)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fyzxing87.github.io\u002Fvae\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVideoVerses\u002FVideoVAEPlus.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVideoVerses\u002FVideoVAEPlus?tab=readme-ov-file)\n\n+ **VidTwin: Video VAE with Decoupled Structure and Dynamics** (23 Dec 2024)\u003Cdetails>\u003Csummary>Yuchi Wang, Junliang Guo, Xinyi Xie, et al.\u003C\u002Fsummary>Yuchi Wang, Junliang Guo, Xinyi Xie, Tianyu He, Xu Sun, Jiang Bian\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17726)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FVidTok.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FVidTok\u002Ftree\u002Fmain\u002Fvidtwin)\n\n\n+ **VidTok: A Versatile and Open-Source Video Tokenizer** (17 Dec 2024)\u003Cdetails>\u003Csummary>Anni Tang, Tianyu He, Junliang Guo, et al.\u003C\u002Fsummary>Anni Tang, Tianyu He, Junliang Guo, Xinle Cheng, Li Song, Jiang Bian\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.13061)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FVidTok.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FVidTok)\n\n\n+ **[CVPR 2025] WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model** (26 Nov 2024)\u003Cdetails>\u003Csummary>Zongjian Li, Bin Lin, Yang Ye, et al.\u003C\u002Fsummary>Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, Li Yuan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.17459)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=58d9eaa0868e971687c20d0588de3058b7780b51)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FWF-VAE%3A-Enhancing-Video-VAE-by-Wavelet-Driven-Flow-Li-Lin\u002F58d9eaa0868e971687c20d0588de3058b7780b51)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FWF-VAE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FWF-VAE)\n\n\n+ **[CVPR 2025] [IV-VAE] Improved Video VAE for Latent Video Diffusion Model** (10 Nov 2024)\u003Cdetails>\u003Csummary>Pingyu Wu, Kai Zhu, Yu Liu, et al.\u003C\u002Fsummary>Pingyu Wu, Kai Zhu, Yu Liu, Liming Zhao, Wei Zhai, Yang Cao, Zheng-Jun 
Zha\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.06449)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=4e073da5a37753fba320719baaa17ca593e6a094)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FImproved-Video-VAE-for-Latent-Video-Diffusion-Model-Wu-Zhu\u002F4e073da5a37753fba320719baaa17ca593e6a094)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwpy1999.github.io\u002FIV-VAE\u002F)\n\n\n\n+ **[Tech Report] Cosmos Tokenizer: A suite of image and video neural tokenizers** (6 Nov 2024)\u003Cdetails>\u003Csummary>Fitsum Reda, Jinwei Gu, Xian Liu, et al.\u003C\u002Fsummary>Fitsum Reda, Jinwei Gu, Xian Liu, Songwei Ge, Ting-Chun Wang, Haoxiang Wang, Ming-Yu Liu\u003C\u002Fdetails>\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fdir\u002Fcosmos-tokenizer\u002F)\n[![TokenBench](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTokenBench-000000)](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FTokenBench?tab=readme-ov-file)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FCosmos-Tokenizer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FCosmos-Tokenizer)\n\n\n+ **[NeurIPS 2024] CV-VAE: A Compatible Video VAE for Latent Generative Video Models** (30 May 2024)\u003Cdetails>\u003Csummary>Sijie Zhao, Yong Zhang, Xiaodong Cun, et al.\u003C\u002Fsummary>Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, Ying Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.20279)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=70569a07d841f86faf8914aea435a1696f911a32)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FCV-VAE%3A-A-Compatible-Video-VAE-for-Latent-Video-Zhao-Zhang\u002F70569a07d841f86faf8914aea435a1696f911a32)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002Fcvvae\u002Findex.html)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FCV-VAE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FCV-VAE)\n\n\n\n+ **[ICLR 2024] [MAGVIT-v2] Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation** (9 Oct 2023)\u003Cdetails>\u003Csummary>Lijun Yu, José Lezama, Nitesh B. Gundavarapu, et al.\u003C\u002Fsummary>Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. 
Ross, Lu Jiang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.05737)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-15-blue.svg?paper=985f0c89c5a607742ec43c1fdc2cbfe54541cbad)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F985f0c89c5a607742ec43c1fdc2cbfe54541cbad)\n\n\n\n### Audio-Video\n\n+ **JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization** (30 Mar 2025)\u003Cdetails>\u003Csummary>Kai Liu, Wei Li, Lai Chen, et al.\u003C\u002Fsummary>Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23377)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjavisdit.github.io\u002F)\n\n\n+ **[LVAS-Agent] Long-Video Audio Synthesis with Multi-Agent Collaboration** (13 Mar 2025)\n  \u003Cdetails>\u003Csummary>Yehang Zhang, Xinli Xu, Xiaojie Xu, et al.\u003C\u002Fsummary>Yehang Zhang, Xinli Xu, Xiaojie Xu, Li Liu, Yingcong Chen\u003C\u002Fdetails>\n  \n  [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10719)\n  [![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=2503.10719)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2503.10719)\n  [![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flvas-agent.github.io\u002F)\n  [![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZYH-Lightyear\u002FLVAS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FZYH-Lightyear\u002FLVAS)\n\n\n+ **UniForm: A Unified Diffusion Transformer for Audio-Video Generation** (6 Feb 2025)\n    \n    \u003Cdetails>\u003Csummary>Lei Zhao, Linfeng Feng, Dongxu Ge, et al.\u003C\u002Fsummary>Lei Zhao, Linfeng Feng, Dongxu Ge, Fangqiu Yi, Chi Zhang, Xiao-Lei Zhang, Xuelong Li\u003C\u002Fdetails>\n\n    [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03897)\n    [![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=7a4436209d8cf0d868b13000c8abff63d72daf0f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FUniForm%3A-A-Unified-Diffusion-Transformer-for-Zhao-Feng\u002F7a4436209d8cf0d868b13000c8abff63d72daf0f)\n    [![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Funiform-t2av.github.io)\n\n\n+ **TIA2V: Video generation conditioned on triple modalities of text–image–audio** (4 Jan 2025)\u003Cdetails>\u003Csummary>Minglu Zhao, Wenmin Wang, Rui Zhang, et al.\u003C\u002Fsummary>Minglu Zhao, Wenmin Wang, Rui Zhang, Haomei Jia, Qi Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-b31b1b.svg)](https:\u002F\u002Fdoi.org\u002F10.1016\u002Fj.eswa.2024.126278)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMinglu58\u002FTIA2V.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMinglu58\u002FTIA2V)\n\n\n\n+ **SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation** (18 Dec 2024)\n\n    \u003Cdetails>\u003Csummary>Kazuki Shimada, 
Christian Simon, Takashi Shibuya, et al.\u003C\u002Fsummary>Kazuki Shimada, Christian Simon, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji\u003C\u002Fdetails>\n\n    [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.13462)\n\n\n+ **AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation** (19 Dec 2024)\n\n    \u003Cdetails>\u003Csummary>Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, et al.\u003C\u002Fsummary>Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov\u003C\u002Fdetails>\n\n    [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15191)\n    [![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsnap-research.github.io\u002FAVLink\u002F)\n\n\n+ **SyncFlow: Temporally Aligned Joint Audio-Video Generation from Text** (3 Dec 2024)\n\n    \u003Cdetails>\u003Csummary>Haohe Liu, Gael Le Lan, Xinhao Mei, et al.\u003C\u002Fsummary>Haohe Liu, Gael Le Lan, Xinhao Mei, Zhaoheng Ni, Anurag Kumar, Varun Nagaraja, Wenwu Wang, Mark D. Plumbley, Yangyang Shi, Vikas Chandra\u003C\u002Fdetails>\n\n    [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15220)\n\n\n\n+ **A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation** (26 Sep 2024)\n\n    \u003Cdetails>\u003Csummary>Masato Ishii, Akio Hayakawa, Takashi Shibuya, et al.\u003C\u002Fsummary>Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji\u003C\u002Fdetails>\n\n    [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17550)\n\n\n+ **AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation** (11 Jun 2024)\n\n    \u003Cdetails>\u003Csummary>Kai Wang, Shijian Deng, Jing Shi, et al.\u003C\u002Fsummary>Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, Yapeng Tian\u003C\u002Fdetails>\n\n    [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07686)\n\n\n+ **Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation** (28 May 2024)\u003Cdetails>\u003Csummary>Akio Hayakawa, Masato Ishii, Takashi Shibuya, et al.\u003C\u002Fsummary>Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.17842)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSonyResearch\u002FMMDisCo.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FSonyResearch\u002FMMDisCo)\n\n\n\n+ **AudioScenic: Audio-Driven Video Scene Editing** (25 Apr 2024)\u003Cdetails>\u003Csummary>Kaixin Shen, Ruijie Quan, Linchao Zhu, et al.\u003C\u002Fsummary>Kaixin Shen, Ruijie Quan, Linchao Zhu, Jun Xiao, Yi Yang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.16581)\n\n\n\n+ **A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation** (22 May 2024)\u003Cdetails>\u003Csummary>Gwanghyun Kim, Alonso Martinez, Yu-Chuan Su, et 
al.\u003C\u002Fsummary>Gwanghyun Kim, Alonso Martinez, Yu-Chuan Su, Brendan Jou, José Lezama, Agrim Gupta, Lijun Yu, Lu Jiang, Aren Jansen, Jacob Walker, Krishna Somandepalli\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.13762)\n\n\n\n\n+ **Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model** (25 Apr 2024)\u003Cdetails>\u003Csummary>Gehui Chen, Guan'an Wang, Xiaowen Huang, et al.\u003C\u002Fsummary>Gehui Chen, Guan'an Wang, Xiaowen Huang, Jitao Sang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.16305)\n\n\n+ **TAVGBench: Benchmarking Text to Audible-Video Generation** (22 Apr 2024)\u003Cdetails>\u003Csummary>Yuxin Mao, Xuyang Shen, Jing Zhang, et al.\u003C\u002Fsummary>Yuxin Mao, Xuyang Shen, Jing Zhang, Zhen Qin, Jinxing Zhou, Mochu Xiang, Yiran Zhong, Yuchao Dai\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14381)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FTAVGBench)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenNLPLab\u002FTAVGBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FTAVGBench)\n\n\n+ **[ECCV 2024 Oral] ASVA: Audio-Synchronized Visual Animation** (8 Mar 2024)\u003Cdetails>\u003Csummary>Lin Zhang, Shentong Mo, Yijing Zhang, et al.\u003C\u002Fsummary>Lin Zhang, Shentong Mo, Yijing Zhang, Pedro Morgado\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05659)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flzhangbj.github.io\u002Fprojects\u002Fasva\u002Fasva.html)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flzhangbj\u002FASVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Flzhangbj\u002FASVA)\n\n\n+ **[CVPR 2024] Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners** (27 Feb 2024)\u003Cdetails>\u003Csummary>Yazhou Xing, Yingqing He, Zeyue Tian, et al.\u003C\u002Fsummary>Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17723)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyzxing87\u002FSeeing-and-Hearing.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyzxing87\u002FSeeing-and-Hearing)\n\n\n+ **TräumerAI: Dreaming Music with StyleGAN** (9 Feb 2021)\u003Cdetails>\u003Csummary>Dasaem Jeong, Seungheon Doh, Taegyun Kwon (NeurIPS Workshop 2020)\u003C\u002Fsummary>Dasaem Jeong, Seungheon Doh, Taegyun Kwon\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.04680)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-259-blue.svg?paper=8a1384e041cc6ea2735b01c734aeef666dc92884)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8a1384e041cc6ea2735b01c734aeef666dc92884)\n\n\n+ **Sound2Sight: Generating Visual Dynamics from Sound and Context** (23 Jul 
2020)\u003Cdetails>\u003Csummary>Anoop Cherian, Moitreya Chatterjee, Narendra Ahuja. (ECCV 2020)\u003C\u002Fsummary>Anoop Cherian, Moitreya Chatterjee, Narendra Ahuja\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.12130)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=02f3ced09497c5db59985b2a5db9d3d0aebe5074)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F02f3ced09497c5db59985b2a5db9d3d0aebe5074)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmerlresearch\u002FSound2Sight.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmerlresearch\u002FSound2Sight)\n\n\n\n\n\n\n\n### Benchmarks\n\n\n+ **VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models** (20 Nov 2024)\u003Cdetails>\u003Csummary>Ziqi Huang, Fan Zhang, Xiaojie Xu, et al.\u003C\u002Fsummary>Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.13503)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvchitect.github.io\u002FVBench-project\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVchitect\u002FVBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVchitect\u002FVBench)\n\n\n+ **[VideoGen-Eval] The Dawn of Video Generation: Preliminary Explorations with SORA-like Models** (7 Oct 2024)\u003Cdetails>\u003Csummary>Ailing Zeng, Yuhang Yang, Weidong Chen, et al.\u003C\u002Fsummary>Ailing Zeng, Yuhang Yang, Weidong Chen, Wei Liu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05227)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002FVideoGen-Eval\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FVideoGen-Eval.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FVideoGen-Eval)\n\n\n+ **ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation** (26 Jun 2024)\u003Cdetails>\u003Csummary>Shenghai Yuan, Jinfa Huang, Yongqi Xu, et al.\u003C\u002Fsummary>Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.18522v1)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fpku-yuangroup.github.io\u002FChronoMagic-Bench\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FChronoMagic-Bench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FChronoMagic-Bench)\n\n\n+ **TAVGBench: Benchmarking Text to Audible-Video Generation** (22 Apr 2024)\u003Cdetails>\u003Csummary>Yuxin Mao, Xuyang Shen, Jing Zhang, et al.\u003C\u002Fsummary>Yuxin Mao, Xuyang Shen, Jing Zhang, Zhen Qin, Jinxing Zhou, Mochu Xiang, Yiran Zhong, Yuchao 
Dai\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14381)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=4ba90678411ddc0a2eb997e1184b059bdc955fd5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FTAVGBench%3A-Benchmarking-Text-to-Audible-Video-Mao-Shen\u002F4ba90678411ddc0a2eb997e1184b059bdc955fd5)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenNLPLab\u002FTAVGBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FTAVGBench)\n\n\n+ **Sora Generates Videos with Stunning Geometrical Consistency** (27 Feb 2024)\u003Cdetails>\u003Csummary>Xuanyi Li, Daquan Zhou, Chenxu Zhang, et al.\u003C\u002Fsummary>Xuanyi Li, Daquan Zhou, Chenxu Zhang, Shaodong Wei, Qibin Hou, Ming-Ming Cheng\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17403)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsora-geometrical-consistency.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmeteorshowers\u002FSora-Generates-Videos-with-Stunning-Geometrical-Consistency.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmeteorshowers\u002FSora-Generates-Videos-with-Stunning-Geometrical-Consistency)\n\n\n+ **[CVPR 2024 Highlight] VBench: Comprehensive Benchmark Suite for Video Generative Models** (29 Nov 2023)\u003Cdetails>\u003Csummary>Ziqi Huang, Yinan He, Jiashuo Yu, et al.\u003C\u002Fsummary>Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17982)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=4e9a8141da2a8c603722b07d096109207f8e0b66)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4e9a8141da2a8c603722b07d096109207f8e0b66)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvchitect.github.io\u002FVBench-project\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVchitect\u002FVBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVchitect\u002FVBench)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVchitect\u002FVBench_Leaderboard)\n\n\n+ **[CVPR 2024] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models** (17 Oct 2023)\u003Cdetails>\u003Csummary>Yaofang Liu, Xiaodong Cun, Xuebo Liu, et al.\u003C\u002Fsummary>Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, Ying 
Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11440)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fevalcrafter.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEvalCrafter\u002FEvalCrafter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FEvalCrafter\u002FEvalCrafter)\n\n\n### Datasets\n\n+ **VidGen-1M: A Large-Scale Dataset for Text-to-video Generation** (5 Aug 2024)\u003Cdetails>\u003Csummary>Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, et al.\u003C\u002Fsummary>Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Hao Li\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.02629)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=8af933a6e0d45e041a1ca35d461aad92022aa957)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVidGen-1M%3A-A-Large-Scale-Dataset-for-Text-to-video-Tan-Yang\u002F8af933a6e0d45e041a1ca35d461aad92022aa957)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSAIS-FUXI\u002FVidGen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FSAIS-FUXI\u002FVidGen)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsais-fuxi.github.io\u002Fprojects\u002Fvidgen-1m\u002F)\n\n\n+ **Vript: A Video Is Worth Thousands of Words** (10 Jun 2024)\u003Cdetails>\u003Csummary>[NIPS 2024 Dataset & Benchmark track] Dongjie Yang, Suyuan Huang, Chengqiang Lu, et al.\u003C\u002Fsummary>Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, Hai Zhao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.06040)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=2fb2a76be0f261fb2660457a6cec8a8384b33a19)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVript%3A-A-Video-Is-Worth-Thousands-of-Words-Yang-Huang\u002F2fb2a76be0f261fb2660457a6cec8a8384b33a19)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmutonix\u002FVript.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmutonix\u002FVript)\n\n\n+ **MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions** (30 Jul 2024)\u003Cdetails>\u003Csummary>Xiaowei Chi, Yatian Wang, Aosong Cheng, et al.\u003C\u002Fsummary>Xiaowei Chi, Yatian Wang, Aosong Cheng, Pengjun Fang, Zeyue Tian, Yingqing He, Zhaoyang Liu, Xingqun Qi, Jiahao Pan, Rongyu Zhang, Mengfei Li, Ruibin Yuan, Yanbing Jiang, Wei Xue, Wenhan Luo, Qifeng Chen, Shanghang Zhang, Qifeng Liu, Yike 
Guo\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.20962)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=0d8ea8cf8fcadc0eb52304258e254e01c62dfe52)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FMMTrail%3A-A-Multimodal-Trailer-Video-Dataset-with-Chi-Wang\u002F0d8ea8cf8fcadc0eb52304258e254e01c62dfe52)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flitwellchi\u002FMMTrail.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Flitwellchi\u002FMMTrail)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmattie-e.github.io\u002FMMTrail\u002F)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flitwell\u002FMMTrail-20M)\n\n\n+ **InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation** (13 Jul 2023)\u003Cdetails>\u003Csummary>[ICLR 2024 Spotlight] Yi Wang, Yinan He, Yizhuo Li, et al.\u003C\u002Fsummary>Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06942)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-36-blue.svg?paper=369b449415d50387fba048bbd4d26ee890df84b5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F369b449415d50387fba048bbd4d26ee890df84b5)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVideo.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVideo)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenGVLab\u002FInternVid)\n\n+ **[HD-VG-130M] VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation** (18 May 2023)\u003Cdetails>\u003Csummary>Wenjing Wang, Huan Yang, Zixi Tuo, et al.\u003C\u002Fsummary>Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, Jiaying Liu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10874)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-43-blue.svg?paper=50bbf2c11984d18aa14f964a4909ac25f07e50ea)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F50bbf2c11984d18aa14f964a4909ac25f07e50ea)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdaooshee\u002FHD-VG-130M.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdaooshee\u002FHD-VG-130M?tab=readme-ov-file)\n\n+ **[VideoCC3M] Learning Audio-Video Modalities from Image Captions** (Apr 2022)\u003Cdetails>\u003Csummary>[ECCV 2022] Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, et al.\u003C\u002Fsummary>Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia 
Schmid\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.00679)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-48-blue.svg?paper=aa1b722485106c84e52c5e35b2d4b2f8c7fb3135)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Faa1b722485106c84e52c5e35b2d4b2f8c7fb3135)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle-research-datasets\u002FvideoCC-data.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002FvideoCC-data)\n\n+ **CelebV-Text: A Large-Scale Facial Text-Video Dataset** (26 Mar 2023)\u003Cdetails>\u003Csummary>[CVPR 2023] Jianhui Yu, Hao Zhu, Liming Jiang, et al.\u003C\u002Fsummary>Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, Wayne Wu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.14717)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=484d2194ce8459bfa9da906e556f63812c6ca999)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F484d2194ce8459bfa9da906e556f63812c6ca999)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcelebv-text.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCelebV-Text\u002FCelebV-Text.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FCelebV-Text\u002FCelebV-Text)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=0TS1hQwjNWw)\n\n+ **[HD-VILA-100M] Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions** (19 Nov 2021)\u003Cdetails>\u003Csummary>[CVPR 2022] Hongwei Xue, Tiankai Hang, Yanhong Zeng, et al. \u003C\u002Fsummary>Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.10337)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-84-blue.svg?paper=e1a3e6856b6ac6af3600b5954392e5368603fd1b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe1a3e6856b6ac6af3600b5954392e5368603fd1b)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FXPretrain.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FXPretrain\u002Fblob\u002Fmain\u002Fhd-vila-100m\u002FREADME.md)\n\n+ **[YT-Temporal-180M] MERLOT: Multimodal Neural Script Knowledge Models** (4 Jun 2021)\u003Cdetails>\u003Csummary>[NeurIPS 2021] Rowan Zellers, Ximing Lu, Jack Hessel, et al. 
\u003C\u002Fsummary>Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.02636)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-277-blue.svg?paper=90357a6dc817e2f7cec477a51156675fbf545cf1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F90357a6dc817e2f7cec477a51156675fbf545cf1)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frowanz\u002Fmerlot.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Frowanz\u002Fmerlot)\n\n+ **[WebVid-10M] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval** (1 Apr 2021)\u003Cdetails>\u003Csummary>[ICCV 2021] Max Bain, Arsha Nagrani, Gül Varol, et al. \u003C\u002Fsummary>Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.00650)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-585-blue.svg?paper=bac87bdb1cabc35fafb8176a234d332ebcc02864)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbac87bdb1cabc35fafb8176a234d332ebcc02864)\n\n+ **[WTS70M] Learning Video Representations from Textual Web Supervision** (29 Jul 2020)\u003Cdetails>\u003Csummary>Jonathan C. Stroud, Zhichao Lu, Chen Sun, et al.\u003C\u002Fsummary>Jonathan C. Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, David A. Ross\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.14937)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-42-blue.svg?paper=da55208bc9b56b5f394c242239d8cd0734bd5a87)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fda55208bc9b56b5f394c242239d8cd0734bd5a87)\n\n+ **HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips** (7 Jun 2019)\u003Cdetails>\u003Csummary>[ICCV 2019] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, et al. \u003C\u002Fsummary>Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.03327)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-847-blue.svg?paper=9311779489e597315488749ee6c386bfa3f3512e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9311779489e597315488749ee6c386bfa3f3512e)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.di.ens.fr\u002Fwillow\u002Fresearch\u002Fhowto100m\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fantoine77340\u002Fhowto100m.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fantoine77340\u002Fhowto100m?tab=readme-ov-file)\n\n+ **VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research** (6 Apr 2019)\u003Cdetails>\u003Csummary>[ICCV 2019 Oral] Xin Wang, Jiawei Wu, Junkun Chen, et al. 
\u003C\u002Fsummary>Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, William Yang Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.03493)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-373-blue.svg?paper=28b74bb7c8b08cceb2430ec2d54dfa0f3225d796)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F28b74bb7c8b08cceb2430ec2d54dfa0f3225d796)\n\n+ **How2: A Large-scale Dataset for Multimodal Language Understanding** (1 Nov 2018)\u003Cdetails>\u003Csummary>[NeurIPS 2018] Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, et al. \u003C\u002Fsummary>Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, Florian Metze\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.00347)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-223-blue.svg?paper=f56cb5dc32b5b280546998418fda7769d0858629)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff56cb5dc32b5b280546998418fda7769d0858629)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsrvk.github.io\u002Fhow2-dataset\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsrvk\u002Fhow2-dataset.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsrvk\u002Fhow2-dataset)\n\n+ **[ActivityNet Captions] Dense-Captioning Events in Videos** (2 May 2017)\u003Cdetails>\u003Csummary>[ICCV 2017] Ranjay Krishna, Kenji Hata, Frederic Ren, et al. \u003C\u002Fsummary>Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, Juan Carlos Niebles\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.00754)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-934-blue.svg?paper=96dd1fc39a368d23291816d57763bc6eb4f7b8d6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F96dd1fc39a368d23291816d57763bc6eb4f7b8d6)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Franjaykrishna\u002Fdensevid\u002F)\n\n+ **[LSMDC] Movie Description** (12 May 2016)\u003Cdetails>\u003Csummary>[IJCV 2017] Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, et al. \u003C\u002Fsummary>Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, Bernt Schiele\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1605.03705)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-278-blue.svg?paper=154c22ca5eef149aedc8a986fa684ca1fd14e7dc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F154c22ca5eef149aedc8a986fa684ca1fd14e7dc)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.mpi-inf.mpg.de\u002Fdepartments\u002Fcomputer-vision-and-machine-learning\u002Fresearch\u002Fvision-and-language\u002Fmpii-movie-description-dataset)\n\n\n+ **MSR-VTT: A Large Video Description Dataset for Bridging Video and Language** (Jun 2016)\u003Cdetails>\u003Csummary>[CVPR 2016] Jun Xu, Tao Mei, Ting Yao, et al. 
\u003C\u002Fsummary>Jun Xu, Tao Mei, Ting Yao and Yong Rui\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_cvpr_2016\u002Fpapers\u002FXu_MSR-VTT_A_Large_CVPR_2016_paper.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1395-blue.svg?paper=b8e2e9f3ba008e28257195ec69a00e07f260131d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb8e2e9f3ba008e28257195ec69a00e07f260131d)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcrux82\u002Fmsr-vtt-it.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fcrux82\u002Fmsr-vtt-it)\n\n\n\n\n\n## 3D Generation\n\n### 🔅 LLM-based\n+ **SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code** (2 Mar 2024)\u003Cdetails>\u003Csummary>Ziniu Hu, Ahmet Iscen, Aashi Jain, et al. \u003C\u002Fsummary>Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, Alireza Fathi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.01248v1)\n\n\n+ **MotionScript: Natural Language Descriptions for Expressive 3D Human Motions** (19 Dec 2023)\u003Cdetails>\u003Csummary>Payam Jome Yazdian, Eric Liu, Li Cheng, et al. \u003C\u002Fsummary>Payam Jome Yazdian, Eric Liu, Li Cheng, Angelica Lim\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.12634)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=816792e66f463be2aa1888e4ecb51f8fb2b4dd79)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F816792e66f463be2aa1888e4ecb51f8fb2b4dd79)\n\n+ **HOLODECK: Language Guided Generation of 3D Embodied AI Environments** (19 Dec 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Yue Yang, Fan-Yun Sun, Luca Weihs, et al. \u003C\u002Fsummary>Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, Christopher Clark\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.09067)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=1dbc2cdcae3e17c3d721d12a5a2d98ced727681a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1dbc2cdcae3e17c3d721d12a5a2d98ced727681a)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002FHolodeck.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fallenai\u002FHolodeck)\n\n+ **PoseGPT: Chatting about 3D Human Pose** (30 Nov 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Yao Feng, Jing Lin, Sai Kumar Dwivedi, et al. \u003C\u002Fsummary>Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Michael J. 
Black\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.18836)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=4673c2ac4abb4b055da87171231acb60801ffe74)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4673c2ac4abb4b055da87171231acb60801ffe74)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfeng95\u002FPoseGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyfeng95\u002FPoseGPT)\n\n\n+ **3D-GPT: Procedural 3D Modeling with Large Language Models** (19 Oct 2023)\u003Cdetails>\u003Csummary>Chunyi Sun*, Junlin Han*, Weijian Deng, et al. \u003C\u002Fsummary>Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.12945)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=588930cdd801f335b5e524d13f99aa94136a20a0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F588930cdd801f335b5e524d13f99aa94136a20a0)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChuny1\u002F3DGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FChuny1\u002F3DGPT)\n\n\n\n\n### Non-LLM-based (CLIP\u002FT5)\n+ **DreamPolisher: Towards High-Quality Text-to-3D Generation via Geometric Diffusion** (12 Mar 2024)\u003Cdetails>\u003Csummary>Yuanze Lin, Ronald Clark, Philip Torr. \u003C\u002Fsummary>Yuanze Lin, Ronald Clark, Philip Torr\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.17237)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=72e54db6eebb99d4039cd66cb5dad6a40b31cf87)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F72e54db6eebb99d4039cd66cb5dad6a40b31cf87)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyuanze-lin\u002FDreamPolisher.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyuanze-lin\u002FDreamPolisher)\n\n+ **Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior** (12 Mar 2024)\u003Cdetails>\u003Csummary>[CVPR 2024] Zike Wu, Pan Zhou, Xuanyu Yi, et al. \u003C\u002Fsummary>Zike Wu, Pan Zhou, Xuanyu Yi, Xiaoding Yuan, Hanwang Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09050)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=834f595fb25b8306b46f9e744ef8150f4971322f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F834f595fb25b8306b46f9e744ef8150f4971322f)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsail-sg\u002FConsistent3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002FConsistent3D)\n\n+ **AToM: Amortized Text-to-Mesh using 2D Diffusion** (1 Feb 2024)\u003Cdetails>\u003Csummary>Guocheng Qian, Junli Cao, Aliaksandr Siarohin, et al. 
\u003C\u002Fsummary>Guocheng Qian, Junli Cao, Aliaksandr Siarohin, Yash Kant, Chaoyang Wang, Michael Vasilkovsky, Hsin-Ying Lee, Yuwei Fang, Ivan Skorokhodov, Peiye Zhuang, Igor Gilitschenski, Jian Ren, Bernard Ghanem, Kfir Aberman, Sergey Tulyakov\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.00867)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=ce2edcc6a0ef7592c385c5d8fd0924f79707e223)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fce2edcc6a0ef7592c385c5d8fd0924f79707e223)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnap-research\u002FAToM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsnap-research\u002FAToM)\n\n\n+ **DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior** (12 Mar 2024)\u003Cdetails>\u003Csummary>[CVPR 2024] Tianyu Huang, Yihan Zeng, Zhilu Zhang, et al. \u003C\u002Fsummary>Tianyu Huang, Yihan Zeng, Zhilu Zhang, Wan Xu, Hang Xu, Songcen Xu, Rynson W. H. Lau, Wangmeng Zuo\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06439)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=c15380dcda5a010827e3b014dcebe95b1218c680)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc15380dcda5a010827e3b014dcebe95b1218c680)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftyhuang0428\u002FDreamControl.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ftyhuang0428\u002FDreamControl)\n\n\n+ **UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation** (14 Dec 2023)\u003Cdetails>\u003Csummary>Zexiang Liu, Yangguang Li, Youtian Lin, et al. \u003C\u002Fsummary>Zexiang Liu, Yangguang Li, Youtian Lin, Xin Yu, Sida Peng, Yan-Pei Cao, Xiaojuan Qi, Xiaoshui Huang, Ding Liang, Wanli Ouyang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08754)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=210e63599d49abdb848a4440d4244cdcdedeadff)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F210e63599d49abdb848a4440d4244cdcdedeadff)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYG256Li\u002FUniDream.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYG256Li\u002FUniDream)\n\n+ **Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior** (11 Dec 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Fangfu Liu, Diankun Wu, Yi Wei, et al. 
\u003C\u002Fsummary>Fangfu Liu, Diankun Wu, Yi Wei, Yongming Rao, Yueqi Duan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06655)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=fe1b3f0d074974ce946f10f3bbf52e8351bc0156)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffe1b3f0d074974ce946f10f3bbf52e8351bc0156)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliuff19\u002FSherpa3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fliuff19\u002FSherpa3D)\n\n+ **Learn to Optimize Denoising Scores for 3D Generation: A Unified and Improved Diffusion Prior on NeRF and 3D Gaussian Splatting** (8 Dec 2023)\u003Cdetails>\u003Csummary>Xiaofeng Yang, Yiwen Chen, Cheng Chen, et al. \u003C\u002Fsummary>Xiaofeng Yang, Yiwen Chen, Cheng Chen, Chi Zhang, Yi Xu, Xulei Yang, Fayao Liu, Guosheng Lin\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04820)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=6d10f9b0e0a579a1359df7dfbdef00bc798d5714)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6d10f9b0e0a579a1359df7dfbdef00bc798d5714)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangxiaofeng\u002FLODS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyangxiaofeng\u002FLODS)\n\n+ **DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling** (28 Nov 2023)\u003Cdetails>\u003Csummary>Linqi Zhou, Andy Shih, Chenlin Meng, et al. \u003C\u002Fsummary>Linqi Zhou, Andy Shih, Chenlin Meng, Stefano Ermon\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16918)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=e88d5399956c9d9519a5cfd49308b7d439167543)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe88d5399956c9d9519a5cfd49308b7d439167543)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falexzhou907\u002FDreamPropeller.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Falexzhou907\u002FDreamPropeller)\n\n+ **RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D** (28 Nov 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Lingteng Qiu, Guanying Chen, Xiaodong Gu, et al. \u003C\u002Fsummary>Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong, Liefeng Bo, Xiaoguang Han\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17082)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=cf60a639add75cdea4273697269ee463024b7926)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcf60a639add75cdea4273697269ee463024b7926)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmodelscope\u002Frichdreamer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Frichdreamer)\n\n+ **DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models** (30 Nov 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Yukang Cao, Yan-Pei Cao, Kai Han, et al. 
\u003C\u002Fsummary>Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.00916)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-69-blue.svg?paper=0fa1501c7378a0dca2ac913fce9dcdcc2b1958a7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0fa1501c7378a0dca2ac913fce9dcdcc2b1958a7)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyukangcao\u002FDreamAvatar.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyukangcao\u002FDreamAvatar)\n\n+ **LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching** (2 Dec 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Yixun Liang, Xin Yang, Jiantao Lin, et al. \u003C\u002Fsummary>Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, Yingcong Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.11284)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=6f709278506813d04a074e6fa20188cce9bb927b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6f709278506813d04a074e6fa20188cce9bb927b)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEnVision-Research\u002FLucidDreamer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FEnVision-Research\u002FLucidDreamer)\n\n\n+ **GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models** (12 Oct 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Taoran Yi, Jiemin Fang, Junjie Wang, et al. \u003C\u002Fsummary>Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, Xinggang Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08529)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-230-blue.svg?paper=c5e9fd131cde68c218d0ea69cd617a67c7f35d42)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc5e9fd131cde68c218d0ea69cd617a67c7f35d42)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhustvl\u002FGaussianDreamer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FGaussianDreamer)\n\n+ **Text-to-3D using Gaussian Splatting** (28 Sep 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Zilong Chen, Feng Wang, Huaping Liu \u003C\u002Fsummary>Zilong Chen, Feng Wang, Huaping Liu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.16585)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-57-blue.svg?paper=86b5318b0a69ccdeec17abb0120e4bd7688a4b59)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F86b5318b0a69ccdeec17abb0120e4bd7688a4b59)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgsgen3d\u002Fgsgen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgsgen3d\u002Fgsgen)\n\n+ **EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior** (10 Sep 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Zhipeng Hu, Minda Zhao, Chaoyi Zhao, et al. \u003C\u002Fsummary>Zhipeng Hu, Minda Zhao, Chaoyi Zhao, Xinyue Liang, Lincheng Li, Zeng Zhao, Changjie Fan, Xiaowei Zhou, Xin 
Yu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.13223)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-23-blue.svg?paper=fec17239569efd6914f0df9e25b66b310969d3c5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffec17239569efd6914f0df9e25b66b310969d3c5)\n\n+ **TADA! Text to Animatable Digital Avatars** (21 Aug 2023)\u003Cdetails>\u003Csummary>[3DV 2024] Tingting Liao, Hongwei Yi, Yuliang Xiu, et al.\u003C\u002Fsummary>Tingting Liao, Hongwei Yi, Yuliang Xiu, Jiaxiang Tang, Yangyi Huang, Justus Thies, Michael J. Black\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.10899)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-36-blue.svg?paper=303f466fb823112f79a9f36637c7084dd8363fc5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F303f466fb823112f79a9f36637c7084dd8363fc5)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTingtingLiao\u002FTADA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTingtingLiao\u002FTADA)\n\n+ **SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D** (20 Oct 2023)\u003Cdetails>\u003Csummary>[ICLR 2024] Weiyu Li, Rui Chen, Xuelin Chen, et al.\u003C\u002Fsummary>Weiyu Li, Rui Chen, Xuelin Chen, Ping Tan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02596)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-29-blue.svg?paper=438e9fb79c9e37d43223e61bb575ebd2dae0b0a7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F438e9fb79c9e37d43223e61bb575ebd2dae0b0a7)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwyysf-98\u002FSweetDreamer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fwyysf-98\u002FSweetDreamer)\n\n+ **Noise-Free Score Distillation** (26 Oct 2023)\u003Cdetails>\u003Csummary>[ICLR 2024] Oren Katzir, Or Patashnik, Daniel Cohen-Or, et al.\u003C\u002Fsummary>Oren Katzir, Or Patashnik, Daniel Cohen-Or, Dani Lischinski\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.17590)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-14-blue.svg?paper=85a70c0a048cba4f53dcf332ee73f6032a2e53bc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F85a70c0a048cba4f53dcf332ee73f6032a2e53bc)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Forenkatzir\u002Fnfsd.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Forenkatzir\u002Fnfsd)\n\n+ **Text-to-3D with Classifier Score Distillation** (26 Oct 2023)\u003Cdetails>\u003Csummary>[ICLR 2024] Xin Yu, Yuan-Chen Guo, Yangguang Li, et al. 
\u003C\u002Fsummary>Xin Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Song-Hai Zhang, Xiaojuan Qi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.19415)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=4e21879b564cc2e803b16edf0dda9f1edb91b497)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4e21879b564cc2e803b16edf0dda9f1edb91b497)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCVMI-Lab\u002FClassifier-Score-Distillation.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FCVMI-Lab\u002FClassifier-Score-Distillation)\n\n+ **HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance** (28 Nov 2023)\u003Cdetails>\u003Csummary>[ICLR 2024] Junzhe Zhu, Peiye Zhuang. \u003C\u002Fsummary>Junzhe Zhu, Peiye Zhuang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18766)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-17-blue.svg?paper=daf3b117f789b2b95223e58592979fb57627515e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdaf3b117f789b2b95223e58592979fb57627515e)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJunzheJosephZhu\u002FHiFA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FJunzheJosephZhu\u002FHiFA)\n\n+ **MVDream: Multi-view Diffusion for 3D Generation** (31 Aug 2023)\u003Cdetails>\u003Csummary>[ICLR 2024] Yichun Shi, Peng Wang, Jianglong Ye, et al. \u003C\u002Fsummary>Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, Xiao Yang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.16512)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-154-blue.svg?paper=9aa01997226b5c4d705ae2e2f52c32681006654b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9aa01997226b5c4d705ae2e2f52c32681006654b)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FMVDream.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FMVDream)\n\n+ **DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation** (28 Sep 2023)\u003Cdetails>\u003Csummary>[ICLR 2024] Jiaxiang Tang, Jiawei Ren, Hang Zhou, et al.\u003C\u002Fsummary>Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, Gang Zeng\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.16653)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-99-blue.svg?paper=cc1a674bb164d09a060cf5b26fe518c02fae0ddc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcc1a674bb164d09a060cf5b26fe518c02fae0ddc)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdreamgaussian\u002Fdreamgaussian.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdreamgaussian\u002Fdreamgaussian)\n\n+ **Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation** (11 Apr 2023)\u003Cdetails>\u003Csummary>[ICLR 2024] Junyoung Seo, Wooseok Jang, Min-Seop Kwak, et al.\u003C\u002Fsummary>Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Hyeonsu Kim, Jaehoon Ko, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, Seungryong 
Kim\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.07937)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-81-blue.svg?paper=5356c3dac654854a0842753bcc2e3433dc4a2afd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5356c3dac654854a0842753bcc2e3433dc4a2afd)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKU-CVLAB\u002F3DFuse.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FKU-CVLAB\u002F3DFuse)\n\n\n+ **IT3D: Improved Text-to-3D Generation with Explicit View Synthesis** (22 Aug 2023)\u003Cdetails>\u003Csummary>[AAAI 2024] Yiwen Chen, Chi Zhang, Xiaofeng Yang, et al. \u003C\u002Fsummary>Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, Guosheng Lin\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.11473)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-30-blue.svg?paper=2b94785cbfd865a01cc68d7d4c7500b710e5e2fb)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2b94785cbfd865a01cc68d7d4c7500b710e5e2fb)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbuaacyw\u002FIT3D-text-to-3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fbuaacyw\u002FIT3D-text-to-3D)\n\n\n+ **HD-Fusion: Detailed Text-to-3D Generation Leveraging Multiple Noise Estimation** (30 Jul 2023)\u003Cdetails>\u003Csummary>[WACV 2024] Jinbo Wu, Xiaobo Gao, Xing Liu, et al. \u003C\u002Fsummary>Jinbo Wu, Xiaobo Gao, Xing Liu, Zhengyang Shen, Chen Zhao, Haocheng Feng, Jingtuo Liu, Errui Ding\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16183)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=d8aaed01dffc621488aecbb0ef01b50f86e44bc1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd8aaed01dffc621488aecbb0ef01b50f86e44bc1)\n\n\n+ **Re-imagine the Negative Prompt Algorithm: Transform 2D Diffusion into 3D, alleviate Janus problem and Beyond** (11 Apr 2023)\u003Cdetails>\u003Csummary>Mohammadreza Armandpour, Ali Sadeghian, Huangjie Zheng, et al. 
\u003C\u002Fsummary>Mohammadreza Armandpour, Ali Sadeghian, Huangjie Zheng, Amir Sadeghian, Mingyuan Zhou\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.04968)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-50-blue.svg?paper=b19ca192a5bebbc3473be61989baf085ff21daa5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb19ca192a5bebbc3473be61989baf085ff21daa5)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPerp-Neg\u002FPerp-Neg-stablediffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPerp-Neg\u002FPerp-Neg-stablediffusion)\n\n\n+ **Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures** (14 Nov 2022)\u003Cdetails>\u003Csummary>[CVPR 2023] Gal Metzer, Elad Richardson, Or Patashnik, et al.\u003C\u002Fsummary>Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, Daniel Cohen-Or\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.07600)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-212-blue.svg?paper=793939b83e10903f58d8edbb7534963df627a1fe)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F793939b83e10903f58d8edbb7534963df627a1fe)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Feladrich\u002Flatent-nerf.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Feladrich\u002Flatent-nerf)\n\n+ **Magic3D: High-Resolution Text-to-3D Content Creation** (18 Nov 2022)\u003Cdetails>\u003Csummary>[CVPR 2023 Highlight] Chen-Hsuan Lin, Jun Gao, Luming Tang, et al. \u003C\u002Fsummary>Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, Tsung-Yi Lin\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FLin_Magic3D_High-Resolution_Text-to-3D_Content_Creation_CVPR_2023_paper.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-466-blue.svg?paper=bdf4af8311637c681904e71cf50f96fd0026f578)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbdf4af8311637c681904e71cf50f96fd0026f578)\n\n+ **Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation** (1 Dec 2022)\u003Cdetails>\u003Csummary>[CVPR 2023] Haochen Wang, Xiaodan Du, Jiahao Li, et al. \u003C\u002Fsummary>Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, Greg Shakhnarovich\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.00774)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-238-blue.svg?paper=fc011ed5ee986332523a62d2783adee1179dc1ed)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffc011ed5ee986332523a62d2783adee1179dc1ed)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpals-ttic\u002Fsjc.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fpals-ttic\u002Fsjc)\n\n+ **High-fidelity 3D Face Generation from Natural Language Descriptions** (5 May 2023)\u003Cdetails>\u003Csummary>[CVPR 2023] Menghua Wu, Hao Zhu, Linjia Huang, et al. 
\u003C\u002Fsummary>Menghua Wu, Hao Zhu, Linjia Huang, Yiyu Zhuang, Yuanxun Lu, Xun Cao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.03302)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=012d7d3ee690e5acadf416787651a8fe425e8eb3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F012d7d3ee690e5acadf416787651a8fe425e8eb3)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhuhao-nju\u002Fdescribe3d.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzhuhao-nju\u002Fdescribe3d)\n\n+ **RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion** (12 Dec 2022)\u003Cdetails>\u003Csummary>[CVPR 2023 Highlight] Tengfei Wang, Bo Zhang, Ting Zhang, et al. \u003C\u002Fsummary>Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, Baining Guo\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.06135)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-158-blue.svg?paper=7e993a9ca01dcd4538362454aaac29a18a63c000)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7e993a9ca01dcd4538362454aaac29a18a63c000)\n\n+ **ClipFace: Text-guided Editing of Textured 3D Morphable Models** (24 Apr 2023)\u003Cdetails>\u003Csummary>[SIGGRAPH 2023] Shivangi Aneja, Justus Thies, Angela Dai, et al. \u003C\u002Fsummary>Shivangi Aneja, Justus Thies, Angela Dai, Matthias Nießner\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.01406)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-40-blue.svg?paper=f21e8eddf42580d1f38a11ec5acd8891c0454a1f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff21e8eddf42580d1f38a11ec5acd8891c0454a1f)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshivangi-aneja\u002FClipFace.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshivangi-aneja\u002FClipFace)\n\n+ **DreamFusion: Text-to-3D using 2D Diffusion** (29 Sep 2022)\u003Cdetails>\u003Csummary>[ICLR 2023 Oral] Ben Poole, Ajay Jain, Jonathan T. Barron, et al.\u003C\u002Fsummary>Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.14988)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-908-blue.svg?paper=4c94d04afa4309ec2f06bdd0fe3781f91461b362)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4c94d04afa4309ec2f06bdd0fe3781f91461b362)\n\n+ **ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation** (25 May 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023 Spotlight] Zhengyi Wang, Cheng Lu, Yikai Wang, et al. 
\u003C\u002Fsummary>Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.16213)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002Fprolificdreamer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002Fprolificdreamer)\n\n+ **HeadSculpt: Crafting 3D Head Avatars with Text** (25 May 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Xiao Han, Yukang Cao, Kai Han, et al. \u003C\u002Fsummary>Xiao Han, Yukang Cao, Kai Han, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang, Kwan-Yee K. Wong\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.03038)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-230-blue.svg?paper=4e8cf9602d4ef714dcdb8580de40e1a2a717ab11)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4e8cf9602d4ef714dcdb8580de40e1a2a717ab11)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBrandonHanx\u002FHeadSculpt.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FBrandonHanx\u002FHeadSculpt)\n\n+ **ATT3D: Amortized Text-to-3D Object Synthesis** (6 Jun 2023)\u003Cdetails>\u003Csummary>[ICCV 2023] Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, et al. \u003C\u002Fsummary>Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp, Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, James Lucas\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.07349)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-25-blue.svg?paper=1e8403af2e1e7a8f803d8df9e8daac584f99c2a0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1e8403af2e1e7a8f803d8df9e8daac584f99c2a0)\n\n+ **Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation** (24 Mar 2023)\u003Cdetails>\u003Csummary>[ICCV 2023] Rui Chen, Yongwei Chen, Ningxin Jiao, et al. \u003C\u002Fsummary>Rui Chen, Yongwei Chen, Ningxin Jiao, Kui Jia\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.13873)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-202-blue.svg?paper=0cbb518c364067200476a51e5ce7476a4f582770)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0cbb518c364067200476a51e5ce7476a4f582770)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGorilla-Lab-SCUT\u002FFantasia3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FGorilla-Lab-SCUT\u002FFantasia3D)\n\n+ **Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models** (10 Sep 2023)\u003Cdetails>\u003Csummary>[ICCV 2023] Lukas Höllein, Ang Cao, Andrew Owens, et al. 
\u003C\u002Fsummary>Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, Matthias Nießner\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.11989)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-55-blue.svg?paper=95aa6fa4e42387561cff22378348d528adea37f2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F95aa6fa4e42387561cff22378348d528adea37f2)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FlukasHoel\u002Ftext2room.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FlukasHoel\u002Ftext2room)\n\n+ **X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance** (28 Mar 2023) \u003Cdetails>\u003Csummary>[ICCV 2023] Yiwei Ma, Xiaoqing Zhang, Xiaoshuai Sun, et al.\u003C\u002Fsummary>Yiwei Ma, Xiaoqing Zhang, Xiaoshuai Sun, Jiayi Ji, Haowei Wang, Guannan Jiang, Weilin Zhuang, Rongrong Ji\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.15764)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-18-blue.svg?paper=f8bf2225a2993e3ead73d886b5797378d6e53186)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff8bf2225a2993e3ead73d886b5797378d6e53186)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxmu-xiaoma666\u002FX-Mesh.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fxmu-xiaoma666\u002FX-Mesh)\n\n+ **StyleAvatar3D: Leveraging Image-Text Diffusion Models for High-Fidelity 3D Avatar Generation** (31 May 2023) \u003Cdetails>\u003Csummary>Chi Zhang, Yiwen Chen, Yijun Fu, et al.\u003C\u002Fsummary>Chi Zhang, Yiwen Chen, Yijun Fu, Zhenglin Zhou, Gang YU, Billzb Wang, Bin Fu, Tao Chen, Guosheng Lin, Chunhua Shen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.19012)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=b980d98c81252dfbed334728c46625e58f54dd9d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb980d98c81252dfbed334728c46625e58f54dd9d)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ficoz69\u002FStyleAvatar3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ficoz69\u002FStyleAvatar3D)\n\n+ **TextMesh: Generation of Realistic 3D Meshes From Text Prompts** (24 Apr 2023) \u003Cdetails>\u003Csummary>[3DV 2023] Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, et al.\u003C\u002Fsummary>Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, Michael Niemeyer, Federico Tombari\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.12439)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-67-blue.svg?paper=2c6392491b6a942e08db46c8fff0ef5ba1fd9de8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2c6392491b6a942e08db46c8fff0ef5ba1fd9de8)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthreestudio-project\u002Fthreestudio.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthreestudio-project\u002Fthreestudio)\n\n+ **CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation** (28 Apr 2022)\u003Cdetails>\u003Csummary>[CVPR 2022] Aditya Sanghi, Hang Chu, Joseph G. Lambourne, et al. 
\u003C\u002Fsummary>Aditya Sanghi, Hang Chu, Joseph G. Lambourne, Ye Wang, Chin-Yi Cheng, Marco Fumero, Kamal Rahimi Malekshan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.02624)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-197-blue.svg?paper=738e3e0623054da29dc57fc6aee5e6711867c4e8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F738e3e0623054da29dc57fc6aee5e6711867c4e8)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAutodeskAILab\u002FClip-Forge.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAutodeskAILab\u002FClip-Forge)\n\n\n+ **Zero-Shot Text-Guided Object Generation with Dream Fields** (2 Dec 2021) \u003Cdetails>\u003Csummary>[CVPR 2022] Ajay Jain, Ben Mildenhall, Jonathan T. Barron, et al.\u003C\u002Fsummary>Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, Ben Poole\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.01455)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-345-blue.svg?paper=03e1c3b5fdad9b21bbed3d13af7e8d6c73cbcfa6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F03e1c3b5fdad9b21bbed3d13af7e8d6c73cbcfa6)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fajayj.com\u002Fdreamfields)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle-research\u002Fgoogle-research.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002F)\n\n+ **Text2Mesh: Text-Driven Neural Stylization for Meshes** (6 Dec 2021) \u003Cdetails>\u003Csummary>[CVPR 2022] Oscar Michel, Roi Bar-On, Richard Liu, et al. \u003C\u002Fsummary>Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, Rana Hanocka\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.03221)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-222-blue.svg?paper=d15b27edf3630728cdb40f49946365d9011641cf)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd15b27edf3630728cdb40f49946365d9011641cf)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthreedle\u002Ftext2mesh.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthreedle\u002Ftext2mesh)\n\n+ **TANGO: Text-driven Photorealistic and Robust 3D Stylization via Lighting Decomposition** (20 Oct 2022) \u003Cdetails>\u003Csummary>[NeurIPS 2022 Spotlight] Yongwei Chen, Rui Chen, Jiabao Lei, et al. 
\u003C\u002Fsummary>Yongwei Chen, Rui Chen, Jiabao Lei, Yabin Zhang, Kui Jia\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.11277)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-47-blue.svg?paper=44e49f72fb6b97f52c25a30f0adc68c2384430ba)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F44e49f72fb6b97f52c25a30f0adc68c2384430ba)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGorilla-Lab-SCUT\u002Ftango.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FGorilla-Lab-SCUT\u002Ftango)\n\n+ **CLIP-Mesh: Generating textured meshes from text using pretrained image-text models** (24 Mar 2022) \u003Cdetails>\u003Csummary>[SIGGRAPH ASIA 2022] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, et al. \u003C\u002Fsummary>Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, Tiberiu Popa\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.13333)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-167-blue.svg?paper=8941e477b2f39eb92712f04400412da60d349ec1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8941e477b2f39eb92712f04400412da60d349ec1)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNasirKhalid24\u002FCLIP-Mesh.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNasirKhalid24\u002FCLIP-Mesh)\n\n+ **MotionCLIP: Exposing Human Motion Generation to CLIP Space** (15 Mar 2022) \u003Cdetails>\u003Csummary>[ECCV 2022] Guy Tevet, Brian Gordon, Amir Hertz, et al. \u003C\u002Fsummary>Guy Tevet, Brian Gordon, Amir Hertz, Amit H. Bermano, Daniel Cohen-Or\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.08063)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-140-blue.svg?paper=e82df4b6a3628501fce67835ad8316d6525ad133)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe82df4b6a3628501fce67835ad8316d6525ad133)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGuyTevet\u002FMotionCLIP.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FGuyTevet\u002FMotionCLIP)\n\n  \n### Datasets\n\n+ **Objaverse-XL: A Universe of 10M+ 3D Objects** (11 Jul 2023) \u003Cdetails>\u003Csummary>Matt Deitke, Ruoshi Liu, Matthew Wallingford, et al. 
\u003C\u002Fsummary>Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, Ali Farhadi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.05663)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-70-blue.svg?paper=1b90e9e9734bed6b379ae87d688cb3b887baf597)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1b90e9e9734bed6b379ae87d688cb3b887baf597)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Fobjaverse-xl.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fobjaverse-xl)\n\n+ **Objaverse: A Universe of Annotated 3D Objects** (15 Dec 2022) \u003Cdetails>\u003Csummary>[CVPR 2023] Matt Deitke, Dustin Schwenk, Jordi Salvador, et al. \u003C\u002Fsummary>Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, Ali Farhadi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.08051)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-230-blue.svg?paper=1b31dbf44e68b698120552366df03e6e35a1e428)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1b31dbf44e68b698120552366df03e6e35a1e428)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Fobjaverse-xl.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fobjaverse-xl)\n\n \n## Audio Generation\n\n### 🔅 LLM-based\n\n+ **SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation** (27 Feb 2024)\u003Cdetails>\u003Csummary>Shuangrui Ding, Zihan Liu, Xiaoyi Dong, et al.\u003C\u002Fsummary>Shuangrui Ding, Zihan Liu, Xiaoyi Dong, Pan Zhang, Rui Qian, Conghui He, Dahua Lin, Jiaqi Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17645)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=e7c8a74423a5811a3aac5f33001fce32d2e2386c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe7c8a74423a5811a3aac5f33001fce32d2e2386c)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fpjlab-songcomposer.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpjlab-songcomposer\u002Fsongcomposer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fpjlab-songcomposer\u002Fsongcomposer)\n\n+ **ChatMusician: Understanding and Generating Music Intrinsically with LLM** (25 Feb 2024)\u003Cdetails>\u003Csummary>Ruibin Yuan, Hanfeng Lin, Yi Wang, et al.\u003C\u002Fsummary>Ruibin Yuan, Hanfeng Lin, Yi Wang, Zeyue Tian, Shangda Wu, Tianhao Shen, Ge Zhang, Yuhang Wu, Cong Liu, Ziya Zhou, Ziyang Ma, Liumeng Xue, Ziyu Wang, Qin Liu, Tianyu Zheng, Yizhi Li, Yinghao Ma, Yiming Liang, Xiaowei Chi, Ruibo Liu, Zili Wang, Pengfei Li, Jingcheng Wu, Chenghua Lin, Qifeng Liu, Tao Jiang, Wenhao Huang, Wenhu Chen, Emmanouil Benetos, Jie Fu, Gus Xia, Roger Dannenberg, Wei Xue, Shiyin Kang, Yike 
Guo\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.16153)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=48494aa30f35a64858644aba839c8cba38c0cf2a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F48494aa30f35a64858644aba839c8cba38c0cf2a)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fshanghaicannon.github.io\u002FChatMusician\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhf-lin\u002FChatMusician.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhf-lin\u002FChatMusician)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fm-a-p\u002FChatMusician)\n\n+ **AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling** (19 Feb 2024)\u003Cdetails>\u003Csummary>Jun Zhan, Junqi Dai, Jiasheng Ye, et al.\u003C\u002Fsummary>Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12226)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=14191e9f12913ad8c7ac6e1188682afac04aad09)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F14191e9f12913ad8c7ac6e1188682afac04aad09)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjunzhan2000.github.io\u002FAnyGPT.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenMOSS\u002FAnyGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FAnyGPT)\n\n+ **Boosting Large Language Model for Speech Synthesis: An Empirical Study** (30 Dec 2023)\u003Cdetails>\u003Csummary>Hongkun Hao, Long Zhou, Shujie Liu, et al.\u003C\u002Fsummary>Hongkun Hao, Long Zhou, Shujie Liu, Jinyu Li, Shujie Hu, Rui Wang, Furu Wei\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.00246)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=c1dd77e48dd615ee6881b2cc876a00a92cae6eac)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc1dd77e48dd615ee6881b2cc876a00a92cae6eac)\n\n+ **Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action** (28 Dec 2023)\u003Cdetails>\u003Csummary>Jiasen Lu, Christopher Clark, Sangho Lee, et al.\u003C\u002Fsummary>Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha 
Kembhavi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17172)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=6c64ddd2190909de2c680dd18abc9b92e80c39f9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6c64ddd2190909de2c680dd18abc9b92e80c39f9)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Funified-io-2.allenai.org\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Funified-io-2.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fallenai\u002Funified-io-2)\n\n+ **M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models** (19 Nov 2023)\u003Cdetails>\u003Csummary>Atin Sakkeer Hussain, Shansong Liu, Chenshuo Sun, et al.\u003C\u002Fsummary>Atin Sakkeer Hussain, Shansong Liu, Chenshuo Sun, Ying Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.11255)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=1e84d7c45f70038574fcdb7bc1b20da9b348a092)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1e84d7c45f70038574fcdb7bc1b20da9b348a092)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcrypto-code.github.io\u002FM2UGen-Demo\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshansongliu\u002FM2UGen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshansongliu\u002FM2UGen)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FM2UGen\u002FM2UGen-Demo)\n\n+ **LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT** (7 Oct 2023)\u003Cdetails>\u003Csummary>Jiaming Wang, Zhihao Du, Qian Chen, et al.\u003C\u002Fsummary>Jiaming Wang, Zhihao Du, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.04673)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=ffa05cb5504ba08254f498223f613b3ebcf87692)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fffa05cb5504ba08254f498223f613b3ebcf87692)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flauragpt.github.io\u002F)\n\n+ **LLaSM: Large Language and Speech Model** (30 Aug 2023)\u003Cdetails>\u003Csummary>Yu Shu, Siwei Dong, Guangyao Chen, et al.\u003C\u002Fsummary>Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin 
Shi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.15930)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=7b22ecd9f1ced58c1704ac6191e029b98054e330)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7b22ecd9f1ced58c1704ac6191e029b98054e330)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLinkSoul\u002FLLaSM)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLinkSoul-AI\u002FLLaSM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLinkSoul-AI\u002FLLaSM)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLinkSoul\u002FLLaSM)\n\n+ **AudioPaLM: A Large Language Model That Can Speak and Listen** (22 Jun 2023)\u003Cdetails>\u003Csummary>Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, et al.\u003C\u002Fsummary>Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.12925)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-69-blue.svg?paper=3efb81de24eb88017d6dbcf22cb4215084223fd8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3efb81de24eb88017d6dbcf22cb4215084223fd8)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgoogle-research.github.io\u002Fseanet\u002Faudiopalm\u002Fexamples\u002F)\n\n+ **Pengi: An Audio Language Model for Audio Tasks** (19 May 2023)\u003Cdetails>\u003Csummary>Soham Deshmukh, Benjamin Elizalde, Rita Singh, et al.\u003C\u002Fsummary>Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11834)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-35-blue.svg?paper=ad22af138fa1d1490cda0301abf8159a7c30c5a2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fad22af138fa1d1490cda0301abf8159a7c30c5a2)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPengi)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FPengi.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPengi)\n\n+ **SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities** (18 May 2023)\u003Cdetails>\u003Csummary>Dong Zhang, Shimin Li, Xin Zhang, et al.\u003C\u002Fsummary>Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng 
Qiu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11000)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-76-blue.svg?paper=5cac6430bd379c9d2fe13137dfd6ae7721a2679f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5cac6430bd379c9d2fe13137dfd6ae7721a2679f)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002F0nutation.github.io\u002FSpeechGPT.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F0nutation\u002FSpeechGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002F0nutation\u002FSpeechGPT)\n\n+ **Sparks of Artificial General Intelligence: Early experiments with GPT-4** (22 Mar 2023)\u003Cdetails>\u003Csummary>Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, et al.\u003C\u002Fsummary>Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12712)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1407-blue.svg?paper=574beee702be3856d60aa482ec725168fe64fc99)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F574beee702be3856d60aa482ec725168fe64fc99)\n\n### Non-LLM-based\n+ **Audiobox: Unified Audio Generation with Natural Language Prompts** (25 Dec 2023)\\\nApoorv Vyas, Bowen Shi, Matthew Le\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.15821)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=f124ae1e4663359193be32adb37b07b3252d5329)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff124ae1e4663359193be32adb37b07b3252d5329)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Faudiobox.metademolab.com\u002F)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Faudiobox.metademolab.com\u002Fcapabilities)\n\n+ **Music ControlNet: Multiple Time-varying Controls for Music Generation** (13 Nov 2023)\u003Cdetails>\u003Csummary>Shih-Lun Wu, Chris Donahue, Shinji Watanabe, et al.\u003C\u002Fsummary>Shih-Lun Wu, Chris Donahue, Shinji Watanabe, Nicholas J. 
Bryan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.07069)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=42239e71a712d70cd24e06ffc0cf0d22fc628a36)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F42239e71a712d70cd24e06ffc0cf0d22fc628a36)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmusiccontrolnet.github.io\u002Fweb\u002F)\n\n+ **Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing** (19 Oct 2023)\u003Cdetails>\u003Csummary>Yixiao Zhang, Akira Maezawa, Gus Xia, et al.\u003C\u002Fsummary>Yixiao Zhang, Akira Maezawa, Gus Xia, Kazuhiko Yamamoto, Simon Dixon\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.12404)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=cca4218dd7c10c1614bbd84aa7cd7e00027bdc7c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcca4218dd7c10c1614bbd84aa7cd7e00027bdc7c)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsites.google.com\u002Fview\u002Floop-copilot)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fldzhangyx\u002Floop-copilot.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fldzhangyx\u002Floop-copilot\u002F)\n\n+ **MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models** (18 Oct 2023)\u003Cdetails>\u003Csummary>Dingyao Yu, Kaitao Song, Peiling Lu, et al.\u003C\u002Fsummary>Dingyao Yu, Kaitao Song, Peiling Lu, Tianyu He, Xu Tan, Wei Ye, Shikun Zhang, Jiang Bian\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11954)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=beaf64df85f8204b8cd89a7f46827608e6d16922)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbeaf64df85f8204b8cd89a7f46827608e6d16922)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fmuzic.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmuzic\u002Ftree\u002Fmain\u002Fmusicagent)\n\n+ **UniAudio: An Audio Foundation Model Toward Universal Audio Generation** (1 Oct 2023)\\\nDongchao Yang, Jinchuan Tian, Xu Tan\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00704)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-15-blue.svg?paper=74bfbbb7307a7af2686043ea97ab8b34cb062ba8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F74bfbbb7307a7af2686043ea97ab8b34cb062ba8)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdongchaoyang.top\u002FUniAudio_demo\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangdongchao\u002FUniAudio.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyangdongchao\u002FUniAudio)\n\n+ **AudioLM: a Language Modeling Approach to Audio Generation** (7 Sep 2022)\u003Cdetails>\u003Csummary>Zalán Borsos, Raphaël Marinier, Damien Vincent, et al. 
(IEEE\u002FACM Transactions on Audio, Speech, and Language Processing)\u003C\u002Fsummary>Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.03143)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-232-blue.svg?paper=8c870bef01a4fbb20f60722ffc2f6bee3870b18b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8c870bef01a4fbb20f60722ffc2f6bee3870b18b)\n\n+ **WavJourney: Compositional Audio Creation with Large Language Models** (26 Jul 2023)\u003Cdetails>\u003Csummary>Xubo Liu, Zhongkai Zhu, Haohe Liu, et al.\u003C\u002Fsummary>Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.14335)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=aa7bcd1f9453c9096ec78900a7b94e816ed0e1c5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Faa7bcd1f9453c9096ec78900a7b94e816ed0e1c5)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Faudio-agi.github.io\u002FWavJourney_demopage\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAudio-AGI\u002FWavJourney.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAudio-AGI\u002FWavJourney)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAudio-AGI\u002FWavJourney)\n\n+ **Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody** (16 Jun 2023)\u003Cdetails>\u003Csummary>Sofoklis Kakouros, Juraj Šimko, Martti Vainio, et al. 
(SSW 2023)\u003C\u002Fsummary>Sofoklis Kakouros, Juraj Šimko, Martti Vainio, Antti Suni\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.09814)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=63aad36dc981348493be6743292a04234b29ba4e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F63aad36dc981348493be6743292a04234b29ba4e)\n\n+ **Simple and Controllable Music Generation** (8 Jun 2023)\u003Cdetails>\u003Csummary>Jade Copet, Felix Kreuk, Itai Gat, et al.\u003C\u002Fsummary>Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05284)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-71-blue.svg?paper=4cc8e18f5eece0b0d8e1abcb8ee10fb33680fbb2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4cc8e18f5eece0b0d8e1abcb8ee10fb33680fbb2)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fai.honu.io\u002Fpapers\u002Fmusicgen\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Faudiocraft.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Faudiocraft)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ffacebook\u002FMusicGen)\n\n+ **Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation** (29 May 2023)\u003Cdetails>\u003Csummary>Jiawei Huang, Yi Ren, Rongjie Huang, et al.\u003C\u002Fsummary>Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18474)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=83d4b22d803ae856cf6b308482bd504fa151d39e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F83d4b22d803ae856cf6b308482bd504fa151d39e)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmake-an-audio-2.github.io\u002F)\n\n+ **Jukebox: A Generative Model for Music** (30 Apr 2020)\u003Cdetails>\u003Csummary>Prafulla Dhariwal, Heewoo Jun, Christine Payne, et al.\u003C\u002Fsummary>Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.00341)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-477-blue.svg?paper=67dea28495cab71703993d0d52ca4733b9a66077)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F67dea28495cab71703993d0d52ca4733b9a66077)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fopenai.com\u002Fresearch\u002Fjukebox)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopenai\u002Fjukebox.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fjukebox)\n\n+ **AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head** (25 Apr 
2023)\u003Cdetails>\u003Csummary>Rongjie Huang, Mingze Li, Dongchao Yang, et al.\u003C\u002Fsummary>Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.12995)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-83-blue.svg?paper=8bc617c9139648d7a92991d70c671230bac7b2e2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8bc617c9139648d7a92991d70c671230bac7b2e2)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIGC-Audio\u002FAudioGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAIGC-Audio\u002FAudioGPT)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAIGC-Audio\u002FAudioGPT)\n\n+ **TANGO: Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model** (24 Apr 2023)\u003Cdetails>\u003Csummary>Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, et al.\u003C\u002Fsummary>Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Soujanya Poria\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.13731)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-51-blue.svg?paper=f51bc74814a3452009ea5ca262d9768d08149ee6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff51bc74814a3452009ea5ca262d9768d08149ee6)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ftango-web.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeclare-lab\u002Ftango.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdeclare-lab\u002Ftango)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fdeclare-lab\u002Ftango)\n\n+ **HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face** (30 Mar 2023)\u003Cdetails>\u003Csummary>Yongliang Shen, Kaitao Song, Xu Tan, et al.\u003C\u002Fsummary>Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.17580)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-413-blue.svg?paper=d1120d67b700e4dfe8b39eb1e48fbdea4e1a0c43)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd1120d67b700e4dfe8b39eb1e48fbdea4e1a0c43)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FJARVIS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FJARVIS)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FHuggingGPT)\n\n+ **Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers** (5 Jan 2023)\u003Cdetails>\u003Csummary>Chengyi Wang, Sanyuan Chen, Yu Wu, et al.\u003C\u002Fsummary>Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu 
Wei\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.02111)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-203-blue.svg?paper=c2f91f35df893714418cc29096083dce0b441229)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc2f91f35df893714418cc29096083dce0b441229)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fproject\u002Fvall-e-x\u002F)\n\n+ **MusicLM: Generating Music From Text** (26 Jan 2023)\u003Cdetails>\u003Csummary>Andrea Agostinelli, Timo I. Denk, Zalán Borsos, et al.\u003C\u002Fsummary>Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.11325)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-171-blue.svg?paper=428854d9e75f94f0e61f37c6887c77800437d516)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F428854d9e75f94f0e61f37c6887c77800437d516)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgoogle-research.github.io\u002Fseanet\u002Fmusiclm\u002Fexamples\u002F)\n\n### Datasets\n+ **Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context** (15 Sep 2023)\u003Cdetails>\u003Csummary>Wei Kang, Xiaoyu Yang, Zengwei Yao, et al.\u003C\u002Fsummary>Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Yifan Yang, Liyong Guo, Long Lin, Daniel Povey\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.08105)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=e99b45179686982401d2d6ec919e42b327f04c0b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe99b45179686982401d2d6ec919e42b327f04c0b)\n\n+ **WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition** (7 Oct 2021)\u003Cdetails>\u003Csummary>BinBin Zhang, Hang Lv, Pengcheng Guo, et al.\u003C\u002Fsummary>BinBin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, Zhendong Peng\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.03370)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-103-blue.svg?paper=9de3ac21af795dac56f6031e73db8198716bb352)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9de3ac21af795dac56f6031e73db8198716bb352)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwenet.org.cn\u002FWenetSpeech\u002F)\n\n+ **VGGSound: A large-scale audio-visual dataset** (29 Apr 2020)\u003Cdetails>\u003Csummary>Honglie Chen, Weidi Xie, Andrea Vedaldi, et al. 
(ICASSP)\u003C\u002Fsummary>Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.14368)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-316-blue.svg?paper=66831f683141c11ed7e20b0f2e8b40700740c164)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F66831f683141c11ed7e20b0f2e8b40700740c164)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fdata\u002Fvggsound\u002F)\n\n+ **Libri-Light: A Benchmark for ASR with Limited or No Supervision** (17 Dec 2019)\u003Cdetails>\u003Csummary>Jacob Kahn, Morgane Rivière, Weiyi Zheng, et al. (ICASSP)\u003C\u002Fsummary>Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdel-rahman Mohamed, Emmanuel Dupoux\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9052942)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-449-blue.svg?paper=f59c038dee828e0a8c2fc28130d12e39ee4952d6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff59c038dee828e0a8c2fc28130d12e39ee4952d6)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flibri-light)\n\n+ **The MTG-Jamendo dataset for automatic music tagging** (15 Jun 2019)\u003Cdetails>\u003Csummary>Dmitry Bogdanov, Minz Won, Philip Tovstogan, et al. (ICML)\u003C\u002Fsummary>Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, Xavier Serra\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Frepositori.upf.edu\u002Fbitstream\u002Fhandle\u002F10230\u002F42015\u002Fbogdanov_ICML2019__Jamendo.pdf?sequence=1&isAllowed=y)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-102-blue.svg?paper=23037085b0815455e6d47333089b925c8c0e21d5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F23037085b0815455e6d47333089b925c8c0e21d5)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmtg.github.io\u002Fmtg-jamendo-dataset\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMTG\u002Fmtg-jamendo-dataset.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMTG\u002Fmtg-jamendo-dataset)\n\n+ **LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech** (5 Apr 2019)\u003Cdetails>\u003Csummary>Heiga Zen, Viet Dang, Rob Clark, et al.\u003C\u002Fsummary>Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. 
Weiss, Ye Jia, Zhifeng Chen, Yonghui Wu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.02882)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-555-blue.svg?paper=2789b6c84ba1422746246685001accba5563e7c1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2789b6c84ba1422746246685001accba5563e7c1)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.openslr.org\u002F60\u002F)\n\n+ **Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset** (29 Oct 2018)\u003Cdetails>\u003Csummary>Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, et al.\u003C\u002Fsummary>Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.12247)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-331-blue.svg?paper=2603a68b4503ba949c91c7e00cd342624b4aae2f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2603a68b4503ba949c91c7e00cd342624b4aae2f)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmagenta.tensorflow.org\u002Fdatasets\u002Fmaestro)\n\n+ **Audio Set: An ontology and human-labeled dataset for audio events** (5 Mar 2017)\u003Cdetails>\u003Csummary>Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, et al. (ICASSP)\u003C\u002Fsummary>Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, Marvin Ritter\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fstatic.googleusercontent.com\u002Fmedia\u002Fresearch.google.com\u002Fzh-CN\u002F\u002Fpubs\u002Farchive\u002F45857.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2338-blue.svg?paper=5ba2218b708ca64ab556e39d5997202e012717d5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5ba2218b708ca64ab556e39d5997202e012717d5)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fresearch.google.com\u002Faudioset\u002Findex.html)\n\n+ **Librispeech: An ASR corpus based on public domain audio books** (19 Apr 2015)\u003Cdetails>\u003Csummary>Vassil Panayotov, Guoguo Chen, Daniel Povey, et al. (ICASSP)\u003C\u002Fsummary>Vassil Panayotov, Guoguo Chen, Daniel Povey, Sanjeev Khudanpur\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7178964)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4815-blue.svg?paper=34038d9424ce602d7ac917a4e582d977725d4393)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F34038d9424ce602d7ac917a4e582d977725d4393)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.openslr.org\u002F12)\n\n+ **Evaluation of Algorithms Using Games: The Case of Music Tagging** (26 Oct 2009)\u003Cdetails>\u003Csummary>Edith Law, Kris West, Michael Mandel, et al. (ISMIR)\u003C\u002Fsummary>Edith Law, Kris West, Michael Mandel, Mert Bay, J. 
Stephen Downie\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fismir2009.ismir.net\u002Fproceedings\u002FOS5-5.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-214-blue.svg?paper=8a1384e041cc6ea2735b01c734aeef666dc92884)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8a1384e041cc6ea2735b01c734aeef666dc92884)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmirg.city.ac.uk\u002Fcodeapps\u002Fthe-magnatagatune-dataset)\n\n## Generation with Multiple Modalities\n### 🔅 LLM-based\n\n\n\n+ **C3LLM: Conditional Multimodal Content Generation Using Large Language Models** (25 May 2024) \u003Cdetails>\u003Csummary>Zixuan Wang, Qinkai Duan, Yu-Wing Tai, et al.\u003C\u002Fsummary>Zixuan Wang, Qinkai Duan, Yu-Wing Tai, Chi-Keung Tang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16136)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=78582ad19779a69d97b797a3c6eb2397f99398b6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FC3LLM%3A-Conditional-Multimodal-Content-Generation-Wang-Duan\u002Ff9151b94ff5476cf155c085cf4c3280715cf9bde)\n\n\n\n+ **CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation** (30 Nov 2023) \u003Cdetails>\u003Csummary>Zineng Tang, Ziyi Yang, Mahmoud Khademi, et al.\u003C\u002Fsummary>Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.18775)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=78582ad19779a69d97b797a3c6eb2397f99398b6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F78582ad19779a69d97b797a3c6eb2397f99398b6)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcodi-2.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fi-Code.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fi-Code\u002Ftree\u002Fmain\u002FCoDi-2)\n\n\n\n+ **TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models** (8 Nov 2023)\u003Cdetails>\u003Csummary>Zhen Yang, Yingxue Zhang, Fandong Meng, et al.\u003C\u002Fsummary>Zhen Yang, Yingxue Zhang, Fandong Meng, Jie Zhou\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.04589)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=9f411fda2ad5b141a3115f707bcf5ee865b3fb94)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FTEAL%3A-Tokenize-and-Embed-ALL-for-Multi-modal-Large-Yang-Zhang\u002F59d716b442ab760a78f58de6748c0fa1d507bfc1)\n`tokenizer`\n\n\n\n+ **NExT-GPT: Any-to-Any Multimodal LLM** (11 Sep 2023)\u003Cdetails>\u003Csummary>Shengqiong Wu, Hao Fei, Leigang Qu, et al.\u003C\u002Fsummary>Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng 
Chua\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05519)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-94-blue.svg?paper=fa75a55760e6ea49b39b83cb85c99a22e1088254)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffa75a55760e6ea49b39b83cb85c99a22e1088254)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fnext-gpt.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNExT-GPT\u002FNExT-GPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNExT-GPT\u002FNExT-GPT)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002F1ca8b1601858a12830.gradio.live\u002F)\n\n\n+ **CoDi: Any-to-Any Generation via Composable Diffusion** (19 May 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Zineng Tang, Ziyi Yang, Chenguang Zhu, et al.\u003C\u002Fsummary>Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11846)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-41-blue.svg?paper=9f411fda2ad5b141a3115f707bcf5ee865b3fb94)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9f411fda2ad5b141a3115f707bcf5ee865b3fb94)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fi-Code.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fi-Code\u002Ftree\u002Fmain\u002Fi-Code-V3)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcodi-gen.github.io\u002F)\n\n\n\n### Non-LLM-based\n\n\n+ **DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation** (9 Jan 2024)\u003Cdetails>\u003Csummary>[CVPR 2024] Junming Chen, et al.\u003C\u002Fsummary>Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, Qifeng Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.04747)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=1370d74ff6f7857a84da952e2f4cb6f42da40615)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1370d74ff6f7857a84da952e2f4cb6f42da40615)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjeremycjm.github.io\u002Fproj\u002FDiffSHEG\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJeremyCJM\u002FDiffSHEG.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FJeremyCJM\u002FDiffSHEG) \n\n\n+ **Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners** (27 Feb 2024)\u003Cdetails>\u003Csummary>[CVPR 2024] Yazhou Xing, Yingqing He, Zeyue Tian, et al.\u003C\u002Fsummary>Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng 
Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17723)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=21a77ed349c8621d0a0ef8407eb744e3de3b13c5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FSeeing-and-Hearing%3A-Open-domain-Visual-Audio-with-Xing-He\u002Fd9822d11ae4ead1f1d32c43124a6a0eb80ea4f0c)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyzxing87\u002FSeeing-and-Hearing.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyzxing87\u002FSeeing-and-Hearing)\n\n\n\n# 📍 Multimodal Editing\n\n## Image Editing\n\n### 🔅 LLM-based\n\n\n+ **UltraEdit: Instruction-based Fine-Grained Image Editing at Scale** (7 Jul 2024)\u003Cdetails>\u003Csummary>Haozhe Zhao, Xiaojian Ma, Liang Chen, et al.\u003C\u002Fsummary> Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, Baobao Chang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.05282)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fultra-editing.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHaozheZhao\u002FUltraEdit.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FHaozheZhao\u002FUltraEdit)\n\n\n+ **TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing** (27 May 2024)\u003Cdetails>\u003Csummary>Xinyu Zhang, Mengxue Kang, Fei Wei, et al.\u003C\u002Fsummary>Xinyu Zhang, Mengxue Kang, Fei Wei, Shuang Xu, Yuhe Liu, Lin Ma\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16803)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=8342dfd84eb91cab27404497ef0570b8d9ec55d5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FTIE%3A-Revolutionizing-Text-based-Image-Editing-for-Zhang-Kang\u002F8342dfd84eb91cab27404497ef0570b8d9ec55d5)\n\n\n+ **SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models** (11 Dec 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Yuzhou Huang, Liangbin Xie, Xintao Wang, et al.\u003C\u002Fsummary> Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying 
Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06739)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=388b0f44faf0a14cc402c2554ec36a868cf59129)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F388b0f44faf0a14cc402c2554ec36a868cf59129)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fyuzhou914.github.io\u002FSmartEdit\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencentARC\u002FSmartEdit.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTencentARC\u002FSmartEdit)\n\n\n+ **Self-correcting LLM-controlled Diffusion Models** (27 Nov 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, et al.\u003C\u002Fsummary> Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16090)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=42c4315b5d2e33d7d9a0afdf84e6a47ccd7a700e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F42c4315b5d2e33d7d9a0afdf84e6a47ccd7a700e)\n\n\n+ **Emu Edit: Precise Image Editing via Recognition and Generation Tasks** (16 Nov 2023)\u003Cdetails>\u003Csummary>[ArXiv 2023] Shelly Sheynin, Adam Polyak, Uriel Singer, et al.\u003C\u002Fsummary> Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, Yaniv Taigman\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10089)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=5bcb0153dd0840113eb27d4d6f753414ef656a03)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5bcb0153dd0840113eb27d4d6f753414ef656a03)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Femu-edit.metademolab.com\u002F)\n\n\n+ **Guiding Instruction-based Image Editing via Multimodal Large Language Models** \u003Cdetails>\u003Csummary>[ICLR 2024 (Spotlight)] Tsu-Jui Fu, Wenze Hu, Xianzhi Du, et al.\u003C\u002Fsummary> Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17102v1)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=092245d86b77181c36f972b1b7a17a59cd989c4a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F092245d86b77181c36f972b1b7a17a59cd989c4a)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmllm-ie.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftsujuifu\u002Fpytorch_mgie.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ftsujuifu\u002Fpytorch_mgie)\n\n\n\n\n+ **ChatEdit: Towards Multi-turn Interactive Facial Image Editing via Dialogue** (20 Mar 2023)\u003Cdetails>\u003Csummary>[EMNLP 2023] Xing Cui, Zekun Li, Peipei Li, et al.\u003C\u002Fsummary> Xing Cui, Zekun Li, Peipei Li, Yibo Hu, Hailin Shi, Zhaofeng 
He\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.11108)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=5a185965ad1e87367d044b47043706d00b85b007)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5a185965ad1e87367d044b47043706d00b85b007)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcuixing100876\u002FChatEdit.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fcuixing100876\u002FChatEdit)\n\n\n\n\n+ **HIVE: Harnessing Human Feedback for Instructional Visual Editing** (16 Mar 2023)\u003Cdetails>\u003Csummary>Shu Zhang, Xinyi Yang, Yihao Feng, et al.\u003C\u002Fsummary> Shu Zhang, Xinyi Yang, Yihao Feng, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, Caiming Xiong, Ran Xu.\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.09618)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-28-blue.svg?paper=372bc41602bbd21f192305775f0a58de9880e454)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F372bc41602bbd21f192305775f0a58de9880e454)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fshugerdou.github.io\u002Fhive\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsalesforce\u002FHIVE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FHIVE)\n\n\n+ **Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models** (8 Mar 2023) \u003Cdetails>\u003Csummary>Chenfei Wu, Shengming Yin, Weizhen Qi, et al.\u003C\u002Fsummary> Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.04671)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-337-blue.svg?paper=af997821231898a5f8d0fd78dad4eec526acabe5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Faf997821231898a5f8d0fd78dad4eec526acabe5)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmoymix\u002FTaskMatrix.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmoymix\u002FTaskMatrix)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002Fvisual_chatgpt)\n\n+ **InstructPix2Pix: Learning to Follow Image Editing Instructions** (17 Nov 2022)\\\n[CVPR 2023 (Highlight)] Tim Brooks, Aleksander Holynski, Alexei A. 
Efros.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.09800)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-582-blue.svg?paper=a2d2bbe4c542173662a444b33b76c66992697830)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fa2d2bbe4c542173662a444b33b76c66992697830)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.timothybrooks.com\u002Finstruct-pix2pix)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftimothybrooks\u002Finstruct-pix2pix.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ftimothybrooks\u002Finstruct-pix2pix)\n\n\n\n### Non-LLM-based (CLIP\u002FT5)\n\n+ **SeedEdit: Align Image Re-Generation to Image Editing** (11 Nov 2024)\\\nYichun Shi, Peng Wang, Weilin Huang \\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.06686)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fteam.doubao.com\u002Fen\u002Fspecial\u002Fseededit)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FByteDance\u002FSeedEdit-APP)\n\n\n+ **DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing** (4 Feb 2024)\u003Cdetails>\u003Csummary>[CVPR 2024] Chong Mou, Xintao Wang, Jiechong Song, et al.\u003C\u002Fsummary>Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang. \u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.02583)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=198b3d809594a76bc473927af37b858132ac7fdd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F198b3d809594a76bc473927af37b858132ac7fdd)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMC-E\u002FDragonDiffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMC-E\u002FDragonDiffusion)\n\n\n\n+ **ZONE: Zero-Shot Instruction-Guided Local Editing** (28 Dec 2023)\u003Cdetails>\u003Csummary>Shanglin Li, Bohan Zeng, Yutang Feng, et al.\u003C\u002Fsummary>Shanglin Li, Bohan Zeng, Yutang Feng, Sicheng Gao, Xuhui Liu, Jiaming Liu, Li Lin, Xu Tang, Yao Hu, Jianzhuang Liu, Baochang Zhang. \u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.16794)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=05eb2ad3af471c05a24abbf70258688e579cdf22)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F05eb2ad3af471c05a24abbf70258688e579cdf22)\n\n\n\n\n+ **Watch Your Steps: Local Image and Scene Editing by Text Instructions** (17 Aug 2023)\u003Cdetails>\u003Csummary>Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, et al.\u003C\u002Fsummary>Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski. 
\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.08947)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=737ad8905228cd410e3342b5cceefd4feb57d166)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F737ad8905228cd410e3342b5cceefd4feb57d166)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fashmrz.github.io\u002FWatchYourSteps\u002F)\n\n\n+ **DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models** (5 Jul 2023)\u003Cdetails>\u003Csummary>[ICLR 2024] Chong Mou, Xintao Wang, Jiechong Song, et al.\u003C\u002Fsummary>Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang. \u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.02421)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-36-blue.svg?paper=2cfaa5b3571d3b75f040f6d639359a3c673f5561)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2cfaa5b3571d3b75f040f6d639359a3c673f5561)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmc-e.github.io\u002Fproject\u002FDragonDiffusion\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMC-E\u002FDragonDiffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMC-E\u002FDragonDiffusion)\n\n\n+ **Differential Diffusion: Giving Each Pixel Its Strength** (1 Jun 2023)\\\n[Arxiv 2023] Eran Levin, Ohad Fried\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00950)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=6e5760e5d4b468bbf01a95a6f64bd65c3aa3d798)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6e5760e5d4b468bbf01a95a6f64bd65c3aa3d798)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdifferential-diffusion.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fexx8\u002Fdifferential-diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fexx8\u002Fdifferential-diffusion)\n\n\n+ **Visual Instruction Inversion: Image Editing via Visual Prompting** (26 Jul 2023)\u003Cdetails>\u003Csummary>[ArXiv 2023] Thao Nguyen, Yuheng Li, Utkarsh Ojha, et al.\u003C\u002Fsummary> Thao Nguyen, Yuheng Li, Utkarsh Ojha, Yong Jae Lee. 
\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.14331)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=f4c62aa336de45273e0fdfcfbd65b3c2e552ad56)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff4c62aa336de45273e0fdfcfbd65b3c2e552ad56)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fthaoshibe.github.io\u002Fvisii\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthaoshibe\u002Fvisii.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthaoshibe\u002Fvisii)\n\n+ **MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing** (17 Apr 2023)\u003Cdetails>\u003Csummary>[ICCV 2023] Mingdeng Cao, Xintao Wang, Zhongang Qi, et al.\u003C\u002Fsummary> Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, Yinqiang Zheng. \u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08465)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-102-blue.svg?paper=85963807c11abe38e9a2797d9860e012238607ef)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F85963807c11abe38e9a2797d9860e012238607ef)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fljzycmd.github.io\u002Fprojects\u002FMasaCtrl\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencentARC\u002FMasaCtrl.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTencentARC\u002FMasaCtrl)\n\n\n+ **PAIR-Diffusion: A Comprehensive Multimodal Object-Level Image Editor** (30 Mar 2023)\u003Cdetails>\u003Csummary>[ArXiv 2023] Vidit Goel, Elia Peruzzo, Yifan Jiang, et al.\u003C\u002Fsummary> Vidit Goel, Elia Peruzzo, Yifan Jiang, Dejia Xu, Xingqian Xu, Nicu Sebe, Trevor Darrell, Zhangyang Wang, Humphrey Shi. \u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.17546)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=c614a4da924466f62ca39002af425c9d14d240a3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc614a4da924466f62ca39002af425c9d14d240a3)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvidit98.github.io\u002Fpublication\u002Fconference-paper\u002Fpair_diff.html)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPicsart-AI-Research\u002FPAIR-Diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPicsart-AI-Research\u002FPAIR-Diffusion)\n\n\n\n+ **Zero-shot Image-to-Image Translation** (6 Feb 2023)\u003Cdetails>\u003Csummary>[SIGGRAPH 2023] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, et al.\u003C\u002Fsummary> Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, Jun-Yan Zhu. 
\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.03027)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-159-blue.svg?paper=daf61010eee0fbf6f9bab7db71c395ffca6f3ff3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdaf61010eee0fbf6f9bab7db71c395ffca6f3ff3)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fpix2pixzero.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpix2pixzero\u002Fpix2pix-zero.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fpix2pixzero\u002Fpix2pix-zero)\n\n+ **SINE: SINgle Image Editing with Text-to-Image Diffusion Models** (8 Dec 2022)\u003Cdetails>\u003Csummary>[CVPR 2023] Zhixing Zhang, Ligong Han, Arnab Ghosh, et al.\u003C\u002Fsummary> Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris Metaxas, Jian Ren. \u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.04489)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-69-blue.svg?paper=a6ad30123bef4b19ee40c3d63cfabf00d211f0ef)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fa6ad30123bef4b19ee40c3d63cfabf00d211f0ef)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fzhang-zx.github.io\u002FSINE\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhang-zx\u002FSINE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzhang-zx\u002FSINE)\n\n+ **Interactive Image Manipulation with Complex Text Instructions** (25 Nov 2022)\u003Cdetails>\u003Csummary>[WACV 2023] Ryugo Morita, Zhiqiang Zhang, Man M. Ho, et al.\u003C\u002Fsummary> Ryugo Morita, Zhiqiang Zhang, Man M. Ho, Jinjia Zhou. \u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.15352)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=387144d293567408c363313aac971294e7ec8547)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F387144d293567408c363313aac971294e7ec8547)\n\n\n+ **Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation** (22 Nov 2022)\u003Cdetails>\u003Csummary>[CVPR 2023] Narek Tumanyan, Michal Geyer, Shai Bagon, et al.\u003C\u002Fsummary> Narek Tumanyan, Michal Geyer, Shai Bagon, Tali Dekel. 
\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.12572)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-224-blue.svg?paper=b000d6865db824af1563708fb7a545ddd65c6b3a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb000d6865db824af1563708fb7a545ddd65c6b3a)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fpnp-diffusion.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMichalGeyer\u002Fplug-and-play.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMichalGeyer\u002Fplug-and-play)\n\n+ **Imagic: Text-Based Real Image Editing with Diffusion Models** (17 Oct 2022)\u003Cdetails>\u003Csummary>[CVPR 2023] Bahjat Kawar, Shiran Zada, Oran Lang, et al.\u003C\u002Fsummary> Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani. \u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.09276)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-496-blue.svg?paper=23e261a20a315059b4de5492ed071c97a20c12e7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F23e261a20a315059b4de5492ed071c97a20c12e7)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fimagic-editing.github.io\u002F)\n\u003C!-- [![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpix2pixzero\u002Fpix2pix-zero.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fpix2pixzero\u002Fpix2pix-zero) -->\n\n\n\n+ **Null-text Inversion for Editing Real Images using Guided Diffusion Models**\u003Cdetails>\u003Csummary>[ICLR 2023] Ron Mokady, Amir Hertz, Kfir Aberman, et al.\u003C\u002Fsummary> Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, Daniel Cohen-Or. \u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.09794)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=4de94949daf9bc8dd0e5161d20dfe83198d20ec1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4de94949daf9bc8dd0e5161d20dfe83198d20ec1)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fnull-text-inversion.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle\u002Fprompt-to-prompt.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fprompt-to-prompt\u002F#null-text-inversion-for-editing-real-images)\n\n\n+ **Prompt-to-Prompt Image Editing with Cross Attention Control** \u003Cdetails>\u003Csummary>[ICLR 2023] Amir Hertz, Ron Mokady, Jay Tenenbaum, et al.\u003C\u002Fsummary> Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen-Or. 
\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.01626)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-717-blue.svg?paper=04e541391e8dce14d099d00fb2c21dbbd8afe87f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F04e541391e8dce14d099d00fb2c21dbbd8afe87f)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fprompt-to-prompt.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle\u002Fprompt-to-prompt.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fprompt-to-prompt)\n\n\n+ **DiffEdit: Diffusion-based semantic image editing with mask guidance** (20 Oct 2022) \u003Cdetails>\u003Csummary>[ICLR 2023] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, et al.\u003C\u002Fsummary> Guillaume Couairon, Jakob Verbeek, Holger Schwenk, Matthieu Cord. \u003C\u002Fdetails> \n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.11427)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-208-blue.svg?paper=064ccebc03d3afabaae30fe29a457c1cfcdff7e3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F064ccebc03d3afabaae30fe29a457c1cfcdff7e3)\n\u003C!-- [![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fashmrz.github.io\u002FWatchYourSteps\u002F) -->\n\u003C!-- [![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FXiang-cd\u002FDiffEdit-stable-diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FXiang-cd\u002FDiffEdit-stable-diffusion) -->\n\n\n\n+ **DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation** (6 Oct 2021)\\\n[CVPR 2022] Gwanghyun Kim, Taesung Kwon, Jong Chul Ye.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.02711)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-318-blue.svg?paper=8f8dedb511c0324d1cb7f9750560109ca9290b5f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8f8dedb511c0324d1cb7f9750560109ca9290b5f)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgwang-kim\u002FDiffusionCLIP.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgwang-kim\u002FDiffusionCLIP)\n\n\n+ **SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations** (2 Aug 2021)\u003Cdetails>\u003Csummary>[ICLR 2022] Chenlin Meng, Yutong He, Yang Song, et al.\u003C\u002Fsummary> Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon. 
\u003C\u002Fdetails> \n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.01073)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-604-blue.svg?paper=f671a09e3e5922e6d38cb77dda8d76d5ceac2a27)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff671a09e3e5922e6d38cb77dda8d76d5ceac2a27)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsde-image-editing.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fermongroup\u002FSDEdit.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fermongroup\u002FSDEdit)\n\n## Video Editing\n\n### 🔅 LLM-based\n\n\n+ **Consistent Video-to-Video Transfer Using Synthetic Dataset** (1 Nov 2023)\\\nJiaxin Cheng, Tianjun Xiao, Tong He.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00213)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=e8bbffb8413cb1f88e99a7ecbabd21a6eac82271)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe8bbffb8413cb1f88e99a7ecbabd21a6eac82271)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Famazon-science\u002Finstruct-video-to-video.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Famazon-science\u002Finstruct-video-to-video)\n\n+ **InstructVid2Vid: Controllable Video Editing with Natural Language Instructions** (21 May 2023)\u003Cdetails>\u003Csummary>Bosheng Qin, Juncheng Li, Siliang Tang, et al.\u003C\u002Fsummary>Bosheng Qin, Juncheng Li, Siliang Tang, Tat-Seng Chua, Yueting Zhuang. \u003C\u002Fdetails> [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.12328)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=205d2ed0906440f07a0275d7d6a63bced60951fc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F205d2ed0906440f07a0275d7d6a63bced60951fc)\n\u003C!-- [![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fduyguceylan\u002Fpix2video.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fduyguceylan\u002Fpix2video) -->\n\n### Non-LLM-based (Clip\u002FT5)\n\n\n\n+ **AudioScenic: Audio-Driven Video Scene Editing** (25 Apr 2024)\u003Cdetails>\u003Csummary>Kaixin Shen, Ruijie Quan, Linchao Zhu, et al.\u003C\u002Fsummary>Kaixin Shen, Ruijie Quan, Linchao Zhu, Jun Xiao, Yi Yang\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.16581)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=6fa898ed5e58ade17a020e3251687b811ff1d023)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FAudioScenic%3A-Audio-Driven-Video-Scene-Editing-Shen-Quan\u002F6fa898ed5e58ade17a020e3251687b811ff1d023)\n\n\n+ **LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation** (1 Nov 2023)\u003Cdetails>\u003Csummary>Yuxiang Bao, Di Qiu, Guoliang Kang, et al.\u003C\u002Fsummary>Yuxiang Bao, Di Qiu, Guoliang Kang, Baochang Zhang, Bo Jin, Kaiye Wang, Pengfei Yan. 
\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00353)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=1b4323a5324ee20fe9b2ff2a65ec26550a51ec2c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1b4323a5324ee20fe9b2ff2a65ec26550a51ec2c)\n\u003C!-- [![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdiffusion-tokenflow.github.io\u002F) -->\n\u003C!-- [![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fomerbt\u002FTokenFlow.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fomerbt\u002FTokenFlow) -->\n\n+ **MagicStick: Controllable Video Editing via Control Handle Transformations** (5 Dec 2023)\u003Cdetails>\u003Csummary>Yue Ma, Xiaodong Cun, Yingqing He, et al.\u003C\u002Fsummary>Yue Ma, Xiaodong Cun, Yingqing He, Chenyang Qi, Xintao Wang, Ying Shan, Xiu Li, Qifeng Chen\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03047)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=xxx)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fxxx)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmagic-stick-edit.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmayuelala\u002FMagicStick.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmayuelala\u002FMagicStick)\n\n\n+ **MagicEdit: High-Fidelity Temporally Coherent Video Editing** (28 Aug 2023) \u003Cdetails>\u003Csummary>Jun Hao Liew, Hanshu Yan, Jianfeng Zhang, et al.\u003C\u002Fsummary>Jun Hao Liew, Hanshu Yan, Jianfeng Zhang, Zhongcong Xu, Jiashi Feng. \u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.14749)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-20-blue.svg?paper=8819777e104f8c4197c262e11a01b070b50007aa)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8819777e104f8c4197c262e11a01b070b50007aa)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmagic-edit.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmagic-research\u002Fmagic-edit.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmagic-research\u002Fmagic-edit)\n\n\n+ **StableVideo: Text-driven Consistency-aware Diffusion Video Editing** (18 Aug 2023)\u003Cdetails>\u003Csummary>[ICCV 2023] Wenhao Chai, Xun Guo, Gaoang Wang, et al.\u003C\u002Fsummary>Wenhao Chai, Xun Guo, Gaoang Wang, Yan Lu. 
\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.09592)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-41-blue.svg?paper=05cbac9a5101f47a6fabad72398616506572c9fa)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F05cbac9a5101f47a6fabad72398616506572c9fa)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frese1f\u002FStableVideo.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Frese1f\u002FStableVideo)\n\n\n+ **CoDeF: Content Deformation Fields for Temporally Consistent Video Processing** (15 Aug 2023) \u003Cdetails>\u003Csummary>Hao Ouyang, Qiuyu Wang, Yuxi Xiao, et al.\u003C\u002Fsummary>Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen. \u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.07926)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-28-blue.svg?paper=c2d65fc3a7fde3f7662c6ef9448e5737d7e5551f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc2d65fc3a7fde3f7662c6ef9448e5737d7e5551f)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fqiuyu96.github.io\u002FCoDeF\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fqiuyu96\u002FCoDeF.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fqiuyu96\u002FCoDeF)\n\n\n+ **TokenFlow: Consistent Diffusion Features for Consistent Video Editing** (19 Jul 2023) \u003Cdetails>\u003Csummary>Michal Geyer, Omer Bar-Tal, Shai Bagon, et al.\u003C\u002Fsummary>Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel. \u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.10373)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-80-blue.svg?paper=4761f173965195798cd3046ef4af608a83504e4d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4761f173965195798cd3046ef4af608a83504e4d)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdiffusion-tokenflow.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fomerbt\u002FTokenFlow.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fomerbt\u002FTokenFlow)\n\n+ **Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation** (13 Jun 2023)\u003Cdetails>\u003Csummary>Shuai Yang, Yifan Zhou, Ziwei Liu, et al.\u003C\u002Fsummary>Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy. 
\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.07954)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-74-blue.svg?paper=1e09b83fe064826a9a1ac61a7bdc00f26be41aee)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1e09b83fe064826a9a1ac61a7bdc00f26be41aee)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.mmlab-ntu.com\u002Fproject\u002Frerender\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwilliamyang1991\u002FRerender_A_Video.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fwilliamyang1991\u002FRerender_A_Video)\n\n+ **ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing** (26 May 2023)\u003Cdetails>\u003Csummary>Min Zhao, Rongzhen Wang, Fan Bao, et al.\u003C\u002Fsummary>Min Zhao, Rongzhen Wang, Fan Bao, Chongxuan Li, Jun Zhu. \u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17098)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=14acc36d8c87f31f8dcbbf8433b91af70a2a516a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F14acc36d8c87f31f8dcbbf8433b91af70a2a516a)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fml.cs.tsinghua.edu.cn\u002Fcontrolvideo\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002Fcontrolvideo.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002Fcontrolvideo)\n\n\n\n+ **Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts** (15 May 2023)\\\nYuyang Zhao, Enze Xie, Lanqing Hong, Zhenguo Li, Gim Hee Lee.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.08850)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=5f51eda9f7abddca027941d50fb0b6bf6f508eff)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5f51eda9f7abddca027941d50fb0b6bf6f508eff)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmake-a-protagonist.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHeliosZhao\u002FMake-A-Protagonist.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FHeliosZhao\u002FMake-A-Protagonist)\n\n\n+ **Pix2Video: Video Editing using Image Diffusion** (22 Mar 2023)\\\n[ICCV 2023] Duygu Ceylan, Chun-Hao P. Huang, Niloy J. 
Mitra.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12688)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-92-blue.svg?paper=32a3c2fbd3e733bd0eea938517fec2ff8dc7c701)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F32a3c2fbd3e733bd0eea938517fec2ff8dc7c701)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fduyguceylan.github.io\u002Fpix2video.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fduyguceylan\u002Fpix2video.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fduyguceylan\u002Fpix2video)\n\n\n+ **FateZero: Fusing Attentions for Zero-shot Text-based Video Editing** (16 Mar 2023)\u003Cdetails>\u003Csummary>[ICCV 2023] Chenyang Qi, Xiaodong Cun, Yong Zhang, et al.\u003C\u002Fsummary>Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, Qifeng Chen. \u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.09535)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-133-blue.svg?paper=14ccb8bcceb6de10eda6ad08bec242a4f2946497)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F14ccb8bcceb6de10eda6ad08bec242a4f2946497)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ffate-zero-edit.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChenyangQiQi\u002FFateZero.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FChenyangQiQi\u002FFateZero)\n\n+ **Video-P2P: Video Editing with Cross-attention Control** (8 Mar 2023)\u003Cdetails>\u003Csummary>Shaoteng Liu, Yuechen Zhang, Wenbo Li, et al.\u003C\u002Fsummary>Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, Jiaya Jia. \u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.04761)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-81-blue.svg?paper=6283502d6900a0b403e2454b1cb1cf16ddefd5a7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6283502d6900a0b403e2454b1cb1cf16ddefd5a7)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvideo-p2p.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FShaoTengLiu\u002FVideo-P2P.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FShaoTengLiu\u002FVideo-P2P)\n\n+ **Dreamix: Video Diffusion Models are General Video Editors** (2 Feb 2023)\u003Cdetails>\u003Csummary>Eyal Molad, Eliahu Horwitz, Dani Valevski, et al.\u003C\u002Fsummary>Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, Yedid Hoshen. 
\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.01329)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-107-blue.svg?paper=9758ddd6ffbaac75aa0447a9664e6989811a05e2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9758ddd6ffbaac75aa0447a9664e6989811a05e2)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdreamix-video-editing.github.io\u002F)\n\n\n+ **Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation** (22 Dec 2022)\u003Cdetails>\u003Csummary>[ICCV 2023] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al.\u003C\u002Fsummary>Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou. \u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.11565)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-275-blue.svg?paper=1367dcff4ccb927a5e95c452041288b3f0dd0eff)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1367dcff4ccb927a5e95c452041288b3f0dd0eff)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ftuneavideo.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002FTune-A-Video.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FTune-A-Video)\n\n\n+ **M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers** (2 Apr 2021)\u003Cdetails>\u003Csummary>[CVPR 2022] Tsu-Jui Fu, Xin Eric Wang, Scott T. Grafton, et al.\u003C\u002Fsummary>Tsu-Jui Fu, Xin Eric Wang, Scott T. Grafton, Miguel P. Eckstein, William Yang Wang. \u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.01122)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=81349524489f8ba0812ac2529eac92ec45959782)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F81349524489f8ba0812ac2529eac92ec45959782)\n\n## 3D Editing\n\n### 🔅 LLM-based\n+ **SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code** (2 Mar 2024)\u003Cdetails>\u003Csummary>Ziniu Hu, Ahmet Iscen, Aashi Jain, et al. \u003C\u002Fsummary>Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, Alireza Fathi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.01248v1)\n\n+ **3D-GPT: Procedural 3D Modeling with Large Language Models** (19 Oct 2023)\u003Cdetails>\u003Csummary>Chunyi Sun*, Junlin Han*, Weijian Deng, et al. 
\u003C\u002Fsummary>Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.12945)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=588930cdd801f335b5e524d13f99aa94136a20a0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F588930cdd801f335b5e524d13f99aa94136a20a0)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChuny1\u002F3DGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FChuny1\u002F3DGPT)\n\n### Non-LLM-based (Clip\u002FT5)\n+ **Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models** (Dec 2023)\u003Cdetails>\u003Csummary>Xianfang Zeng, Xin Chen, Zhongqi Qi, et al.\u003C\u002Fsummary>Xianfang Zeng, Xin Chen, Zhongqi Qi, Wen Liu, Zibo Zhao, Zhibin Wang, Bin Fu, Yong Liu, Gang Yu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13913)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=e90883da4ee8c947a8b97422c95bde905a257a74)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe90883da4ee8c947a8b97422c95bde905a257a74)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenTexture\u002FPaint3D?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenTexture\u002FPaint3D)\n\n+ **3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation** (16 Nov 2023)\u003Cdetails>\u003Csummary>Dale Decatur, Itai Lang, Kfir Aberman, et al.\u003C\u002Fsummary>Dale Decatur, Itai Lang, Kfir Aberman, Rana Hanocka\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.09571)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=496bdd2804a231a3336463fca8e0a4c6a46f0304)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F496bdd2804a231a3336463fca8e0a4c6a46f0304)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthreedle\u002F3d-paintbrush?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthreedle\u002F3d-paintbrush)\n\n+ **Blending-NeRF: Text-Driven Localized Editing in Neural Radiance Fields** (23 Aug 2023)\u003Cdetails>\u003Csummary>Hyeonseop Song, Seokhun Choi, Hoseok Do, et al. 
\u003C\u002Fsummary>Hyeonseop Song, Seokhun Choi, Hoseok Do, Chul Lee, Taehyeong Kim\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.11974)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=bf7f31e07d9b128a0f555c275bc3fdb851f725b8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbf7f31e07d9b128a0f555c275bc3fdb851f725b8)\n\n+ **SINE: Semantic-driven Image-based NeRF Editing with Prior-guided Editing Field** (23 Mar 2023)\u003Cdetails>\u003Csummary>[CVPR 2023] Chong Bao, Yinda Zhang, Bangbang Yang, et al.\u003C\u002Fsummary>Chong Bao, Yinda Zhang, Bangbang Yang, Tianxing Fan, Zesong Yang, Hujun Bao, Guofeng Zhang, Zhaopeng Cui\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.13277)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-44-blue.svg?paper=222c47b81fe04598fd84fe8b9a43f694415ec7e9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F222c47b81fe04598fd84fe8b9a43f694415ec7e9)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzju3dv\u002FSINE?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzju3dv\u002FSINE)\n\n+ **TextDeformer: Geometry Manipulation using Text Guidance** (26 Apr 2023)\u003Cdetails>\u003Csummary>[SIGGRAPH 2023] William Gao, Noam Aigerman, Thibault Groueix, et al.\u003C\u002Fsummary>William Gao, Noam Aigerman, Thibault Groueix, Vladimir G. Kim, Rana Hanocka\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.13348)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-21-blue.svg?paper=4974186c3b5b50112cfd909de115d5fbe25411fd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4974186c3b5b50112cfd909de115d5fbe25411fd)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthreedle\u002FTextDeformer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthreedle\u002FTextDeformer)\n\n+ **Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions** (22 Mar 2023)\u003Cdetails>\u003Csummary>[ICCV 2023] Ayaan Haque, Matthew Tancik, Alexei A. Efros, et al. \u003C\u002Fsummary>Ayaan Haque, Matthew Tancik, Alexei A. Efros, Aleksander Holynski, Angjoo Kanazawa\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12789)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-131-blue.svg?paper=26c22380282a00166273038bc5ba785d845d61ad)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F26c22380282a00166273038bc5ba785d845d61ad)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fayaanzhaque\u002Finstruct-nerf2nerf.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fayaanzhaque\u002Finstruct-nerf2nerf)\n\n+ **DreamEditor: Text-Driven 3D Scene Editing with Neural Fields** (23 Jun 2023)\u003Cdetails>\u003Csummary>[SIGGRAPH Asia 2023] Jingyu Zhuang, Chen Wang, Lingjie Liu, et al. 
\u003C\u002Fsummary>Jingyu Zhuang, Chen Wang, Lingjie Liu, Liang Lin, Guanbin Li\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.13455)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-39-blue.svg?paper=029f3e2c215edac138be26ade67b3d70b8f74dd7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F029f3e2c215edac138be26ade67b3d70b8f74dd7)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzjy526223908\u002FDreamEditor.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzjy526223908\u002FDreamEditor)\n\n+ **SKED: Sketch-guided Text-based 3D Editing** (19 Mar 2023)\u003Cdetails>\u003Csummary>[ICCV 2023] Aryan Mikaeili, Or Perel, Mehdi Safaee, et al.\u003C\u002Fsummary>Aryan Mikaeili, Or Perel, Mehdi Safaee, Daniel Cohen-Or, Ali Mahdavi-Amiri\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.10735)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-27-blue.svg?paper=6ebec1ece44daa090158ff2531d6fabb94a4e683)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6ebec1ece44daa090158ff2531d6fabb94a4e683)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Faryanmikaeili\u002FSKED.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Faryanmikaeili\u002FSKED)\n\n+ **Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields** (22 Jun 2023)\u003Cdetails>\u003Csummary>[ICCVW 2023] Ori Gordon, Omri Avrahami, Dani Lischinski.\u003C\u002Fsummary>Ori Gordon, Omri Avrahami, Dani Lischinski\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.12760)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=3a5d4352d3dd53148a9544233bb59f88d2504910)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3a5d4352d3dd53148a9544233bb59f88d2504910)\n\n\n+ **ClipFace: Text-guided Editing of Textured 3D Morphable Models** (2 Dec 2022)\u003Cdetails>\u003Csummary>[SIGGRAPH 2023] Shivangi Aneja, Justus Thies, Angela Dai, et al. \u003C\u002Fsummary>Shivangi Aneja, Justus Thies, Angela Dai, Matthias Nießner\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.01406)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-31-blue.svg?paper=f21e8eddf42580d1f38a11ec5acd8891c0454a1f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff21e8eddf42580d1f38a11ec5acd8891c0454a1f)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshivangi-aneja\u002FClipFace.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshivangi-aneja\u002FClipFace)\n\n+ **CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields** (9 Dec 2021)\u003Cdetails>\u003Csummary>[CVPR 2022] Can Wang, Menglei Chai, Mingming He, et al. 
\u003C\u002Fsummary>Can Wang, Menglei Chai, Mingming He, Dongdong Chen, Jing Liao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.05139)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-244-blue.svg?paper=0483be6c3ec6cd41ffe248f86effc7468d3ac7be)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0483be6c3ec6cd41ffe248f86effc7468d3ac7be)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FcassiePython\u002FCLIPNeRF.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FcassiePython\u002FCLIPNeRF)\n\n\n\n\n\n## Audio Editing\n\n### 🔅 LLM-based\n\n+ **Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing** (19 Oct 2023)\u003Cdetails>\u003Csummary>Yixiao Zhang, Akira Maezawa, Gus Xia, et al.\u003C\u002Fsummary>Yixiao Zhang, Akira Maezawa, Gus Xia, Kazuhiko Yamamoto, Simon Dixon\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.12404)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=cca4218dd7c10c1614bbd84aa7cd7e00027bdc7c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcca4218dd7c10c1614bbd84aa7cd7e00027bdc7c)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsites.google.com\u002Fview\u002Floop-copilot)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fldzhangyx\u002Floop-copilot.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fldzhangyx\u002Floop-copilot\u002F)\n\n+ **UniAudio: An Audio Foundation Model Toward Universal Audio Generation** (1 Oct 2023)\\\nDongchao Yang, Jinchuan Tian, Xu Tan, et al.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00704)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-15-blue.svg?paper=74bfbbb7307a7af2686043ea97ab8b34cb062ba8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F74bfbbb7307a7af2686043ea97ab8b34cb062ba8)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdongchaoyang.top\u002FUniAudio_demo\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangdongchao\u002FUniAudio.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyangdongchao\u002FUniAudio)\n\n### Non-LLM-based (Clip\u002FT5)\n\n# 📍 Multimodal Agents\n\n+ **LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing** (1 Nov 2023) \u003Cdetails>\u003Csummary>Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, et al.\u003C\u002Fsummary> Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, Chunyuan 
Li\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00571)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=c020f15be1dee20f9e2e0c5a6f05f272b5508325)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc020f15be1dee20f9e2e0c5a6f05f272b5508325)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllava-vl.github.io\u002Fllava-interactive)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLaVA-VL\u002FLLaVA-Interactive-Demo.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Interactive-Demo)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fllavainteractive.ngrok.app\u002F)\\\n**Tags:** `Image Chat` `Image Segmentation` `Image Generation` `Image Editing`\n\n+ **ControlLLM: Augment Language Models with Tools by Searching on Graphs** (26 Oct 2023) \u003Cdetails>\u003Csummary>Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, et al.\u003C\u002Fsummary>Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, Wenhai Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.17796)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=288e7224d53d68669eb67f2496e068dc965c639e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F288e7224d53d68669eb67f2496e068dc965c639e)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcontrolllm.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FControlLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FControlLLM)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fcllm.opengvlab.com\u002F)\\\n**Tags:** `Image Understanding` `Image Generation` `Image Editing` `Video Understanding` `Video Generation` `Video Editing` `Audio Understanding` `Audio Generation`\n\n+ **ImageBind-LLM: Multi-modality Instruction Tuning** (7 Sep 2023) \u003Cdetails>\u003Csummary>Jiaming Han, Renrui Zhang, Wenqi Shao, et al.\u003C\u002Fsummary>Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.03905)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-33-blue.svg?paper=54c68b8623505dc6bf7a0b08aaa77ca9165f2d7f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F54c68b8623505dc6bf7a0b08aaa77ca9165f2d7f)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FLLaMA-Adapter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FLLaMA-Adapter)\\\n**Modalities:** `text` `image` `video` `audio` `point cloud`\n\n+ **ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models** (2 Sep 2023) \u003Cdetails>\u003Csummary>Chenliang Li, Hehong Chen, Ming Yan, et al.\u003C\u002Fsummary>Chenliang Li, Hehong Chen, Ming Yan, Weizhou Shen, Haiyang Xu, Zhikai 
Wu, Zhicheng Zhang, Wenmeng Zhou, Yingda Chen, Chen Cheng, Hongzhu Shi, Ji Zhang, Fei Huang, Jingren Zhou\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.00986)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=e2f1f04f648a8863d11439aa4c80ee65d6caccda)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe2f1f04f648a8863d11439aa4c80ee65d6caccda)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmodelscope\u002Fmodelscope-agent.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fmodelscope-agent)\n\n+ **InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language** (9 May 2023) \u003Cdetails>\u003Csummary>Zhaoyang Liu, Yinan He, Wenhai Wang, et al.\u003C\u002Fsummary>Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, Limin Wang, Ping Luo, Jifeng Dai, Yu Qiao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.05662)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-40-blue.svg?paper=54a8b153ed04a872da878d695239bdc413dc782c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F54a8b153ed04a872da878d695239bdc413dc782c)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternGPT)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Figpt.opengvlab.com\u002F)\\\n**Condition Modality:** `text` `image` `video` `audio`\n\n+ **HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face** (30 Mar 2023) \u003Cdetails>\u003Csummary>Yongliang Shen, Kaitao Song, Xu Tan, et al.\u003C\u002Fsummary>Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.17580)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-413-blue.svg?paper=d1120d67b700e4dfe8b39eb1e48fbdea4e1a0c43)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd1120d67b700e4dfe8b39eb1e48fbdea4e1a0c43)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FJARVIS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FJARVIS)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FHuggingGPT)\n\n+ **Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models** (8 Mar 2023) \u003Cdetails>\u003Csummary>Chenfei Wu, Shengming Yin, Weizhen Qi, et al.\u003C\u002Fsummary>Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan 
Duan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.04671)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-337-blue.svg?paper=af997821231898a5f8d0fd78dad4eec526acabe5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Faf997821231898a5f8d0fd78dad4eec526acabe5)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmoymix\u002FTaskMatrix.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmoymix\u002FTaskMatrix)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002Fvisual_chatgpt)\n\n+ **AutoGPT: build & use AI agents**\\\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fnews.agpt.co\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSignificant-Gravitas\u002FAutoGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FSignificant-Gravitas\u002FAutoGPT)\n\n\n# 📍 Multimodal Understanding with LLMs\n## Multiple modalities\n+ **Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities**  (9 Nov 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] AJ Piergiovanni, Isaac Noble, Dahun Kim, et al.\u003C\u002Fsummary>AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.05698)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=a4e7199e725b34ae5ddd574057f60ebb1a2011b7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FMirasol3B%3A-A-Multimodal-Autoregressive-Model-for-Piergiovanni-Noble\u002Fa4e7199e725b34ae5ddd574057f60ebb1a2011b7)\n`text, video, audio`\n\n\n## Image Understanding\n\n+ **Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions** (11 Jun 2024)\u003Cdetails>\u003Csummary>Renjie Pi, Jianshu Zhang, Jipeng Zhang, et al.\u003C\u002Fsummary> Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, Tong Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07502)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=91b4f447bb06d081a7947b42df57491a04fa46f9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F91b4f447bb06d081a7947b42df57491a04fa46f9)\n\n\n\n+ **T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text** (11 Jun 2024)\u003Cdetails>\u003Csummary>[ACL 2024] Aoxiong Yin, Haoyuan Li, Kai Shen, et al.\u003C\u002Fsummary> Aoxiong Yin, Haoyuan Li, Kai Shen, Siliang Tang, Yueting Zhuang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07119)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=186910d697bf7eb605aa055aee78fd91ce3ce9fe)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F186910d697bf7eb605aa055aee78fd91ce3ce9fe)\n\n\n+ **Open-World Human-Object Interaction Detection via Multi-modal Prompts** (11 Jun 2024)\u003Cdetails>\u003Csummary>Jie Yang, Bingliang Li, Ailing Zeng, et al.\u003C\u002Fsummary>Jie Yang, 
Bingliang Li, Ailing Zeng, Lei Zhang, Ruimao Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07119)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=186910d697bf7eb605aa055aee78fd91ce3ce9fe)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F186910d697bf7eb605aa055aee78fd91ce3ce9fe)\n\n\n+ **Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?** (11 Jun 2024)\u003Cdetails>\u003Csummary>Xingyu Fu, Muyu He, Yujie Lu, et al.\u003C\u002Fsummary>Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, Dan Roth\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07221v1)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=f0acf2a2293d963c3786e83bb198c75612adc446)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff0acf2a2293d963c3786e83bb198c75612adc446)\n\n+ **InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks** (21 Dec 2023)\u003Cdetails>\u003Csummary>Zhe Chen, Jiannan Wu, Wenhai Wang, et al.\u003C\u002Fsummary>Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14238)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=6a33e58ef961a3a0a5657518b2be86395eb7c8d0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6a33e58ef961a3a0a5657518b2be86395eb7c8d0)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Finternvl.opengvlab.com\u002F)\n\n+ **LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models** (28 Nov 2023)\\\nYanwei Li, Chengyao Wang, Jiaya Jia\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17043)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=486c2df78cbb770a90a55f7fa3fe19102fba2c24)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F486c2df78cbb770a90a55f7fa3fe19102fba2c24)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllama-vid.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLaMA-VID.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](http:\u002F\u002F103.170.5.190:7864\u002F)\n\n+ **CogVLM: Visual Expert for Pretrained Language Models** (6 Nov 2023)\u003Cdetails>\u003Csummary>Weihan Wang, Qingsong Lv, Wenmeng Yu, et al.\u003C\u002Fsummary>Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie 
Tang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.03079)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=3bf842dec99016da2d309ea8cbd7e25343032317)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3bf842dec99016da2d309ea8cbd7e25343032317)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FCogVLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVLM)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](http:\u002F\u002F36.103.203.44:7861\u002F)\n\n+ **MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning** (14 Oct 2023)\u003Cdetails>\u003Csummary>Jun Chen, Deyao Zhu, Xiaoqian Shen, et al.\u003C\u002Fsummary>Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.09478)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-97-blue.svg?paper=1ddbd08ad8cf22a5c66c4242194c4286328533bf)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1ddbd08ad8cf22a5c66c4242194c4286328533bf)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fminigpt-v2.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FMiniGPT-4.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVision-CAIR\u002FMiniGPT-v2)\n\n+ **OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue** (21 Jun 2023)\u003Cdetails>\u003Csummary>Weihao Gao, Zhuo Deng, Zhiyuan Niu, et al.\u003C\u002Fsummary>Weihao Gao, Zhuo Deng, Zhiyuan Niu, Fuju Rong, Chucheng Chen, Zheng Gong, Wenze Zhang, Daimin Xiao, Fang Li, Zhenjie Cao, Zhaoyi Ma, Wenbin Wei, Lan Ma\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.12174)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=0f8d12775a4685575f1489796b5dee9e11fbdfb5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0f8d12775a4685575f1489796b5dee9e11fbdfb5)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FML-AILab\u002FOphGLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FML-AILab\u002FOphGLM)\n\n\n+ **InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition** (26 Sep 2023)\u003Cdetails>\u003Csummary>Pan Zhang, Xiaoyi Dong, Bin Wang, et al.\u003C\u002Fsummary> Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi 
Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15112)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-48-blue.svg?paper=c1e450284e7d6cac1855330a1197df8537df653f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc1e450284e7d6cac1855330a1197df8537df653f)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer)\n\n+ **[LaVIT] Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization** (9 Sep 2023)\u003Cdetails>\u003Csummary>Yang Jin, Kun Xu, Kun Xu, et al.\u003C\u002Fsummary>Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, Di Zhang, Wenwu Ou, Kun Gai, Yadong Mu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04669)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=bcac614f9774488447221ebb4f16f05e3975ec1e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbcac614f9774488447221ebb4f16f05e3975ec1e)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjy0205\u002FLaVIT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FLaVIT)\n`tokenizer`\n\n+ **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond** (24 Aug 2023)\u003Cdetails>\u003Csummary>Jinze Bai, Shuai Bai, Shusheng Yang, et al.\u003C\u002Fsummary>Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.12966)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-69-blue.svg?paper=fc6a2f7478f68adefd69e2071f27e38aa1647f2f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffc6a2f7478f68adefd69e2071f27e38aa1647f2f)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-VL\u002Fblob\u002Fmaster\u002FTUTORIAL.md)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen-VL.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-VL)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fqwen\u002FQwen-VL-Chat-Demo\u002Fsummary)\n\n+ **VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks** (18 May 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023] Wenhai Wang, Zhe Chen, Xiaokang Chen, et al.\u003C\u002Fsummary>Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng 
Dai\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11175)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-134-blue.svg?paper=42a30dc5470f54ec249f25d3c31e05d7c376c8e3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F42a30dc5470f54ec249f25d3c31e05d7c376c8e3)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FVisionLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FVisionLLM)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternGPT)\n\n+ **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning** (11 May 2023)\u003Cdetails>\u003Csummary>Wenliang Dai, Junnan Li, Dongxu Li, et al.\u003C\u002Fsummary>Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06500)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-474-blue.svg?paper=8bd6a2a89503be083176f2cc26fabedb79238cbd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8bd6a2a89503be083176f2cc26fabedb79238cbd)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsalesforce\u002FLAVIS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS)\n\n+ **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models** (20 Apr 2023)\u003Cdetails>\u003Csummary>Deyao Zhu, Jun Chen, Xiaoqian Shen, et al.\u003C\u002Fsummary>Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.10592)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-649-blue.svg?paper=ca6a2bc279be5a3349a22bfd6866ed633d18734b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fca6a2bc279be5a3349a22bfd6866ed633d18734b)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fminigpt-4.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FMiniGPT-4.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVision-CAIR\u002FMiniGPT-v2)\n\n+ **Visual Instruction Tuning** (17 Apr 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023 (Oral)] Haotian Liu, et al.\u003C\u002Fsummary>Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae 
Lee\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08485)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=1a8eb2cae1833df3bf12fe3b41b03d60b4a4a98d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1a8eb2cae1833df3bf12fe3b41b03d60b4a4a98d)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllava-vl.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaotian-liu\u002FLLaVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fllava.hliu.cc\u002F)\n\n\n## Video Understanding\n\n\n\n+ **StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification**  (11 Nov 2024)\u003Cdetails>\u003Csummary>Yichen He, Yuan Lin, Jianchao Wu, et al.\u003C\u002Fsummary>Yichen He, Yuan Lin, Jianchao Wu, Hanchong Zhang, Yuchen Zhang, Ruicheng Le\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.07076)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhyc2026\u002FStoryTeller.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhyc2026\u002FStoryTeller)\n\n\n\n+ **Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding**  (22 Sep 2024)\u003Cdetails>\u003Csummary>Yan Shu, Peitian Zhang, Zheng Liu, et al.\u003C\u002Fsummary>Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, Bo Zhao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.14485)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=ad6f68db45aaebc0e61b342d03da4c2702ce5697)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVideo-XL%3A-Extra-Long-Vision-Language-Model-for-Shu-Zhang\u002Fad6f68db45aaebc0e61b342d03da4c2702ce5697)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVectorSpaceLab\u002FVideo-XL.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVectorSpaceLab\u002FVideo-XL\u002Ftree\u002Fmain)\n\n\n\n+ **Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution** (19 Sep 2024)\u003Cdetails>\u003Csummary>Zuyan Liu, Yuhao Dong, Ziwei Liu, et al.\u003C\u002Fsummary>Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12961)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=85c514db4e90e1fd4200d858353f27a3cc2c29ad)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FOryx-MLLM%3A-On-Demand-Spatial-Temporal-Understanding-Liu-Dong\u002F85c514db4e90e1fd4200d858353f27a3cc2c29ad)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Foryx-mllm.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOryx-mllm\u002FOryx.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOryx-mllm\u002FOryx?tab=readme-ov-file)\n\n\n\n+ **VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in 
Video-LLMs**  (11 Jun 2024)\u003Cdetails>\u003Csummary>Zesen Cheng, Sicong Leng, Hang Zhang, et al.\u003C\u002Fsummary>Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07476)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=7115e1f6cdd91ef09737d5a13664d9489fe27e08)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVideoLLaMA-2%3A-Advancing-Spatial-Temporal-Modeling-Cheng-Leng\u002F7115e1f6cdd91ef09737d5a13664d9489fe27e08)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideoLLaMA2.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2)\n\n+ **PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning**  (25 Apr 2024)\u003Cdetails>\u003Csummary>Lin Xu, Yilin Zhao, Daquan Zhou, et al.\u003C\u002Fsummary>Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.16994)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=9d29da83aba362c728c36f4dea9dde678ae3e2b2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FPLLaVA-%3A-Parameter-free-LLaVA-Extension-from-Images-Xu-Zhao\u002F9d29da83aba362c728c36f4dea9dde678ae3e2b2)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmagic-research\u002FPLLaVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmagic-research\u002FPLLaVA?tab=readme-ov-file)\n\n+ **MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**  (3 Dec 2023) \\\nEnxin Song, et al. \\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16449)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-16-blue.svg?paper=6f9b7c8cde1be2e62a503c31cac883c6d44c9d0d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6f9b7c8cde1be2e62a503c31cac883c6d44c9d0d)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frese1f\u002FMovieChat.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat)\n\n+ **LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models**  (28 Nov 2023) \\\nYanwei Li, et al. \\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17043)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=486c2df78cbb770a90a55f7fa3fe19102fba2c24)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F486c2df78cbb770a90a55f7fa3fe19102fba2c24)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLaMA-VID.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID)\n\n+ **Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models** (27 Nov 2023)\\\nNing, Munan, et al. 
\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16103)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=b037bb09aa162d8a543e64ec777ca0edc732d2af)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb037bb09aa162d8a543e64ec777ca0edc732d2af)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FVideo-Bench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-Bench)\n\n+ **PG-Video-LLaVA: Pixel Grounding Large Video-Language Models** (22 Nov 2023)\\\nMunasinghe, Shehan, et al. \\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.13435)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=4edbb942c2d20a6f5a4e3caa763a9761be953231)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4edbb942c2d20a6f5a4e3caa763a9761be953231)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FVideo-LLaVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-LLaVA)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-LLaVA\u002F)\n\n+ **Video-LLaVA: Learning United Visual Representation by Alignment Before Projection** (16 Nov 2023)\\\nLin, Bin, et al. \\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10122)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-19-blue.svg?paper=107fb6eec2febbae12db29bf3e311aaf5680027c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F107fb6eec2febbae12db29bf3e311aaf5680027c)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FVideo-LLaVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FVideo-LLaVA)\n\n+ **Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding** (14 Nov 2023)\\\nJin, Peng, et al. \\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.08046)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=aad3d2e690f6c73f04a14622ceff51464bbc560e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Faad3d2e690f6c73f04a14622ceff51464bbc560e)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FChat-UniVi.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FChat-UniVi)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FChat-UniVi\u002FChat-UniVi)\n\n\n+ **Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding** (5 Jun 2023)\\\nZhang, Hang, Xin Li, and Lidong Bing. EMNLP 2023's demo track. 
\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.02858)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-176-blue.svg?paper=5d321194696f1f75cf9da045e6022b2f20ba5b9c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5d321194696f1f75cf9da045e6022b2f20ba5b9c)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideo-LLaMA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideo-LLaMA)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FDAMO-NLP-SG\u002FVideo-LLaMA)\n\n+ **AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?** (31 Jul 2023)\\\nZhao, Qi, et al.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16368)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=6024f320e0a5b9b8fc29b86903aa9a96956b26dd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6024f320e0a5b9b8fc29b86903aa9a96956b26dd)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fbrown-palm.github.io\u002FAntGPT\u002F)\n\n+ **Valley: Video Assistant with Large Language model Enhanced ability** (12 Jun 2023)\\\nLuo, Ruipu, et al.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.07207)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-42-blue.svg?paper=4c4d176c6e28f48041f215d563f6ee8633534cff)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4c4d176c6e28f48041f215d563f6ee8633534cff)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvalley-vl.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRupertLuo\u002FValley.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FRupertLuo\u002FValley)\n\n+ **Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models** (8 Jun 2023)\\\nMuhammad Maaz, Hanoona Rasheed, Salman Khan, et al.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05424)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-107-blue.svg?paper=bf7025a2e5dbb3c09deae02a1aa98a256ca559e2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbf7025a2e5dbb3c09deae02a1aa98a256ca559e2)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FVideo-ChatGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT)\n\n+ **VideoChat: Chat-Centric Video Understanding** (10 May 2023)\\\nLi, KunChang, et al. 
\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06355)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-133-blue.svg?paper=d48cb91b9e555194f7494c4d4bb9815021d3ee45)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd48cb91b9e555194f7494c4d4bb9815021d3ee45)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FAsk-Anything.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything)\n\n+ **VideoLLM: Modeling Video Sequence with Large Language Models** (22 May 2023)\\\nChen, Guo, et al.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13292)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-33-blue.svg?paper=f9bfc6d9ba1665b73af3323d46c7642b852759ef)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff9bfc6d9ba1665b73af3323d46c7642b852759ef)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcg1177\u002FVideoLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fcg1177\u002FVideoLLM)\n\n+ **Learning video embedding space with Natural Language Supervision** (25 Mar 2023)\\\nUppala, Phani Krishna, Shriti Priya, and Vaidehi Joshi.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.14584)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=4e54a45d2118b61ae1baec07308af3fdd2c48759)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4e54a45d2118b61ae1baec07308af3fdd2c48759)\n\n\n  \n## 3D Understanding\n+ **Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding** (12 Oct 2024)\u003Cdetails>\u003Csummary>[NeurIPS 2024] Yunze Man, Shuhong Zheng, Zhipeng Bao, et al.\u003C\u002Fsummary>Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liang-Yan Gui, Yu-Xiong Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03757)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=10213cac3343964f746e99e223818b052d07b775)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F10213cac3343964f746e99e223818b052d07b775)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fyunzeman.github.io\u002Flexicon3d\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYunzeMan\u002FLexicon3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYunzeMan\u002FLexicon3D)\n\n\n+ **Situation3D: Situational Awareness Matters in 3D Vision Language Reasoning** (12 Oct 2024)\\\n[CVPR 2024] Yunze Man, Liang-Yan Gui, Yu-Xiong Wang 
\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07544)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=c31970f0b5a3ffa41ab604ae29ffd323af277538)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc31970f0b5a3ffa41ab604ae29ffd323af277538)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fyunzeman.github.io\u002Fsituation3d\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYunzeMan\u002FSituation3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYunzeMan\u002FSituation3D)\n\n\n\n+ **LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning** (30 Nov 2023)\u003Cdetails>\u003Csummary>[CVPR 2024] Sijin Chen, Xin Chen, Chi Zhang, et al.\u003C\u002Fsummary>Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, Tao Chen\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.18651)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=fc53f8f3a84f1fc4993689d8f98cf6551d07a22d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffc53f8f3a84f1fc4993689d8f98cf6551d07a22d)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpen3DA\u002FLL3DA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpen3DA\u002FLL3DA)\n\n\n+ **LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding** (21 Dec 2023)\\\nSenqiao Yang*, Jiaming Liu*, Ray Zhang, et al.\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14074)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=5edf706467dc76cd09319592d18db0ad4e1fb64d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5edf706467dc76cd09319592d18db0ad4e1fb64d)\n\n+ **3D-LLM: Injecting the 3D World into Large Language Models** (24 Jul 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023 Spotlight] Yining Hong, Haoyu Zhen, Peihao Chen, et al.\u003C\u002Fsummary>Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.12981)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-48-blue.svg?paper=7637ed79d30d0139901175ae4abedd822c217ab4)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7637ed79d30d0139901175ae4abedd822c217ab4)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUMass-Foundation-Model\u002F3D-LLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FUMass-Foundation-Model\u002F3D-LLM)\n\n+ **PointLLM: Empowering Large Language Models to Understand Point Clouds** (31 Aug 2023)\u003Cdetails>\u003Csummary>[NeurIPS 2023 Spotlight] Runsen Xu, Xiaolong Wang, Tai Wang, et al.\u003C\u002Fsummary>Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua 
Lin\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16911.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-31-blue.svg?paper=6bcc6ab9c28805d4067e99b2cdc7524550fe80e1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6bcc6ab9c28805d4067e99b2cdc7524550fe80e1)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenRobotLab\u002FPointLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FPointLLM)\n\n+ **PointCLIP: Point Cloud Understanding by CLIP** (4 Dec 2021)\u003Cdetails>\u003Csummary>[CVPR 2022] Renrui Zhang, Ziyu Guo, Wei Zhang, et al.\u003C\u002Fsummary>Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, Hongsheng Li\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02413.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-215-blue.svg?paper=f3ce9ba3fcec362b70263a7ed63d9404975496a0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff3ce9ba3fcec362b70263a7ed63d9404975496a0)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZrrSkywalker\u002FPointCLIP.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FZrrSkywalker\u002FPointCLIP)\n\n\n\n## Audio Understanding\n+ **Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action** (28 Dec 2023)\u003Cdetails>\u003Csummary>Jiasen Lu, Christopher Clark, Sangho Lee, et al.\u003C\u002Fsummary>Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17172)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=6c64ddd2190909de2c680dd18abc9b92e80c39f9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6c64ddd2190909de2c680dd18abc9b92e80c39f9)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Funified-io-2.allenai.org\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Funified-io-2.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fallenai\u002Funified-io-2)\n\n+ **M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models** (19 Nov 2023)\u003Cdetails>\u003Csummary>Atin Sakkeer Hussain, Shansong Liu, Chenshuo Sun, et al.\u003C\u002Fsummary>Atin Sakkeer Hussain, Shansong Liu, Chenshuo Sun, Ying 
Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.11255)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=1e84d7c45f70038574fcdb7bc1b20da9b348a092)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1e84d7c45f70038574fcdb7bc1b20da9b348a092)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcrypto-code.github.io\u002FM2UGen-Demo\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshansongliu\u002FM2UGen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshansongliu\u002FM2UGen)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FM2UGen\u002FM2UGen-Demo)\n\n+ **Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models** (14 Nov 2023)\u003Cdetails>\u003Csummary>Yunfei Chu, Jin Xu, Xiaohuan Zhou, et al.\u003C\u002Fsummary>Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.07919)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-14-blue.svg?paper=f90595f99a0c66d2bb6d0f230f17c7cd8c58f44d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff90595f99a0c66d2bb6d0f230f17c7cd8c58f44d)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-Audio)\n\n+ **SALMONN: Towards Generic Hearing Abilities for Large Language Models** (20 Oct 2023)\u003Cdetails>\u003Csummary>Changli Tang, Wenyi Yu, Guangzhi Sun, et al.\u003C\u002Fsummary>Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.13289)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=f72be31de9f9a09d4410fd38bc717efe43444827)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff72be31de9f9a09d4410fd38bc717efe43444827)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ftsinghua-ee\u002FSALMONN-7B-gradio)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FSALMONN.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FSALMONN)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Ftsinghua-ee\u002FSALMONN)\n\n+ **MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models** (18 Oct 2023)\u003Cdetails>\u003Csummary>Dingyao Yu, Kaitao Song, Peiling Lu, et al.\u003C\u002Fsummary>Dingyao Yu, Kaitao Song, Peiling Lu, Tianyu He, Xu Tan, Wei Ye, Shikun Zhang, Jiang 
Bian\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11954)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=beaf64df85f8204b8cd89a7f46827608e6d16922)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbeaf64df85f8204b8cd89a7f46827608e6d16922)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fmuzic\u002Ftree\u002Fmain\u002Fmusicagent.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmuzic\u002Ftree\u002Fmain\u002Fmusicagent)\n\n+ **LLark: A multimodal foundation model for music** (11 Oct 2023)\u003Cdetails>\u003Csummary>Josh Gardner, Simon Durand, Daniel Stoller, et al.\u003C\u002Fsummary>Josh Gardner, Simon Durand, Daniel Stoller, Rachel M. Bittner\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07160)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=86e75cf15a838ed7d672fb114beff727d7210ca5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F86e75cf15a838ed7d672fb114beff727d7210ca5)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fstorage.googleapis.com\u002Fmusic2text-public\u002Findex.html)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fspotify-research\u002Fllark.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fspotify-research\u002Fllark)\n\n+ **LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT** (7 Oct 2023)\u003Cdetails>\u003Csummary>Jiaming Wang, Zhihao Du, Qian Chen, et al.\u003C\u002Fsummary>Jiaming Wang, Zhihao Du, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.04673)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=ffa05cb5504ba08254f498223f613b3ebcf87692)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fffa05cb5504ba08254f498223f613b3ebcf87692)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flauragpt.github.io\u002F)\n\n+ **Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation** (29 Sep 2023)\u003Cdetails>\u003Csummary>Shih-Lun Wu, Xuankai Chang, Gordon Wichern, et al.\u003C\u002Fsummary>Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, Shinji Watanabe\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17352)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=8f0a24d1678e4d0e584b0932196cd257d5c53c7d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8f0a24d1678e4d0e584b0932196cd257d5c53c7d)\n\n+ **Connecting Speech Encoder and Large Language Model for ASR** (25 Sep 2023)\u003Cdetails>\u003Csummary>Wenyi Yu, Changli Tang, Guangzhi Sun, et al.\u003C\u002Fsummary>Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao 
Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.13963)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=5596bd3e26ec2207666ec1ff3db4415d212f14b9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5596bd3e26ec2207666ec1ff3db4415d212f14b9)\n\n+ **Can Whisper perform speech-based in-context learning?** (13 Sep 2023)\u003Cdetails>\u003Csummary>Siyin Wang, Chao-Han Huck Yang, Ji Wu, et al.\u003C\u002Fsummary>Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.07081)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=3a944ddba8b6fbaaac36126fc955f181f8b8b06a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3a944ddba8b6fbaaac36126fc955f181f8b8b06a)\n\n+ **Music Understanding LLaMA: Advancing text-to-music generation with question answering and captioning** (22 Aug 2023)\u003Cdetails>\u003Csummary>Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, et al.\u003C\u002Fsummary>Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.11276)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=a33b437618be733fea7176bd98e18b6362af0838)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fa33b437618be733fea7176bd98e18b6362af0838)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcrypto-code.github.io\u002FMU-LLaMA-Demo\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcrypto-code\u002FMU-LLaMA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fcrypto-code\u002FMU-LLaMA)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmu-llama\u002FMusicQA)\n\n+ **On decoder-only architecture for speech-to-text and large language model integration** (8 Jul 2023)\u003Cdetails>\u003Csummary>Jian Wu, Yashesh Gaur, Zhuo Chen, et al.\u003C\u002Fsummary>Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.03917)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-24-blue.svg?paper=8e1868f84091272544cb4209c4ccaad7cc88af27)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8e1868f84091272544cb4209c4ccaad7cc88af27)\n\n+ **AudioPaLM: A Large Language Model That Can Speak and Listen** (22 Jun 2023)\u003Cdetails>\u003Csummary>Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, et al.\u003C\u002Fsummary>Paul K. 
Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.12925)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-69-blue.svg?paper=3efb81de24eb88017d6dbcf22cb4215084223fd8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3efb81de24eb88017d6dbcf22cb4215084223fd8)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgoogle-research.github.io\u002Fseanet\u002Faudiopalm\u002Fexamples\u002F)\n\n+ **HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face** (30 Mar 2023)\u003Cdetails>\u003Csummary>Yongliang Shen, Kaitao Song, Xu Tan, et al.\u003C\u002Fsummary>Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.17580)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-413-blue.svg?paper=d1120d67b700e4dfe8b39eb1e48fbdea4e1a0c43)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd1120d67b700e4dfe8b39eb1e48fbdea4e1a0c43)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FJARVIS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FJARVIS)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FHuggingGPT)\n\n+ **Sparks of Artificial General Intelligence: Early experiments with GPT-4** (22 Mar 2023)\u003Cdetails>\u003Csummary>Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, et al.\u003C\u002Fsummary>Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12712)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1407-blue.svg?paper=574beee702be3856d60aa482ec725168fe64fc99)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F574beee702be3856d60aa482ec725168fe64fc99)\n\n+ **Listen, Think, and Understand** (18 May 2023)\u003Cdetails>\u003Csummary>Yuan Gong, Hongyin Luo, Alexander H. Liu, et al.\u003C\u002Fsummary>Yuan Gong, Hongyin Luo, Alexander H. 
Liu, Leonid Karlinsky, James Glass\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10790)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-33-blue.svg?paper=4bb0b12803791764d641a4cef1e0ce39cf049542)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4bb0b12803791764d641a4cef1e0ce39cf049542)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fyuangongfdu\u002FLTU)\n\n+ **SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities** (18 May 2023)\u003Cdetails>\u003Csummary>Dong Zhang, Shimin Li, Xin Zhang, et al.\u003C\u002Fsummary>Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11000)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-76-blue.svg?paper=5cac6430bd379c9d2fe13137dfd6ae7721a2679f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5cac6430bd379c9d2fe13137dfd6ae7721a2679f)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002F0nutation.github.io\u002FSpeechGPT.github.io\u002F)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F0nutation\u002FSpeechGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002F0nutation\u002FSpeechGPT)\n\n+ **AudioGPT: Understanding and generating speech, music, sound, and talking head** (25 Apr 2023)\u003Cdetails>\u003Csummary>Rongjie Huang, Mingze Li, Dongchao Yang, et al.\u003C\u002Fsummary>Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.12995)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-83-blue.svg?paper=8bc617c9139648d7a92991d70c671230bac7b2e2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8bc617c9139648d7a92991d70c671230bac7b2e2)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIGC-Audio\u002FAudioGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAIGC-Audio\u002FAudioGPT)\n[![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAIGC-Audio\u002FAudioGPT)\n\n\n\n# 📍 Multimodal LLM Safety\n## Attack\n\n+ **Jailbreaking GPT-4V via self-adversarial attacks with system prompts.** (20 Jan 2024)\u003Cdetails>\u003Csummary>Yuanwei Wu, Xiang Li, Yixin Liu, et al.\u003C\u002Fsummary>Yuanwei Wu, Xiang Li, Yixin Liu, Pan Zhou, Lichao Sun\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.09127.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=18a8b97d75a87e8fef07542d8875d4a62b553744)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F18a8b97d75a87e8fef07542d8875d4a62b553744)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FThuCCSLab\u002Flm-ssp.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FThuCCSLab\u002Flm-ssp)\n\n+ **Defending ChatGPT against 
jailbreak attack via self-reminders.** (1 Dec 2023)\u003Cdetails>\u003Csummary>Yueqi Xie, Jingwei Yi, Jiawei Shao, et al.\u003C\u002Fsummary>Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, Fangzhao Wu\u003C\u002Fdetails>\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-32-blue.svg?paper=e762f92273cd96f63b7788c0173b9b6450adedd7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe762f92273cd96f63b7788c0173b9b6450adedd7)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyjw1029\u002FSelf-Reminder.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyjw1029\u002FSelf-Reminder)\n\n+ **Misusing Tools in Large Language Models With Visual Adversarial Examples** (4 Oct 2023)\u003Cdetails>\u003Csummary>Xiaohan Fu, Zihan Wang, Shuheng Li, et al.\u003C\u002Fsummary>Xiaohan Fu, Zihan Wang, Shuheng Li, Rajesh K. Gupta, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Earlence Fernandes\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.03185.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=ac5b4df0e398ca48388330ac5c795b6fe708793c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fac5b4df0e398ca48388330ac5c795b6fe708793c)\n\n+ **Image Hijacks: Adversarial Images can Control Generative Models at Runtime.** (18 Sep 2023)\u003Cdetails>\u003Csummary>Luke Bailey, Euan Ong, Stuart Russell, et al.\u003C\u002Fsummary>Luke Bailey, Euan Ong, Stuart Russell, Scott Emmons\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.00236.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-22-blue.svg?paper=5bdaadb84db0cbf72aaebda9f55f4288b63c6e9b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5bdaadb84db0cbf72aaebda9f55f4288b63c6e9b)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Feuanong\u002Fimage-hijacks.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Feuanong\u002Fimage-hijacks)\n\n+ **Universal and Transferable Adversarial Attacks on Aligned Language Models** (27 Jul 2023)\u003Cdetails>\u003Csummary>Andy Zou, Zifan Wang, Nicholas Carlini, et al.\u003C\u002Fsummary>Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. 
Zico Kolter, Matt Fredrikson\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.15043.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-309-blue.svg?paper=47030369e97cc44d4b2e3cf1be85da0fd134904a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F47030369e97cc44d4b2e3cf1be85da0fd134904a)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fllm-attacks\u002Fllm-attacks.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fllm-attacks\u002Fllm-attacks)\n\n+ **Prompt injection attack against LLM-integrated applications** (8 Jun 2023)\u003Cdetails>\u003Csummary>Yi Liu, Gelei Deng, Yuekang Li, et al.\u003C\u002Fsummary>Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, Yang Liu\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05499.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-77-blue.svg?paper=db4cf9f6a653d5c15973e836c800ea47743251ae)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdb4cf9f6a653d5c15973e836c800ea47743251ae)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLMSecurity\u002FHouYi.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLLMSecurity\u002FHouYi)\n\n+ **Automatically Auditing Large Language Models via Discrete Optimization** (8 Mar 2023)\u003Cdetails>\u003Csummary>Erik Jones, Anca Dragan, Aditi Raghunathan, et al.\u003C\u002Fsummary>Erik Jones, Anca Dragan, Aditi Raghunathan, Jacob Steinhardt\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.04381.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-64-blue.svg?paper=2f94f03fdac62d05f0f416b7b3855d1f597afee9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2f94f03fdac62d05f0f416b7b3855d1f597afee9)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fejones313\u002Fauditing-llms.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fejones313\u002Fauditing-llms)\n\n+ **Poisoning Web-Scale Training Datasets is Practical** (20 Feb 2023)\u003Cdetails>\u003Csummary>Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, et al.\u003C\u002Fsummary>Nicholas Carlini, Matthew Jagielski, Christopher A. 
Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.10149.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-61-blue.svg?paper=2cf43a61d0937ad25f23eaef7c90253ab799b3c7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2cf43a61d0937ad25f23eaef7c90253ab799b3c7)\n\n+ **Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks.** (11 Feb 2023)\u003Cdetails>\u003Csummary>Daniel Kang, Xuechen Li, Ion Stoica, et al.\u003C\u002Fsummary>Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.05733.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-94-blue.svg?paper=0cf694b8f85ab2e11d45595de211a15cfbadcd22)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0cf694b8f85ab2e11d45595de211a15cfbadcd22)\n\n+ **Ignore previous prompt: Attack techniques for language models** (17 Nov 2022)\\\nFábio Perez, Ian Ribeiro (NeurIPS 2022 Workshop)\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.09527.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-151-blue.svg?paper=9716a2876d08fce9d8e5c5ba4d7b1a9af44806d6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9716a2876d08fce9d8e5c5ba4d7b1a9af44806d6)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fagencyenterprise\u002FPromptInject.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fagencyenterprise\u002FPromptInject)\n\n+ **Universal Adversarial Triggers for Attacking and Analyzing NLP** (20 Aug 2019)\u003Cdetails>\u003Csummary>Eric Wallace, Shi Feng, Nikhil Kandpal, et al. 
(EMNLP 2019)\u003C\u002Fsummary>Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.07125.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-613-blue.svg?paper=18a1c21f35153c45d0ef30c564bffb7d70a13ccc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F18a1c21f35153c45d0ef30c564bffb7d70a13ccc)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEric-Wallace\u002Funiversal-triggers.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FEric-Wallace\u002Funiversal-triggers)\n\n+ **Adversarial Examples for Evaluating Reading Comprehension Systems** (23 Jul 2017)\\\nRobin Jia, Percy Liang (EMNLP 2017)\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.07328.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1445-blue.svg?paper=ffb949d3493c3b2f3c9acf9c75cb03938933ddf0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fffb949d3493c3b2f3c9acf9c75cb03938933ddf0)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frobinjia\u002Fadversarial-squad.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Frobinjia\u002Fadversarial-squad)\n\n\n\n## Defense and Detect\n+ **Detecting and correcting hate speech in multimodal memes with large visual language model.** (12 Nov 2023)\\\nMinh-Hao Van, Xintao Wu\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.06737.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=60f4dc690ea42fb77b04fc685e9d9c3a1e209319)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F60f4dc690ea42fb77b04fc685e9d9c3a1e209319)\n\n+ **Detecting Pretraining Data from Large Language Models** (3 Nov 2023)\u003Cdetails>\u003Csummary>Weijia Shi, Anirudh Ajith, Mengzhou Xia, et al.\u003C\u002Fsummary>Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.16789.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-20-blue.svg?paper=3422d5e0cdfdc935d6a84a1e3d3f96659265fe3a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3422d5e0cdfdc935d6a84a1e3d3f96659265fe3a)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fswj0419\u002Fdetect-pretrain-code.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fswj0419\u002Fdetect-pretrain-code)\n\n+ **Jailbreak and guard aligned language models with only few in-context demonstrations** (10 Oct 2023)\\\nZeming Wei, Yifei Wang, Yisen Wang\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06387.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-43-blue.svg?paper=6b135e922a0c673aeb0b05c5aeecdb6c794791c6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6b135e922a0c673aeb0b05c5aeecdb6c794791c6)\n\n+ **SmoothLLM: Defending large language models against jailbreaking attacks.** (5 Oct 2023)\u003Cdetails>\u003Csummary>Alexander Robey, Eric Wong, Hamed Hassani, et al.\u003C\u002Fsummary>Alexander Robey, Eric Wong, 
Hamed Hassani, George J. Pappas\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.03684.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-46-blue.svg?paper=8cf9b49698fdb1b754df2556576412a7b44929f6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8cf9b49698fdb1b754df2556576412a7b44929f6)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Farobey1\u002Fsmooth-llm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Farobey1\u002Fsmooth-llm)\n\n+ **A Watermark for Large Language Models** (6 Jun 2023)\u003Cdetails>\u003Csummary>John Kirchenbauer, Jonas Geiping, Yuxin Wen, et al. (ICML 2023)\u003C\u002Fsummary>John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.10226.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-199-blue.svg?paper=cb5b71a622aff47014d4f28a958679629a8b6363)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcb5b71a622aff47014d4f28a958679629a8b6363)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBrianPulfer\u002FLMWatermark.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FBrianPulfer\u002FLMWatermark)\n\n+ **Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models** (23 May 2023)\u003Cdetails>\u003Csummary>Yiting Qu, Xinyue Shen, Xinlei He, et al. (ACM CCS 2023)\u003C\u002Fsummary>Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, Yang Zhang\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13873.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-29-blue.svg?paper=c9e548d72f5ad72215025602be36f72042219baf)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc9e548d72f5ad72215025602be36f72042219baf)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYitingQu\u002Funsafe-diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYitingQu\u002Funsafe-diffusion)\n\n+ **TRAK: Attributing Model Behavior at Scale** (3 Apr 2023)\u003Cdetails>\u003Csummary>Sung Min Park, Kristian Georgiev, Andrew Ilyas, et al.\u003C\u002Fsummary>Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, Aleksander Madry\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.14186.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-32-blue.svg?paper=4f2ae5fa2dc74af9c36ee57b359a4b3241006a92)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4f2ae5fa2dc74af9c36ee57b359a4b3241006a92)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMadryLab\u002Ftrak.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMadryLab\u002Ftrak)\n\n+ **Poisoning Web-Scale Training Datasets is Practical** (20 Feb 2023)\u003Cdetails>\u003Csummary>Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, et al.\u003C\u002Fsummary>Nicholas Carlini, Matthew Jagielski, Christopher A. 
Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.10149.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-61-blue.svg?paper=2cf43a61d0937ad25f23eaef7c90253ab799b3c7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2cf43a61d0937ad25f23eaef7c90253ab799b3c7)\n\n+ **Mitigating Inappropriate Degeneration in Diffusion Models** (9 Nov 2022)\u003Cdetails>\u003Csummary>Patrick Schramowski, Manuel Brack, Björn Deiseroth, et al. (CVPR 2023)\u003C\u002Fsummary>Patrick Schramowski, Manuel Brack, Björn Deiseroth, Kristian Kersting\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.05105.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-76-blue.svg?paper=0231f2aed9a96cb516242fb57f2cb63f5651c4d8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0231f2aed9a96cb516242fb57f2cb63f5651c4d8)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fml-research\u002Fsafe-latent-diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fml-research\u002Fsafe-latent-diffusion)\n\n+ **Extracting Training Data from Large Language Models** (15 Jun 2021)\u003Cdetails>\u003Csummary>Nicholas Carlini, Florian Tramèr, Eric Wallace, et al.\u003C\u002Fsummary>Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.07805.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1005-blue.svg?paper=df7d26339adf4eb0c07160947b9d2973c24911ba)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdf7d26339adf4eb0c07160947b9d2973c24911ba)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshreyansh26\u002FExtracting-Training-Data-from-Large-Langauge-Models.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshreyansh26\u002FExtracting-Training-Data-from-Large-Langauge-Models)\n\n\n\n## Alignment\n+ **Direct Preference Optimization: Your Language Model is Secretly a Reward Model** (13 Dec 2023)\u003Cdetails>\u003Csummary>Rafael Rafailov, Archit Sharma, Eric Mitchell, et al.\u003C\u002Fsummary>Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18290.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-450-blue.svg?paper=0d1c76d45afa012ded7ab741194baf142117c495)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0d1c76d45afa012ded7ab741194baf142117c495)\n\n+ **RAFT: Reward ranked fine tuning for generative foundation model alignment** (1 Dec 2023)\u003Cdetails>\u003Csummary>Hanze Dong, Wei Xiong, Deepanshu Goyal, et al. 
(Transactions on Machine Learning Research (TMLR))\u003C\u002Fsummary>Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.06767.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-116-blue.svg?paper=3ab661db57d924f4ff1706e05ac807873ca00e0a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3ab661db57d924f4ff1706e05ac807873ca00e0a)\n\n+ **Better aligning text-to-image models with human preference** (22 Aug 2023)\u003Cdetails>\u003Csummary>Xiaoshi Wu, Keqiang Sun, Feng Zhu, et al. (ICCV 2023)\u003C\u002Fsummary>Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, Hongsheng Li\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.14420.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-44-blue.svg?paper=14c3cf58192774b9b6fc6188df99efd6ab5fc739)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F14c3cf58192774b9b6fc6188df99efd6ab5fc739)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftgxs002\u002Falign_sd.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ftgxs002\u002Falign_sd)\n\n+ **Scalable agent alignment via reward modeling: a research direction** (19 Nov 2018)\u003Cdetails>\u003Csummary>Jan Leike, David Krueger, Tom Everitt, et al.\u003C\u002Fsummary>Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.07871.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-244-blue.svg?paper=c6f913e4baa7f2c85363c0625c87003ad3b3a14c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc6f913e4baa7f2c85363c0625c87003ad3b3a14c)\n\n+ **Proximal policy optimization algorithms** (20 Jul 2017)\u003Cdetails>\u003Csummary>John Schulman, Filip Wolski, Prafulla Dhariwal, et al.\u003C\u002Fsummary>John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.06347.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12282-blue.svg?paper=dce6f9d4017b1785979e7520fd0834ef8cf02f4b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdce6f9d4017b1785979e7520fd0834ef8cf02f4b)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmorikatron\u002FPPO.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmorikatron\u002FPPO)\n\n\n\n\n\n## Datasets\n+ **GOAT-Bench: Safety insights to large multimodal models through meme-based social abuse.** (7 Jan 2024)\u003Cdetails>\u003Csummary>Hongzhan Lin, Ziyang Luo, Bo Wang, et al.\u003C\u002Fsummary>Hongzhan Lin, Ziyang Luo, Bo Wang, Ruichao Yang, Jing 
Ma\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.01523.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=d98aa44f79fe798ad5ff0cac6e7bf32ee30bd156)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd98aa44f79fe798ad5ff0cac6e7bf32ee30bd156)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FisXinLiu\u002FMLLM-Safety-Collection.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FisXinLiu\u002FMLLM-Safety-Collection)\n\n+ **ToViLaG: Your visual-language generative model is also an evildoer.** (13 Dec 2023)\u003Cdetails>\u003Csummary>Xinpeng Wang, Xiaoyuan Yi, Han Jiang, et al. (EMNLP 2023 Oral)\u003C\u002Fsummary>Xinpeng Wang, Xiaoyuan Yi, Han Jiang, Shanlin Zhou, Zhihua Wei, Xing Xie\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](http:\u002F\u002Fexport.arxiv.org\u002Fabs\u002F2312.11523)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=10280c290825fc0b0c884e988f4f1dedb80e4e80)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F10280c290825fc0b0c884e988f4f1dedb80e4e80)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvictorup\u002FToViLaG.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fvictorup\u002FToViLaG)\n\n+ **FigStep: Jailbreaking large vision-language models via typographic visual prompts.** (13 Dec 2023)\u003Cdetails>\u003Csummary>Yichen Gong, Delong Ran, Jinyuan Liu, et al.\u003C\u002Fsummary>Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.05608.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=b78b5ce5f21f46d8149824463f8eebd6103d49aa)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb78b5ce5f21f46d8149824463f8eebd6103d49aa)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FThuCCSLab\u002FFigStep.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FThuCCSLab\u002FFigStep)\n\n+ **Query-relevant images jailbreak large multi-modal models.** (29 Nov 2023)\u003Cdetails>\u003Csummary>Xin Liu, Yichen Zhu, Yunshi Lan, et al.\u003C\u002Fsummary>Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, Yu Qiao\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17600.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=74423a9ee66085e74cd2b2e42303f28359c74eb6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F74423a9ee66085e74cd2b2e42303f28359c74eb6)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FisXinLiu\u002FMM-SafetyBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FisXinLiu\u002FMM-SafetyBench)\n\n+ **DRESS: Instructing large vision-language models to align and interact with humans via natural language feedback.** (16 Nov 2023)\u003Cdetails>\u003Csummary>Yangyi Chen, Karan Sikka, Michael Cogswell, et al.\u003C\u002Fsummary>Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay 
Divakaran\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10081.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=391eaeb1092c2b145ff0e5a2fa61637a42921fce)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F391eaeb1092c2b145ff0e5a2fa61637a42921fce)\n\n+ **BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset** (7 Nov 2023)\u003Cdetails>\u003Csummary>Jiaming Ji, Mickel Liu, Juntao Dai, et al. (NeurIPS 2023)\u003C\u002Fsummary>Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.04657.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-56-blue.svg?paper=92930ed3560ea6c86d53cf52158bc793b089054d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F92930ed3560ea6c86d53cf52158bc793b089054d)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGaryYufei\u002FAlignLLMHumanSurvey.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FGaryYufei\u002FAlignLLMHumanSurvey)\n\n+ **Can pre-trained vision and language models answer visual information-seeking questions?** (17 Oct 2023)\u003Cdetails>\u003Csummary>Yang Chen, Hexiang Hu, Yi Luan, et al. (EMNLP 2023)\u003C\u002Fsummary>Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, Ming-Wei Chang\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.11713.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-19-blue.svg?paper=f890b4dfe915174b23db909b07c515d465eaeff2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff890b4dfe915174b23db909b07c515d465eaeff2)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fedchengg\u002Finfoseek_eval.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fedchengg\u002Finfoseek_eval)\n\n+ **Can language models be instructed to protect personal information?** (3 Oct 2023)\u003Cdetails>\u003Csummary>Yang Chen, Ethan Mendes, Sauvik Das, et al.\u003C\u002Fsummary>Yang Chen, Ethan Mendes, Sauvik Das, Wei Xu, Alan Ritter\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02224.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-14-blue.svg?paper=2403c8e72a90d9c778970fc0812ecdcc58800c5d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2403c8e72a90d9c778970fc0812ecdcc58800c5d)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fethanm88\u002Fllm-access-control.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fethanm88\u002Fllm-access-control)\n\n+ **SafetyBench: Evaluating the safety of large language models with multiple choice questions** (13 Sep 2023)\u003Cdetails>\u003Csummary>Zhexin Zhang, Leqi Lei, Lindong Wu, et al.\u003C\u002Fsummary>Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie 
Huang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.07045.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=9b9a4fa3ed510fc6eb1bf831979235f3d9f8b556)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9b9a4fa3ed510fc6eb1bf831979235f3d9f8b556)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-coai\u002FSafetyBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-coai\u002FSafetyBench)\n\n+ **Safety assessment of Chinese large language models** (20 Apr 2023)\u003Cdetails>\u003Csummary>Hao Sun, Zhexin Zhang, Jiawen Deng, et al.\u003C\u002Fsummary>Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, Minlie Huang\u003C\u002Fdetails>\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.10436.pdf)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-33-blue.svg?paper=59fc49dfd81b92661437eaf7e339c0792ccd8755)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F59fc49dfd81b92661437eaf7e339c0792ccd8755)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-coai\u002FSafety-Prompts.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-coai\u002FSafety-Prompts)\n\n\n\n\n\n## 3D, Video and Audio Safety\n\n+ **Not My Voice! A Taxonomy of Ethical and Safety Harms of Speech Generators** (25 Jan 2024)\\\nWiebke Hutiri, Orestis Papakyriakopoulos, Alice Xiang\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01708)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=Not-My-Voice!-A-Taxonomy-of-Ethical-an)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FNot-My-Voice!-A-Taxonomy-of-Ethical-an)\n\n+ **Adv3D: Generating 3D Adversarial Examples in Driving Scenarios with NeRF** (4 Sep 2023)\\\nLeheng Li, Qing Lian, Ying-Cong Chen\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.01351)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=daa6a6b2c495d002d72075c6203c98061d1e35f9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdaa6a6b2c495d002d72075c6203c98061d1e35f9)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEnVision-Research\u002FAdv3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FEnVision-Research\u002FAdv3D)\n\n+ **Deepfake Video Detection Using Generative Convolutional Vision Transformer** (13 Jul 2023)\\\nDeressa Wodajo, Solomon Atnafu, Zahid Akhtar\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.07036)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=86301139cc02eb53247e63fca91b916348591505)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F86301139cc02eb53247e63fca91b916348591505)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ferprogs\u002FGenConViT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ferprogs\u002FGenConViT)\n\n+ **M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection** (19 Apr 2022)\\\nJunke Wang, Zuxuan Wu, Wenhao Ouyang, Xintong Han, Jingjing Chen, Ser-Nam Lim, Yu-Gang 
Jiang\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.09770)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-132-blue.svg?paper=21e0858665cddf51689fc680f72ec4e00b68ae04)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F21e0858665cddf51689fc680f72ec4e00b68ae04)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwangjk666\u002FM2TR-Multi-modal-Multi-scale-Transformers-for-Deepfake-Detection.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fwangjk666\u002FM2TR-Multi-modal-Multi-scale-Transformers-for-Deepfake-Detection)\n\n+ **Deepfake Video Detection Using Convolutional Vision Transformer** (11 Mar 2021)\\\nDeressa Wodajo, Solomon Atnafu\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.11126)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ferprogs\u002FGenConViT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ferprogs\u002FGenConViT)\n\n+ **Deepfakes Generation and Detection: State-of-the-art, open challenges, countermeasures, and way forward** (25 Feb 2021)\\\nMomina Masood, Marriam Nawaz, Khalid Mahmood Malik, Ali Javed, Aun Irtaza\\\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.00484)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-123-blue.svg?paper=e8f1c51c4e881345c0588bec8aa8bc6d9164a535)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe8f1c51c4e881345c0588bec8aa8bc6d9164a535)\n\n\n\n# 📍 Related Surveys\n## LLM\n\n\n+ **MM-LLMs: Recent Advances in MultiModal Large Language Models** (24 Jan 2024)\u003Cdetails>\u003Csummary>Duzhen Zhang, Yahan Yu, Chenxing Li, et al.\u003C\u002Fsummary>Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, Dong Yu\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.13601)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=a050c9b0c321839e4427ab9defa3463be7825ac4)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fa050c9b0c321839e4427ab9defa3463be7825ac4)\n[![Project_Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmm-llms.github.io\u002F)\n\n\n+ **A Survey on Multimodal Large Language Models** (23 Jun 2023)\u003Cdetails>\u003Csummary>Shukang Yin, Chaoyou Fu, Sirui Zhao, et al.\u003C\u002Fsummary>Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong 
Chen\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.13549)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-114-blue.svg?paper=ebedc4d7a2356090904baba4104ef0832bc236df)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Febedc4d7a2356090904baba4104ef0832bc236df)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models)\n\n\n+ **Multimodal Large Language Models: A Survey** (22 Nov 2023)\u003Cdetails>\u003Csummary>[IEEE BigData 2023] Jiayang Wu, Wensheng Gan, Zefeng Chen, et al.\u003C\u002Fsummary>Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, Philip S. Yu\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.13165)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-14-blue.svg?paper=52941cadbd340344f3e0a6f50719fe55b3de5088)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F52941cadbd340344f3e0a6f50719fe55b3de5088)\n\n\n+ **A Survey of Large Language Models** (31 Mar 2023)\u003Cdetails>\u003Csummary>Wayne Xin Zhao, Kun Zhou, Junyi Li, et al.\u003C\u002Fsummary>Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.18223)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-875-blue.svg?paper=c61d54644e9aedcfc756e5d6fe4cc8b78c87755d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc61d54644e9aedcfc756e5d6fe4cc8b78c87755d)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FLLMSurvey.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FLLMSurvey?tab=readme-ov-file#timeline-of-llms)\n\n\n## Vision\n\n\n\n+ **Autoregressive Models in Vision: A Survey** (8 Nov 2024)\u003Cdetails>\u003Csummary>Jing Xiong, Gongye Liu, Lun Huang, et al.\u003C\u002Fsummary>Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.05902)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChaofanTao\u002FAutoregressive-Models-in-Vision-Survey)](https:\u002F\u002Fgithub.com\u002FChaofanTao\u002FAutoregressive-Models-in-Vision-Survey)\n\n+ **State of the Art on Diffusion Models for Visual Computing** (11 Oct 2023)\u003Cdetails>\u003Csummary>Ryan Po, Wang Yifan, Vladislav Golyanik, et al.\u003C\u002Fsummary>Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T. Barron, Amit H. Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, C. 
Karen Liu, Lingjie Liu, Ben Mildenhall, Matthias Nießner, Björn Ommer, Christian Theobalt, Peter Wonka, Gordon Wetzstein\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07204)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-27-blue.svg?paper=6487ec82f6d8082a5b402a5416ea03009acb1679)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6487ec82f6d8082a5b402a5416ea03009acb1679)\n\n+ **Diffusion Models in Vision: A Survey** (10 Sep 2022)\u003Cdetails>\u003Csummary>[TPAMI 2023] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, et al. \u003C\u002Fsummary>Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah\u003C\u002Fdetails>[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.04747)\n[![citation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-371-blue.svg?paper=efa1647594b236361610a20d507127f0586a379b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fefa1647594b236361610a20d507127f0586a379b)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCroitoruAlin\u002FDiffusion-Models-in-Vision-A-Survey.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FCroitoruAlin\u002FDiffusion-Models-in-Vision-A-Survey)\n\n\n\n\n# 👨‍💻 Team\n\nHere is the list of our contributors for each modality of this repository.\n\n| Modality\u002FTask                      |  Contributors                                                 |\n| ----------------------------- | -------------------------------------------------------------------- |\n| Image Generation | Jingye Chen, Xiaowei Chi, Yingqing He                                       |\n| Video Generation | Yingqing He, Xiaowei Chi, Jingye Chen           |\n| Image and Video Editing           | Yazhou Xing |\n| 3D Generation and Editing            | Hongyu Liu |\n| Audio Generation and Editing         | Zeyue Tian, Ruibin Yuan                            |\n| LLM Agent           | Zhaoyang Liu                         |\n| Safety           | Runtao Liu                         |\n| Leaders               | Yingqing He, Zhaoyang Liu                                                   |\n\n# 😉 Citation\nIf you find this work useful in your research, please cite the paper as below:\n```bib\n@article{he2024llms,\n    title={LLMs Meet Multimodal Generation and Editing: A Survey},\n    author={He, Yingqing and Liu, Zhaoyang and Chen, Jingye and Tian, Zeyue and Liu, Hongyu and Chi, Xiaowei and Liu, Runtao and Yuan, Ruibin and Xing, Yazhou and Wang, Wenhai and Dai, Jifeng and Zhang, Yong and Xue, Wei and Liu, Qifeng and Guo, Yike and Chen, Qifeng},\n    journal={arXiv preprint arXiv:2405.19334},\n    year={2024},\n}\n```\n\n# ⭐️ Star History\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FYingqingHe_Awesome-LLMs-meet-Multimodal-Generation_readme_35f851bb94d9.png)](https:\u002F\u002Fstar-history.com\u002F#YingqingHe\u002FAwesome-LLMs-meet-Multimodal-Generation&Date)\n","\u003Cdiv align=\"center\">\n\u003Ch2> 大型语言模型与多模态生成及编辑：综述 \u003C\u002Fh2> \n\u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.19334'>\u003Cimg 
src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArXiv-2405.19334-red'>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n# 🤗 引言\n- 本仓库包含一份精心整理的“大型语言模型与多模态生成”相关资源列表。这里的模态包括视觉（如图像、视频和3D）以及音频（如声音、语音和音乐）。 \n  \u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FYingqingHe_Awesome-LLMs-meet-Multimodal-Generation_readme_73831b1439a4.jpg\" width=\"300\">\n\u003C\u002Fp>\n\n- 我们欢迎对本仓库的任何贡献和建议，也欢迎添加您自己的研究成果。请随时提交拉取请求或留下您的评论！！\n\n\n# 📋 目录\n- [🤗 引言](#-introduction)\n- [📋 目录](#-contents)\n- [💘 小贴士](#-tips)\n- [📍 多模态生成](#-multimodal-generation)\n  - [图像生成](#image-generation)\n    - [🔅 基于LLM](#-llm-based)\n    - [非LLM类（Clip\u002FT5）](#non-llm-based-clipt5)\n    - [数据集](#datasets)\n  - [视频生成](#video-generation)\n    - [🔅 基于LLM](#-llm-based-1)\n    - [非LLM类](#non-llm-based)\n    - [视频VAE\u002F分词器](#video-vaetokenizers)\n    - [音频-视频](#audio-video)\n    - [基准测试](#benchmarks)\n    - [数据集](#datasets-1)\n  - [3D生成](#3d-generation)\n    - [🔅 基于LLM](#-llm-based-2)\n    - [非LLM类（Clip\u002FT5）](#non-llm-based-clipt5-1)\n    - [数据集](#datasets-2)\n  - [音频生成](#audio-generation)\n    - [🔅 基于LLM](#-llm-based-3)\n    - [非LLM类](#non-llm-based-1)\n    - [数据集](#datasets-3)\n  - [多模态生成](#generation-with-multiple-modalities)\n    - [🔅 基于LLM](#-llm-based-4)\n    - [非LLM类](#non-llm-based-2)\n- [📍 多模态编辑](#-multimodal-editing)\n  - [图像编辑](#image-editing)\n    - [🔅 基于LLM](#-llm-based-5)\n    - [非LLM类（Clip\u002FT5）](#non-llm-based-clipt5-2)\n  - [视频编辑](#video-editing)\n    - [🔅 基于LLM](#-llm-based-6)\n    - [非LLM类（Clip\u002FT5）](#non-llm-based-clipt5-3)\n  - [3D编辑](#3d-editing)\n    - [🔅 基于LLM](#-llm-based-7)\n    - [非LLM类（Clip\u002FT5）](#non-llm-based-clipt5-4)\n  - [音频编辑](#audio-editing)\n    - [🔅 基于LLM](#-llm-based-8)\n    - [非LLM类（Clip\u002FT5）](#non-llm-based-clipt5-5)\n- [📍 多模态智能体](#-multimodal-agents)\n- [📍 LLM驱动的多模态理解](#-multimodal-understanding-with-llms)\n  - [多模态](#multiple-modalities)\n  - [图像理解](#image-understanding)\n  - [视频理解](#video-understanding)\n  - [3D理解](#3d-understanding)\n  - [音频理解](#audio-understanding)\n- [📍 多模态LLM安全](#-multimodal-llm-safety)\n  - [攻击](#attack)\n  - [防御与检测](#defense-and-detect)\n  - [对齐](#alignment)\n  - [数据集](#datasets-4)\n  - [3D、视频和音频安全](#3d-video-and-audio-safety)\n- [📍 相关综述](#-related-surveys)\n  - [LLM](#llm)\n  - [视觉](#vision)\n- [👨‍💻 团队](#-team)\n- [😉 引用](#-citation)\n- [⭐️ 星标历史](#️-star-history)\n\n\n\n# 💘 小贴士\n- **✅ 通过目录搜索论文**：直接点击目录内容，选择您感兴趣的研究领域，即可浏览相关论文。\n- **✅ 通过作者姓名搜索论文**：您可以使用 `ctrl + F` 快捷键并输入作者姓名来查找特定作者的论文。搜索时，作者下拉列表会自动展开。\n- **✅ 通过标签搜索论文**：您还可以通过以下标签搜索相关论文：`customization`、`interactive`、`human motion generation`、`tokenizer`。（更多标签正在持续更新中）\n\n\n# 📍 多模态生成\n\n## 图像生成\n\n### 🔅 基于LLM\n\n\n+ **我思故我扩散：在扩散模型中实现多模态上下文推理**（2025年2月12日）\u003Cdetails>\u003Csummary>米振兴、王冠杰、钱国成等\u003C\u002Fsummary>米振兴、王冠杰、钱国成、叶汉荣、刘润涛、谢尔盖·图利亚科夫、克菲尔·阿伯曼、徐丹\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10458)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiZhenxing\u002FThinkDiff.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMiZhenxing\u002FThinkDiff?tab=readme-ov-file)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmizhenxing.github.io\u002FThinkDiff\u002F)\n\n+ 
**MetaMorph：通过指令微调实现多模态理解和生成**（2024年12月18日）\u003Cdetails>\u003Csummary>童升邦、范大卫、朱嘉辰等\u003C\u002Fsummary>童升邦、范大卫、朱嘉辰、熊云阳、陈新磊、考斯图夫·辛哈、迈克尔·拉巴特、扬·勒丘恩、谢赛宁、刘壮\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14164v1)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ftsb0601.github.io\u002Fmetamorph\u002F)\n\n+ **X-Prompt：迈向自回归视觉语言基础模型中的通用上下文图像生成**（2024年12月2日）\u003Cdetails>\u003Csummary>孙泽义、储子洋、张攀等\u003C\u002Fsummary>孙泽义、储子洋、张攀、吴彤、董晓艺、臧宇航、熊元俊、林大华、王佳琪\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.01824)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSunzeY\u002FX-Prompt.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FSunzeY\u002FX-Prompt)\n\n\n+ **Cosmos Tokenizer：一套图像和视频神经分词器**（2024年11月6日）\u003Cdetails>\u003Csummary>菲茨姆·雷达、顾金伟、刘贤等\u003C\u002Fsummary>菲茨姆·雷达、顾金伟、刘贤、葛松伟、王廷春、王浩翔、刘明宇\u003C\u002Fdetails>\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FCosmos-Tokenizer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FCosmos-Tokenizer?tab=readme-ov-file)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fdir\u002Fcosmos-tokenizer\u002F)`tokenizer`\n\n\n\n+ **稀有到常见：借助LLM指导，在罕见概念上释放扩散模型的组合生成能力**（2024年10月29日）\u003Cdetails>\u003Csummary>[ICLR 2025 Spotlight] 朴东民、金世彬、文泰洪等\u003C\u002Fsummary>朴东民、金世彬、文泰洪、金珉奎、李康旭、曹在雄\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22376)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkrafton-ai\u002FRare-to-Frequent.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fkrafton-ai\u002FRare-to-Frequent)\n\n+ **ElasticTok：面向图像和视频的自适应分词**（2024年10月10日）\u003Cdetails>\u003Csummary>威尔逊·颜、马泰·扎哈里亚、沃洛迪米尔·姆尼赫等\u003C\u002Fsummary>威尔逊·颜、马泰·扎哈里亚、沃洛迪米尔·姆尼赫、皮特·阿贝尔、亚历山德拉·福斯特、刘浩\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08368)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLargeWorldModel\u002FElasticTok.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLargeWorldModel\u002FElasticTok)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flargeworldmodel.github.io\u002Felastictok\u002F)`tokenizer`\n\n+ **DART：用于可扩展文本到图像生成的去噪自回归Transformer**（2024年10月10日）\u003Cdetails>\u003Csummary>贾涛·顾、王宇阳、张一哲等\u003C\u002Fsummary>贾涛·顾、王宇阳、张一哲、张启航、张丁怀、纳夫迪普·贾特利、乔什·萨斯金德、翟双飞\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08159)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=61bfd967a6ba21e50276e52f353fa74dd68990a6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FDART%3A-Denoising-Autoregressive-Transformer-for-Gu-Wang\u002F61bfd967a6ba21e50276e52f353fa74dd68990a6)\n\n\n\n+ 
**VILA-U：整合视觉理解和生成的统一基础模型**（2024年9月6日）\u003Cdetails>\u003Csummary>吴业成、张卓洋、陈俊宇等\u003C\u002Fsummary>吴业成、张卓洋、陈俊宇、唐浩天、李大成、方云昊、朱立耕、谢恩泽、尹宏旭、李毅、韩松、陆瑶\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.04429)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmit-han-lab\u002Fvila-u.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fvila-u)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fhanlab.mit.edu\u002Fprojects\u002Fvila-u)\n\n\n+ **OmniTokenizer：用于视觉生成的联合图像-视频分词器**（2024年6月13日）\u003Cdetails>\u003Csummary>王俊科、江毅、袁泽寰等\u003C\u002Fsummary>王俊科、江毅、袁泽寰、彭彬悦、吴祖轩、蒋宇刚\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09399)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=8613c1081a6ab34e2f980e35c06a1af461d7314e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FOmniTokenizer%3A-A-Joint-Image-Video-Tokenizer-for-Wang-Jiang\u002F8613c1081a6ab34e2f980e35c06a1af461d7314e)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFoundationVision\u002FOmniTokenizer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FFoundationVision\u002FOmniTokenizer)\n`tokenizer`\n\n+ **InstantUnify：将多模态LLM集成到扩散模型中**（2024年8月）\u003Cdetails>\u003Csummary>王奇勋、白旭、王睿等\u003C\u002Fsummary>王奇勋、白旭、王睿、王浩凡\u003C\u002Fdetails>\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinstantX-research\u002FInstantUnify.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FinstantX-research\u002FInstantUnify)\n\n\n+ **Show-o：一个单一Transformer统一多模态理解和生成**（2024年8月22日）\u003Cdetails>\u003Csummary>谢金恒、毛伟佳、白泽辰等\u003C\u002Fsummary>谢金恒、毛伟佳、白泽辰、大卫·张俊豪、王伟浩、林庆鸿、顾宇超、陈志杰、杨振亨、郑守迈\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.12528)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-49-blue.svg?paper=9dc337bd18bc0201839942931b12aff8ec1c93f5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FShow-o%3A-One-Single-Transformer-to-Unify-Multimodal-Xie-Mao\u002F9dc337bd18bc0201839942931b12aff8ec1c93f5)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002FShow-o.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FShow-o)\n\n+ **图像文本化：创建准确且详细图像描述的自动化框架**（2024年6月11日）\u003Cdetails>\u003Csummary>皮仁杰、张建树、张继鹏等\u003C\u002Fsummary>皮仁杰、张建树、张继鹏、潘锐、陈哲凯、张彤\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07502)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=91b4f447bb06d081a7947b42df57491a04fa46f9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F91b4f447bb06d081a7947b42df57491a04fa46f9)\n\n\n\n+ **T2S-GPT：基于动态向量量化，从文本自动生成手语**（2024年6月11日）\u003Cdetails>\u003Csummary>[ACL 2024] 
尹傲雄、李浩源、沈凯等\u003C\u002Fsummary>尹傲雄、李浩源、沈凯、汤思亮、庄玉婷\u003C\u002Fdetails>\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07119)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F186910d697bf7eb605aa055aee78fd91ce3ce9fe)\n\n\n+ **通过多模态提示进行开放世界人-物交互检测**（2024年6月11日）\u003Cdetails>\u003Csummary>杨杰、李炳良、曾爱玲等\u003C\u002Fsummary>杨杰、李炳良、曾爱玲、张雷、张瑞茂\u003C\u002Fdetails>\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07119)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F186910d697bf7eb605aa055aee78fd91ce3ce9fe)\n\n\n+ **常识-T2I挑战：文本到图像生成模型能否理解常识？**（2024年6月11日）\u003Cdetails>\u003Csummary>傅兴宇、何木雨、陆宇洁等\u003C\u002Fsummary>傅兴宇、何木雨、陆宇洁、威廉·杨王、丹·罗斯\u003C\u002Fdetails>\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07221v1)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff0acf2a2293d963c3786e83bb198c75612adc446)\n\n\n\n+ **一张图像对于重建和生成而言，价值相当于32个token**（2024年6月11日）\u003Cdetails>\u003Csummary>于启航、马克·韦伯、邓雪晴等\u003C\u002Fsummary>于启航、马克·韦伯、邓雪晴、申晓辉、丹尼尔·克雷默斯、陈良杰\u003C\u002Fdetails>\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07550)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1e31f4a4ccfc0d1e461be05361d77b5e045f4d37)\n\n+ **TRINS：迈向可阅读的多模态语言模型**（2024年6月10日）\u003Cdetails>\u003Csummary>[CVPR 2024] 张睿毅、张彦哲、陈健等\u003C\u002Fsummary> 张睿毅、张彦哲、陈健、周宇凡、顾九翔、陈昌友、孙彤\u003C\u002Fdetails>\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.06730)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F784260e953b2ce34df52086804681e35d57e5843)\n\n+ **[LlamaGen] 自回归模型胜过扩散模型：用于可扩展图像生成的Llama**（2024年6月10日）\u003Cdetails>\u003Csummary>孙培泽、蒋毅、陈寿发等\u003C\u002Fsummary>孙培泽、蒋毅、陈寿发、张士龙、彭冰悦、罗平、袁泽寰\u003C\u002Fdetails>\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.06525)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-42-blue.svg?paper=b15e6e2b1d81bc110f8fc98c3caf2e25e2512539)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FAutoregressive-Model-Beats-Diffusion%3A-Llama-for-Sun-Jiang\u002Fb15e6e2b1d81bc110f8fc98c3caf2e25e2512539)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFoundationVision\u002FLlamaGen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FFoundationVision\u002FLlamaGen)\n\n\n\n\u003C!-- + **Genie：生成式交互环境**（2024年2月26日）\u003Cdetails>\u003Csummary>杰克·布鲁斯、迈克尔·丹尼斯、阿什利·爱德华兹等\u003C\u002Fsummary> 
杰克·布鲁斯、迈克尔·丹尼斯、阿什利·爱德华兹、杰克·帕克-霍尔德、石宇格、爱德华·休斯、马修·赖、阿迪蒂·马瓦兰卡尔、里奇·施泰格瓦尔德、克里斯·阿普斯、尤苏夫·艾塔尔、萨拉·贝希特勒、费里亚尔·贝赫巴哈尼、斯蒂芬妮·陈、尼古拉斯·希斯、露西·冈萨雷斯、西蒙·奥辛德罗、谢尔吉尔·奥扎伊尔、斯科特·里德、张景伟、孔拉德·佐尔纳、杰夫·克鲁恩、南多·德弗雷塔斯、萨廷德·辛格、蒂姆·罗克特舍尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.15391v1) -->\n\n+ **Chameleon：混合模态早期融合基础模型**（2024年5月16日）\u003Cdetails>\u003Csummary>Chameleon团队\u003C\u002Fsummary>Chameleon团队\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.09818)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=32112b798f70faab00e14806f51d46058cf5e597)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FChameleon%3A-Mixed-Modal-Early-Fusion-Foundation-Team\u002F32112b798f70faab00e14806f51d46058cf5e597)\n\n+ **SEED-X：具有统一多粒度理解与生成能力的多模态模型**（2024年4月22日）\u003Cdetails>\u003Csummary>葛雨莹、赵思杰、朱金国等\u003C\u002Fsummary>葛雨莹、赵思杰、朱金国、葛一骁、易坤、宋琳、李晨、丁晓涵、单颖\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14396)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-44-blue.svg?paper=ef03c907ee24e2e6c0f2d24639551c82862d1080)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FSEED-X%3A-Multimodal-Models-with-Unified-and-Ge-Zhao\u002Fef03c907ee24e2e6c0f2d24639551c82862d1080)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FSEED-X.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FSEED-X)\n\n\n+ **大型多模态模型辅助平面设计**（2024年4月22日）\u003Cdetails>\u003Csummary>程宇涛、张钊、杨茂科等\u003C\u002Fsummary> 程宇涛、张钊、杨茂科、聂辉、李春元、吴兴隆和邵杰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14368)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=9259b476c31ba52b7e9ed059e5fbce2125092738)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9259b476c31ba52b7e9ed059e5fbce2125092738)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgraphic-design-ai\u002Fgraphist.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgraphic-design-ai\u002Fgraphist)\n\n\n+ **PMG：基于大型语言模型的个性化多模态生成**（2024年4月7日）\u003Cdetails>\u003Csummary>申晓腾、张锐、赵晓燕等\u003C\u002Fsummary>申晓腾、张锐、赵晓燕、朱继明、肖曦\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.08677)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=cfb9eba1b5c55bb0052df41eaaff8716f9c420bd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FPMG-%3A-Personalized-Multimodal-Generation-with-Large-Shen-Zhang\u002Fcfb9eba1b5c55bb0052df41eaaff8716f9c420bd)\n\n\n+ 
**MineDreamer：通过想象链学习指令以实现模拟世界控制**（2024年3月19日）\u003Cdetails>\u003Csummary>周恩深、秦怡然、尹振飞等\u003C\u002Fsummary>周恩深、秦怡然、尹振飞、黄宇舟、张瑞茂、盛陆、乔宇、邵静\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.12037)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=ae06df762adcc4221162e83a737ea63cff47e65d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fae06df762adcc4221162e83a737ea63cff47e65d)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZhoues\u002FMineDreamer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FZhoues\u002FMineDreamer)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsites.google.com\u002Fview\u002Fminedreamer\u002Fmain)\n\n\n\n+ **ELLA：为扩散模型配备LLM以增强语义对齐**（2024年3月8日）\u003Cdetails>\u003Csummary>胡锡伟、王睿、方一骁等\u003C\u002Fsummary>胡锡伟、王睿、方一骁、傅斌、程沛、于刚\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05135)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fda0d382c7fa981ba185ca633868442b75cb76de6)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FELLA-Diffusion\u002FELLA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FELLA-Diffusion\u002FELLA)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fella-diffusion.github.io\u002F)\n\n\n+ **StrokeNUWA：用于矢量图形合成的笔画分词**（2024年1月30日）\u003Cdetails>\u003Csummary>唐泽成、吴晨菲、张泽凯等\u003C\u002Fsummary>唐泽成、吴晨菲、张泽凯、倪明恒、殷圣明、刘宇、杨正源、王丽娟、刘子诚、李俊涛、段楠\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.17093)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=b2f6830afe63eb477294f17f0d3a6923135950f9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FStrokeNUWA%3A-Tokenizing-Strokes-for-Vector-Graphic-Tang-Wu\u002Fb2f6830afe63eb477294f17f0d3a6923135950f9)\n`tokenizer`\n\n+ **DiffusionGPT：由大语言模型驱动的文本到图像生成系统**（2024年1月18日）\u003Cdetails>\u003Csummary>秦杰、吴杰、陈伟峰等\u003C\u002Fsummary> 秦杰、吴杰、陈伟峰、任宇希、李慧霞、吴和峰、肖雪峰、王锐、温士磊\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10061)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=d4b1a1c62a03ccffcf24983eb4fe22335cbb89b6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd4b1a1c62a03ccffcf24983eb4fe22335cbb89b6)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDiffusionGPT\u002FDiffusionGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FDiffusionGPT\u002FDiffusionGPT)\n\n\n+ **StarVector：从图像生成可扩展矢量图形代码**（2023年12月17日）\u003Cdetails>\u003Csummary>胡安·A·罗德里格斯、舒巴姆·阿加瓦尔、伊萨姆·H·拉拉吉等\u003C\u002Fsummary> 
胡安·A·罗德里格斯、舒巴姆·阿加瓦尔、伊萨姆·H·拉拉吉、保·罗德里格斯、大卫·巴斯克斯、克里斯托弗·帕尔、马可·佩德罗索利\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.11556)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=60d3ade5c0085f5de1f5ab944cc058c78706ac66)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F60d3ade5c0085f5de1f5ab944cc058c78706ac66)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjoanrod\u002Fstar-vector.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fjoanrod\u002Fstar-vector)\n\n\n+ **VL-GPT：用于视觉与语言理解及生成的生成式预训练Transformer**（2023年12月14日）\u003Cdetails>\u003Csummary>朱金国、丁晓涵、葛一骁等\u003C\u002Fsummary> 朱金国、丁晓涵、葛一骁、葛雨莹、赵思杰、赵恒爽、王晓华、单颖\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.09251)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=ea6982a936a2b263bbf46ff6eb27fc0b63fddaf7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fea6982a936a2b263bbf46ff6eb27fc0b63fddaf7)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FVL-GPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FVL-GPT)\n\n\n+ **StoryGPT-V：大型语言模型作为一致的故事可视化工具**（2023年12月13日）\u003Cdetails>\u003Csummary>沈小倩、穆罕默德·埃尔霍赛尼\u003C\u002Fsummary> 沈小倩、穆罕默德·埃尔霍赛尼\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.02252)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=e49cb2ab3a7990e3d05042197ae8b3fd934453de)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe49cb2ab3a7990e3d05042197ae8b3fd934453de)\n\n\n\n+ **GENIXER：赋能多模态大型语言模型成为强大的数据生成器**（2023年12月11日）\u003Cdetails>\u003Csummary>赵亨元、周攀、郑守迈克\u003C\u002Fsummary> 赵亨元、周攀、郑守迈克\u003C\u002Fdetails> \n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06731)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=cb2295766b2f8f35524f6a9f93ae39d948d50bd4)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcb2295766b2f8f35524f6a9f93ae39d948d50bd4)\n\n\n+ **文本到图像生成的定制化助手**（2023年12月5日）\u003Cdetails>\u003Csummary>周宇凡、张睿怡、顾九翔等\u003C\u002Fsummary> 周宇凡、张睿怡、顾九翔、孙彤\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03045)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=f30bb09dbd95845d792bdac217a9a652635ee8a5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff30bb09dbd95845d792bdac217a9a652635ee8a5)`customization`\n\n\n\n+ **ChatIllusion：高效对齐视觉指令模型的交错生成能力**（2023年11月29日） \u003Cdetails>\u003Csummary>迟晓伟、刘义江、蒋正凯等\u003C\u002Fsummary> 
迟晓伟、刘义江、蒋正凯、张荣宇、林子毅、张仁瑞、高鹏、傅朝友、张尚航、刘启峰、郭益科\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17963)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=22d55c52f43f59634586ab95fefbb7dba8c8b190)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F22d55c52f43f59634586ab95fefbb7dba8c8b190)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flitwellchi\u002FChatIllusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Flitwellchi\u002FChatIllusion)\n\n\n+ **DreamSync：将文本到图像生成与图像理解反馈对齐**（2023年11月29日） \u003Cdetails>\u003Csummary>孙娇、付德庆、胡宇诗等\u003C\u002Fsummary> 孙娇、付德庆、胡宇诗、王苏、拉西尼·罗伊、Juan Da-Cheng、达娜·阿隆、查尔斯·赫尔曼、斯约尔德·范·斯滕基斯特、兰杰·克里希纳、塞勒斯·拉斯奇安\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17946)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=d16f72b7be526dee5eb49e5afffeea2bddba5e66)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd16f72b7be526dee5eb49e5afffeea2bddba5e66)\n\n\n\n\n+ **COLE：面向平面设计的层次化生成框架**（2023年11月28日） \u003Cdetails>\u003Csummary>贾培东、李晨轩、刘泽宇等\u003C\u002Fsummary> 贾培东、李晨轩、刘泽宇、申一超、陈星如、袁玉辉、郑英琳、陈栋、李济、谢晓东、张尚航、郭百宁\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16974)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=8441c30ad4abdca9ee380aa6f22ffd731b10231b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8441c30ad4abdca9ee380aa6f22ffd731b10231b)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgraphic-design-generation.github.io\u002F)\n\n\n\n+ **TextDiffuser-2：释放语言模型在文本渲染中的强大能力**（2023年11月28日） \u003Cdetails>\u003Csummary>陈景业、黄宇潘、吕腾超等\u003C\u002Fsummary> 陈景业、黄宇潘、吕腾超、崔雷、陈启峰、魏福儒\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16465)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=1c6e2a4da1ead685a95c079751bf4d7a727d8180)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1c6e2a4da1ead685a95c079751bf4d7a727d8180)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjingyechen.github.io\u002Ftextdiffuser2\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Ftextdiffuser-2)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FJingyeChen22\u002FTextDiffuser-2)\n\n+ **LLMGA：基于多模态大语言模型的生成助手**（2023年11月27日）\u003Cdetails>\u003Csummary>夏彬、王世银、陶英凡等\u003C\u002Fsummary> 
夏彬、王世银、陶英凡、王一桐、贾佳亚\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16500)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=769a924d0af014acec326f50c15c5d70d258a969)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F769a924d0af014acec326f50c15c5d70d258a969)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLMGA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLMGA)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllmga.github.io\u002F)\n\n\n\n\n\n+ **自纠正的LLM控制扩散模型**（2023年11月27日）\u003Cdetails>\u003Csummary>吴宗翰、连龙、约瑟夫·E·冈萨雷斯等\u003C\u002Fsummary> 吴宗翰、连龙、约瑟夫·E·冈萨雷斯、李博毅、特雷弗·达雷尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16090)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=42c4315b5d2e33d7d9a0afdf84e6a47ccd7a700e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F42c4315b5d2e33d7d9a0afdf84e6a47ccd7a700e)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftsunghan-wu\u002FSLD.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ftsunghan-wu\u002FSLD)\n\n\n+ **[ParaDiffusion] 基于信息增强扩散模型的段落到图像生成**（2023年11月29日）\u003Cdetails>\u003Csummary>吴伟嘉、李壮、何业飞等\u003C\u002Fsummary> 吴伟嘉、李壮、何业飞、Mike Zheng Shou、沈春华、程乐乐、李燕、高婷婷、张迪、王中元\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.14284)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-19-blue.svg?paper=ed4f1b0f6c09f59d07e817e532a25f4d25e94dbc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fed4f1b0f6c09f59d07e817e532a25f4d25e94dbc)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fweijiawu\u002FParaDiffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fweijiawu\u002FParaDiffusion)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fweijiawu.github.io\u002FParaDiffusionPage\u002F)\n\n\n+ **为多模态大语言模型对所有内容进行分词和嵌入**（2023年11月8日）\u003Cdetails>\u003Csummary>杨振、张颖雪、孟凡东等\u003C\u002Fsummary> 杨振、张颖雪、孟凡东、周杰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.04589)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=59d716b442ab760a78f58de6748c0fa1d507bfc1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F59d716b442ab760a78f58de6748c0fa1d507bfc1)\n`tokenizer`\n\n\n+ **WordArt设计师：利用大语言模型实现用户驱动的艺术字体合成**（2023年10月20日）\u003Cdetails>\u003Csummary>何俊彦、程志奇、李晨阳等\u003C\u002Fsummary> 何俊彦、程志奇、李晨阳、孙京东、向望蒙、林贤辉、康晓阳、金增科、胡宇森、罗斌、耿义峰、谢宣松、周景仁\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.18332)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=58b77dc0603eb52559d98a383bf9649fd31d0bc5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F58b77dc0603eb52559d98a383bf9649fd31d0bc5)\n\n\n+ **LLM蓝图：通过复杂而详细的提示实现文本到图像生成**（2023年10月16日）\u003Cdetails>\u003Csummary>[ICLR 2024] 哈南·加尼、沙里克·法鲁克·巴特、穆扎马尔·纳西尔等\u003C\u002Fsummary> 
哈南·加尼、沙里克·法鲁克·巴特、穆扎马尔·纳西尔、萨尔曼·汗、彼得·翁卡\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.10640)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=4cb2c262ce34f41974f1b1623fc5a6e32956ded3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4cb2c262ce34f41974f1b1623fc5a6e32956ded3)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhananshafi\u002Fllmblueprint.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhananshafi\u002Fllmblueprint)\n\n\n\n+ **让多模态生成更简单：当扩散模型遇到大语言模型**（2023年10月13日）\u003Cdetails>\u003Csummary>赵翔宇、刘波、刘琪琼等\u003C\u002Fsummary> 赵翔宇、刘波、刘琪琼、史广源、吴小明\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08949v1)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=833cdd713c27ab5899bb912a1d511c10af61cefb)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F833cdd713c27ab5899bb912a1d511c10af61cefb)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzxy556677\u002FEasyGen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzxy556677\u002FEasyGen)\n\n\n+ **Idea2Img：利用GPT-4V（vision）进行迭代自我精炼，实现自动图像设计与生成**（2023年10月12日）\u003Cdetails>\u003Csummary>杨正元、王建峰、李林洁等\u003C\u002Fsummary> 杨正元、王建峰、李林洁、凯文·林、林忠清、刘子成、王丽娟\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08541)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=1d14a708622917da4b9820ada6d32af24fc1651a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1d14a708622917da4b9820ada6d32af24fc1651a)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fidea2img.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzyang-ur\u002FIdea2Img.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzyang-ur\u002FIdea2Img)\n\n\n+ **OpenLEAF：开放域交叉图像-文本生成与评估**（2023年10月11日）\u003Cdetails>\u003Csummary>安杰、杨正元、李林洁等\u003C\u002Fsummary> 安杰、杨正元、李林洁、王建峰、凯文·林、刘子成、王丽娟、罗继波\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07749)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=7f1ba5630c3baa09b11cc665b3f71cdb117e5ffb)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7f1ba5630c3baa09b11cc665b3f71cdb117e5ffb)\n\n\n\n+ **Mini-DALLE3：通过提示大语言模型实现交互式文本到图像生成**（2023年10月11日）\u003Cdetails>\u003Csummary>赖泽强、朱锡洲、戴继峰等\u003C\u002Fsummary> 赖泽强、朱锡洲、戴继峰、乔宇、王文海\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07653)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=f669d7a6fab0147253178a6fc854e05e3d92fb3f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff669d7a6fab0147253178a6fc854e05e3d92fb3f)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fminidalle3.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZeqiang-Lai\u002FMini-DALLE3.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FZeqiang-Lai\u002FMini-DALLE3)\n\n+ **[DALL-E 3] 通过更优质的描述文本提升图像生成效果** 
\u003Cdetails>\u003Csummary>詹姆斯·贝特克、加布里埃尔·戈、李静等\u003C\u002Fsummary>詹姆斯·贝特克、加布里埃尔·戈、李静、蒂姆·布鲁克斯、王建峰、李林杰、龙欧阳、庄俊堂、乔伊斯·李、郭宇飞、韦萨姆·马纳萨拉、普拉富拉·达里瓦尔、凯西·楚、焦云鑫、阿迪提亚·拉梅什\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fdall-e-3.pdf)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-146-blue.svg?paper=cfee1826dd4743eab44c6e27a0cc5970effa4d80)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcfee1826dd4743eab44c6e27a0cc5970effa4d80)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fopenai.com\u002Fdall-e-3)\n\n\n+ **MiniGPT-5：基于生成式 Voken 的视觉与语言交替生成**（2023年10月3日）\\\n郑凯志、何学海、王新埃里克。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02239)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-26-blue.svg?paper=e7d09b6f2bc878cf2c993acf675f409d0b55f35a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe7d09b6f2bc878cf2c993acf675f409d0b55f35a)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Feric-ai-lab.github.io\u002Fminigpt-5.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Feric-ai-lab\u002FMiniGPT-5.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Feric-ai-lab\u002FMiniGPT-5)\n\n\n+ **借助 SEED 分词器让 LLaMA “看见”并“绘画”**（2023年10月2日）\u003Cdetails>\u003Csummary>葛雨莹、赵思杰、曾子云等\u003C\u002Fsummary>葛雨莹、赵思杰、曾子云、葛一骁、李晨、王新涛、单颖\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01218)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-19-blue.svg?paper=5ba1525dc6d382ee0a4a1ca3c64fc5907ca64c67)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5ba1525dc6d382ee0a4a1ca3c64fc5907ca64c67)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002Fseed\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FSEED.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FSEED)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fdad1ed9a9fb76fe83b.gradio.live\u002F)\n`分词器`\n\n\n+ **InstructCV：作为视觉通用模型的指令微调文本到图像扩散模型**（2023年9月30日）\u003Cdetails>\u003Csummary>甘玉露、朴成佑、亚历山大·舒伯特等\u003C\u002Fsummary>甘玉露、朴成佑、亚历山大·舒伯特、安东尼·菲利帕基斯、艾哈迈德·M·阿拉\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00390)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=819f477065088220a6f706cd9ef76dbcb4b4c134)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F819f477065088220a6f706cd9ef76dbcb4b4c134)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlaaLab\u002FInstructCV.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAlaaLab\u002FInstructCV)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Falaa-lab\u002FInstructCV)\n\n+ 
**InternLM-XComposer：用于高级文本-图像理解与创作的视觉-语言大模型**（2023年9月26日）\u003Cdetails>\u003Csummary>张攀、董晓义、王斌等\u003C\u002Fsummary>张攀、董晓义、王斌、曹宇航、徐超、欧阳林科、赵志远、段浩东、张松阳、丁双瑞、张文伟、严航、张欣悦、李伟、李静雯、陈凯、何聪辉、张兴成、乔宇、林大华、王佳琪\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15112)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-48-blue.svg?paper=c1e450284e7d6cac1855330a1197df8537df653f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc1e450284e7d6cac1855330a1197df8537df653f)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer)\n\n\n\n\n+ **抽象概念的文本到图像生成**（2023年9月26日） \u003Cdetails>\u003Csummary>廖嘉怡、陈旭、傅强等\u003C\u002Fsummary>廖嘉怡、陈旭、傅强、杜伦、何湘南、王翔、韩世、张冬梅\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.14623)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=0d38f1edac66b4645cf5fa05abaf9d92cba5d5d3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0d38f1edac66b4645cf5fa05abaf9d92cba5d5d3)\n\n\n\n\n+ **DreamLLM：协同的多模态理解与创作**（2023年9月20日）\u003Cdetails>\u003Csummary>[ICLR 2024] 董润沛、韩春锐、彭元等\u003C\u002Fsummary>董润沛、韩春锐、彭元、齐泽坤、葛正、杨金荣、赵亮、孙建建、周洪宇、魏浩然、孔祥文、张祥宇、马凯胜、李毅\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.11499)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-38-blue.svg?paper=7b689adb8c156d6158660f90d1c86888ee281f63)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7b689adb8c156d6158660f90d1c86888ee281f63)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdreamllm.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRunpeiDong\u002FDreamLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FRunpeiDong\u002FDreamLLM)\n\n\n+ **SwitchGPT：将大型语言模型适配为非文本输出**（2023年9月14日）\\\n王新宇、庄博涵、吴奇。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.07623)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=366564d210768814bc880e391b909cfbd95f8964)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F366564d210768814bc880e391b909cfbd95f8964)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxinke-wang\u002FSwitchGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fxinke-wang\u002FSwitchGPT)\n\n+ 
**NExT-GPT：任意模态之间的多模态大模型**（2023年9月11日）\u003Cdetails>\u003Csummary>吴圣琼、费浩、屈雷刚等\u003C\u002Fsummary>吴圣琼、费浩、屈雷刚、季伟、蔡添顺\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05519)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-94-blue.svg?paper=fa75a55760e6ea49b39b83cb85c99a22e1088254)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffa75a55760e6ea49b39b83cb85c99a22e1088254)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fnext-gpt.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNExT-GPT\u002FNExT-GPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNExT-GPT\u002FNExT-GPT)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002F9704af1b453125102e.gradio.live\u002F)\n\n+ **LayoutLLM-T2I：从大型语言模型中提取布局指导以进行文本到图像生成**（2023年8月9日）\u003Cdetails>\u003Csummary>[ACM MM 2023] 屈雷刚、吴圣琼、费浩等\u003C\u002Fsummary>屈雷刚、吴圣琼、费浩、聂立强、蔡达生\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.05095)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=7d78238a9bad60433d616abdd93c735087d99670)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7d78238a9bad60433d616abdd93c735087d99670)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flayoutllm-t2i.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLayoutLLM-T2I\u002FLayoutLLM-T2I.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLayoutLLM-T2I\u002FLayoutLLM-T2I)\n\n+ **在大型语言模型中播下视觉的种子**（2023年7月16日）\u003Cdetails>\u003Csummary>葛雨莹、葛一骁、曾子云等\u003C\u002Fsummary>葛雨莹、葛一骁、曾子云、王新涛、单颖\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.08041)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-30-blue.svg?paper=40298b8d50109c52fc10763eddc64a07cf8acb31)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F40298b8d50109c52fc10763eddc64a07cf8acb31)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002Fseed\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FSEED.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FSEED)\n\n+ **多模态中的生成式预训练**（2023年7月11日）\u003Cdetails>\u003Csummary>孙权、余琪英、崔宇峰等\u003C\u002Fsummary>孙权、余琪英、崔宇峰、张帆、张晓松、王悦泽、高洪成、刘静静、黄铁军、王欣龙\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.05222)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-43-blue.svg?paper=94053805cd59f2e9a47fe3f080c7e7afefb337cc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F94053805cd59f2e9a47fe3f080c7e7afefb337cc)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](http:\u002F\u002F218.91.113.230:9002)\n\n+ **SPAE：用冻结的大型语言模型进行多模态生成的语义金字塔自编码器**（2023年6月30日） \u003Cdetails>\u003Csummary>[NeurIPS 2023 Spotlight] 刘俊宇、程勇、王志若等\u003C\u002Fsummary>刘俊宇、程勇、王志若、维韦克·库马尔、沃尔夫冈·马赫雷、黄彦平、大卫·A·罗斯、伊尔凡·埃萨、约纳坦·比斯克、杨明轩、凯文·墨菲、亚历山大·G·豪普特曼、江璐\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.17842)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=376f494126d1ea4f571ea0263c43ac2b6331800a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F376f494126d1ea4f571ea0263c43ac2b6331800a)\n\n+ **使用GPT-4进行可控的文本到图像生成**（2023年5月29日） \u003Cdetails>\u003Csummary>张天骏、张毅、维巴夫·维尼特等\u003C\u002Fsummary>张天骏、张毅、维巴夫·维尼特、尼尔·乔希、王鑫\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18583)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-20-blue.svg?paper=3a79545719fb193a6b4042ef7d1d87cfd267be06)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3a79545719fb193a6b4042ef7d1d87cfd267be06)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002Ftianjunz\u002FControl-GPT)\n\n+ **用多模态语言模型生成图像**（2023年5月26日）\\\n[NeurIPS 2023] 科赫、景宇、丹尼尔·弗里德和鲁斯兰·萨拉胡丁诺夫。 \\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17216)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-73-blue.svg?paper=6fb5c0eff3696ef252aca9638e10176ecce7cecb)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6fb5c0eff3696ef252aca9638e10176ecce7cecb)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjykoh.com\u002Fgill)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkohjingyu\u002Fgill.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fkohjingyu\u002Fgill)\n\n+ **LayoutGPT：利用大型语言模型进行组合式的视觉规划与生成**（2023年5月24日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] 冯伟熙、朱婉蓉、傅次睿等\u003C\u002Fsummary>冯伟熙、朱婉蓉、傅次睿、瓦伦·詹帕尼、阿俊·阿库拉、何学海、苏加托·巴苏、王欣艾瑞克、威廉·杨·王\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.15393)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-31-blue.svg?paper=66d755730f5d08a6f4fcc5e81f24982ba389dca9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F66d755730f5d08a6f4fcc5e81f24982ba389dca9)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flayoutgpt.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fweixi-feng\u002FLayoutGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fweixi-feng\u002FLayoutGPT)\n\n+ **用于文本到图像生成与评估的视觉编程**（2023年5月24日）\\\n[NeurIPS 2023] 曹载民、阿贝·扎拉、莫希特·班萨尔。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.15328)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-24-blue.svg?paper=9837349417e36ef5be06da0fd6c74042148bdaa2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9837349417e36ef5be06da0fd6c74042148bdaa2)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvp-t2i.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fj-min\u002FVPGen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fj-min\u002FVPGen)\n\n+ **LLM驱动的扩散模型：利用大型语言模型增强文本到图像扩散模型的提示理解能力**（2023年5月23日） \u003Cdetails>\u003Csummary>连龙、李博毅、亚当·雅拉等\u003C\u002Fsummary>连龙、李博毅、亚当·雅拉、特雷弗·达雷尔\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13655)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-39-blue.svg?paper=e9ae0c76a71b8f302eb17b1c4462b9cc97d87cd0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe9ae0c76a71b8f302eb17b1c4462b9cc97d87cd0)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllm-grounded-diffusion.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTonyLianLong\u002FLLM-groundedDiffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTonyLianLong\u002FLLM-groundedDiffusion)\n\n+ **基于LLMs-AIGCs协作的系统性视觉适配交互式数据合成**（2023年5月22日）\u003Cdetails>\u003Csummary>于齐凡、李俊成、叶文涛等\u003C\u002Fsummary>于齐凡、李俊成、叶文涛、唐思亮、庄宇婷\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.12799)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=43a55dbd95c9d5cd82de8db276f41adeec4a937d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F43a55dbd95c9d5cd82de8db276f41adeec4a937d)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuqifan1117\u002FLabal-Anything-Pipeline.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYuqifan1117\u002FLabal-Anything-Pipeline)\n\n+ **LLMScore：揭示大型语言模型在文本到图像合成评估中的强大能力**（2023年5月18日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] 陆宇杰、杨贤俊、李秀军等\u003C\u002Fsummary>陆宇杰、杨贤俊、李秀军、王新埃里克、威廉·杨·王\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11116)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=972501b057e2b84d6ce6506f70bcac697bab7872)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F972501b057e2b84d6ce6506f70bcac697bab7872)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYujieLu10\u002FLLMScore.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYujieLu10\u002FLLMScore)\n\n+ **SUR-adapter：利用大型语言模型增强文本到图像预训练扩散模型**（2023年5月9日）\u003Cdetails>\u003Csummary>[ACM MM 2023] 钟珊珊、黄中展、温武绍等\u003C\u002Fsummary>钟珊珊、黄中展、温武绍、秦景辉、林亮\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.05189)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQrange-group\u002FSUR-adapter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FQrange-group\u002FSUR-adapter)\n\n+ **将语言模型与图像对齐以实现多模态输入和输出**（2023年1月31日）\\\n[ICML 2023] 科赫、景宇、鲁斯兰·萨拉胡丁诺夫和丹尼尔·弗里德。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.13823)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-35-blue.svg?paper=6173520a1eb2814d067e8c5fd16212b7cbf6ee78)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6173520a1eb2814d067e8c5fd16212b7cbf6ee78)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjykoh.com\u002Ffromage)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkohjingyu\u002Ffromage.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fkohjingyu\u002Ffromage)\n\n+ **[RPG-DiffusionMaster] 
掌握文本到图像扩散：使用多模态LLM进行重新描述、规划和生成**（2024年1月22日） \u003Cdetails>\u003Csummary>[ICML 2024] 杨凌、于兆辰、孟晨琳等\u003C\u002Fsummary>杨凌、于兆辰、孟晨琳、徐敏凯、斯特法诺·埃尔蒙、崔斌\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.11708)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-23-blue.svg?paper=140cfda71bfff852c3e205b7ad61854b78c76982)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F140cfda71bfff852c3e205b7ad61854b78c76982)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYangLing0818\u002FRPG-DiffusionMaster.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYangLing0818\u002FRPG-DiffusionMaster)\n\n+ **RealCompo：平衡真实感与组合性可提升文本到图像扩散模型性能**（2024年2月20日）\u003Cdetails>\u003Csummary>张欣晨、杨凌、蔡雅琪等\u003C\u002Fsummary>张欣晨、杨凌、蔡雅琪、于兆辰、王凯妮、谢佳科、田烨、徐敏凯、唐勇、杨友久、崔斌\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12908)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=9c2ba04c376f127da506b63c566887fca2861b25)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9c2ba04c376f127da506b63c566887fca2861b25)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcominclip.github.io\u002FRealCompo_Page\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYangLing0818\u002FRealCompo.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYangLing0818\u002FRealCompo)\n\n\n\n### 非LLM类（CLIP\u002FT5）\n+ **Edify Image：基于像素空间拉普拉斯扩散模型的高质量图像生成**（2024年11月11日）\u003Cdetails>\u003Csummary>NVIDIA：尤瓦尔·阿茨蒙、马切伊·巴拉、约格什·巴拉吉等\u003C\u002Fsummary>NVIDIA：尤瓦尔·阿茨蒙、马切伊·巴拉、约格什·巴拉吉、蒂芙尼·蔡、尹翠、焦娇凡、葛云浩、西达尔特·古鲁拉尼、雅各布·哈夫曼、罗纳德·艾萨克、波亚·詹纳蒂、泰罗·卡拉斯、格蕾丝·拉姆、J. P. 
路易斯、亚伦·利卡塔、林颜辰、刘明宇、马千莉、阿伦·马利亚、阿什莉·马蒂诺-塔尔、道格·门德斯、承俊娜、克里斯·普鲁特、菲茨姆·雷达、宋贾明、王廷春、魏方银、曾晓辉、曾宇、张秦生\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.06959)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fdir\u002Fedify-image\u002F)\n\n+ **InstantStyle：文本到图像生成中风格保留的免费午餐**（2024年4月3日）\u003Cdetails>\u003Csummary>王浩帆、马泰奥·斯皮内利、王奇勋等\u003C\u002Fsummary>王浩帆、马泰奥·斯皮内利、王奇勋、白旭、秦泽奎、安东尼·陈\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.02733)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=6b5fc164c4f21e4a4f151df60bfd5e32b061a903)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FInstantStyle%3A-Free-Lunch-towards-Style-Preserving-Wang-Spinelli\u002F6b5fc164c4f21e4a4f151df60bfd5e32b061a903)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Finstantstyle.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinstantX-research\u002FInstantStyle.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FinstantX-research\u002FInstantStyle)\n\n+ **InstantID：零样本身份保留生成，几秒钟内完成**（2024年1月15日）\u003Cdetails>\u003Csummary>王奇勋、白旭、王浩帆等\u003C\u002Fsummary>王奇勋、白旭、王浩帆、秦泽奎、安东尼·陈、李华夏、唐旭、胡耀\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.07519)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=0f9b66c9208b11369e9d94d85b7dc23bcc5115e9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FInstantID%3A-Zero-shot-Identity-Preserving-Generation-Wang-Bai\u002F0f9b66c9208b11369e9d94d85b7dc23bcc5115e9)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Finstantid.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinstantX-research\u002FInstantID.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FinstantX-research\u002FInstantID)\n\n+ **PIXART-α：用于照片级真实感文本到图像合成的扩散Transformer快速训练**（2023年9月30日）\u003Cdetails>\u003Csummary>[ICLR 2024] 陈俊松、于金成、葛崇健等\u003C\u002Fsummary>陈俊松、于金成、葛崇健、姚乐威、谢恩泽、吴岳、王中道、郭明达、罗平、陆虎川、李振国\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00426)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=7dfe1c9f1d7120102499c7e561efc2326e7a0358)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7dfe1c9f1d7120102499c7e561efc2326e7a0358)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fpixart-alpha.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPixArt-alpha\u002FPixArt-alpha.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPixArt-alpha\u002FPixArt-alpha)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FPixArt-alpha\u002FPixArt-alpha)\n\n+ **TextDiffuser：作为文本画家的扩散模型**（2023年5月18日） \u003Cdetails>\u003Csummary>[NeurIPS 2023] 
陈景业、黄宇攀、吕腾超等\u003C\u002Fsummary>陈景业、黄宇攀、吕腾超、崔磊、陈奇峰、魏福如\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10855)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=e779781f1bea273573fc9d3f1a5e874bcff2cd2b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe779781f1bea273573fc9d3f1a5e874bcff2cd2b)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjingyechen.github.io\u002Ftextdiffuser\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Ftextdiffuser)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FJingyeChen22\u002FTextDiffuser)\n\n+ **TiGAN：基于文本的交互式图像生成与操控**（2022年12月）\u003Cdetails>\u003Csummary>[AAAI 2022] 周宇凡、张睿毅、顾九翔等\u003C\u002Fsummary>周宇凡、张睿毅、顾九翔、克里斯·滕斯迈尔、于彤、陈昌友、徐锦辉、孙彤\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fojs.aaai.org\u002Findex.php\u002FAAAI\u002Farticle\u002Fview\u002F20270)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=839dc73c1adae268144d9cfb9d70985b2001304f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F839dc73c1adae268144d9cfb9d70985b2001304f)\n标签：`交互`\n\n+ **文本到图像扩散模型的多概念定制化**（2022年12月8日）\u003Cdetails>\u003Csummary>[CVPR 2023] 努普尔·库玛丽、张炳亮、理查德·张等\u003C\u002Fsummary>努普尔·库玛丽、张炳亮、理查德·张、埃利·谢赫特曼、朱俊彦\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.04488)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-307-blue.svg?paper=144eca44e250cc462f6fc3a172abb865978f66f5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F144eca44e250cc462f6fc3a172abb865978f66f5)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.cs.cmu.edu\u002F~custom-diffusion\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fadobe-research\u002Fcustom-diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fadobe-research\u002Fcustom-diffusion)\\\n标签：`定制化`\n\n+ **DreamBooth：针对特定主题生成的文本到图像扩散模型微调**（2022年8月25日）\u003Cdetails>\u003Csummary>[CVPR 2023] 纳塔尼尔·鲁伊斯、李元珍、瓦伦·詹帕尼等\u003C\u002Fsummary>纳塔尼尔·鲁伊斯、李元珍、瓦伦·詹帕尼、雅埃尔·普里奇、迈克尔·鲁宾斯坦、基菲尔·阿伯曼\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.12242)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1090-blue.svg?paper=5b19bf6c3f4b25cac96362c98b930cf4b37f6744)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5b19bf6c3f4b25cac96362c98b930cf4b37f6744)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdreambooth.github.io\u002F)\\\n标签：`定制化`\n\n+ 
**一张图胜过千言万语：利用文本反演个性化文本到图像生成**（2022年8月2日）\u003Cdetails>\u003Csummary>里农·加尔、尤瓦尔·阿拉卢夫、尤瓦尔·阿茨蒙等\u003C\u002Fsummary>里农·加尔、尤瓦尔·阿拉卢夫、尤瓦尔·阿茨蒙、奥尔·帕塔什尼克、阿米特·H·贝尔马诺、加尔·切奇克、丹尼尔·科恩-奥尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.01618)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ftextual-inversion.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frinongal\u002Ftextual_inversion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Frinongal\u002Ftextual_inversion)\\\n标签：`定制化`\n\n+ **具有深度语言理解的照片级真实感文本到图像扩散模型**（2022年5月23日）\\\n[NeurIPS 2022] \u003Cdetails>\u003Csummary>奇特万·萨哈里亚、威廉·钱、索拉布·萨克塞纳等\u003C\u002Fsummary>奇特万·萨哈里亚、威廉·钱、索拉布·萨克塞纳、拉拉·李、杰伊·黄、艾米丽·L·登顿、卡米亚尔·加塞米普尔、拉斐尔·贡蒂霍·洛佩斯、布尔丘·卡拉戈尔·阿扬、蒂姆·萨利曼斯等\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.11487)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2844-blue.svg?paper=9695824d7a01fad57ba9c01d7d76a519d78d65e7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9695824d7a01fad57ba9c01d7d76a519d78d65e7)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fimagen.research.google\u002F)\n\n+ **基于潜在扩散模型的高分辨率图像合成**（2021年12月20日）\\\n[CVPR 2022（口头报告）] \u003Cdetails>\u003Csummary>罗宾·隆巴赫、安德烈亚斯·布拉特曼、多米尼克·洛伦茨等\u003C\u002Fsummary>罗宾·隆巴赫、安德烈亚斯·布拉特曼、多米尼克·洛伦茨、帕特里克·埃瑟、比约恩·奥默\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.10752)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5641-blue.svg?paper=c10075b3746a9f3dd5811970e93c8ca3ad39b39d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc10075b3746a9f3dd5811970e93c8ca3ad39b39d)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fommer-lab.com\u002Fresearch\u002Flatent-diffusion-models\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCompVis\u002Fstable-diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FCompVis\u002Fstable-diffusion)\n\n\n\n### 数据集\n\n+ **MIMIC-IT：多模态上下文指令微调**（2023年6月8日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] Bo Li、Yuanhan Zhang、Liangyu Chen 等\u003C\u002Fsummary>Bo Li、Yuanhan Zhang、Liangyu Chen、Jinghao Wang、Fanyi Pu、Jingkang Yang、Chunyuan Li、Ziwei Liu\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05425)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-78-blue.svg?paper=d47524cd5c3c4b57af2e5a29f6f91c420310f236)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd47524cd5c3c4b57af2e5a29f6f91c420310f236)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLuodian\u002Fotter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLuodian\u002Fotter)\n\n+ **[LAION-Glyph] GlyphControl：用于视觉文本生成的字形条件控制**（2023年5月29日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] Yukang Yang、Dongnan Gui、Yuhui Yuan 
等\u003C\u002Fsummary>Yukang Yang、Dongnan Gui、Yuhui Yuan、Weicong Liang、Haisong Ding、Han Hu、Kai Chen\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18259)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=5fbe4c92791fbecb179c1ab79bba9a59b2e155ba)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5fbe4c92791fbecb179c1ab79bba9a59b2e155ba)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIGText\u002FGlyphControl-release.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAIGText\u002FGlyphControl-release)\n\n+ **[MARIO-10M] TextDiffuser：作为文本画家的扩散模型**（2023年5月18日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] Jingye Chen、Yupan Huang、Tengchao Lv 等\u003C\u002Fsummary>Jingye Chen、Yupan Huang、Tengchao Lv、Lei Cui、Qifeng Chen、Furu Wei\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10855)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=e779781f1bea273573fc9d3f1a5e874bcff2cd2b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe779781f1bea273573fc9d3f1a5e874bcff2cd2b)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjingyechen.github.io\u002Ftextdiffuser\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Funilm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm)\n\n+ **DataComp：寻找下一代多模态数据集**（2023年4月27日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] Samir Yitzhak Gadre、Gabriel Ilharco、Alex Fang 等\u003C\u002Fsummary>Samir Yitzhak Gadre、Gabriel Ilharco、Alex Fang、Jonathan Hayase、Georgios Smyrnis、Thao Nguyen、Ryan Marten、Mitchell Wortsman、Dhruba Ghosh、Jieyu Zhang、Eyal Orgad、Rahim Entezari、Giannis Daras、Sarah Pratt、Vivek Ramanujan、Yonatan Bitton、Kalyani Marathe、Stephen Mussmann、Richard Vencu、Mehdi Cherti、Ranjay Krishna、Pang Wei Koh、Olga Saukh、Alexander Ratner、Shuran Song、Hannaneh Hajishirzi、Ali Farhadi、Romain Beaumont、Sewoong Oh、Alex Dimakis、Jenia Jitsev、Yair Carmon、Vaishaal Shankar、Ludwig Schmidt\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.14108)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-103-blue.svg?paper=f9570989919338079088270a9cf1a7afc8db8093)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff9570989919338079088270a9cf1a7afc8db8093)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.datacomp.ai\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmlfoundations\u002Fdatacomp.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fdatacomp)\n\n+ **[LLaVA-Instruct] 视觉指令微调**（2023年4月17日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] Haotian Liu、Chunyuan Li、Qingyang Wu 等\u003C\u002Fsummary>Haotian Liu、Chunyuan Li、Qingyang Wu、Yong Jae 
Lee\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08485)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-820-blue.svg?paper=a5036f31f0e629dc661f120b8c3b1f374d479ab8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fa5036f31f0e629dc661f120b8c3b1f374d479ab8)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllava-vl.github.io\u002F\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaotian-liu\u002FLLaVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA)\n\n\n\n+ **多模态C4：一个开放的、十亿级规模的图文混合语料库**（2023年4月14日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] Wanrong Zhu、Jack Hessel、Anas Awadalla 等\u003C\u002Fsummary>Wanrong Zhu、Jack Hessel、Anas Awadalla、Samir Yitzhak Gadre、Jesse Dodge、Alex Fang、Youngjae Yu、Ludwig Schmidt、William Yang Wang、Yejin Choi\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.06939)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-64-blue.svg?paper=df958800014d310b6df34ad83d771314d68fbb2d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdf958800014d310b6df34ad83d771314d68fbb2d)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Fmmc4.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fmmc4)\n\n\n\n\n\n+ **语言并非一切：将感知与语言模型对齐**（2023年2月27日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] Shaohan Huang、Li Dong、Wenhui Wang 等\u003C\u002Fsummary>Shaohan Huang、Li Dong、Wenhui Wang、Yaru Hao、Saksham Singhal、Shuming Ma、Tengchao Lv、Lei Cui、Owais Khan Mohammed、Barun Patra、Qiang Liu、Kriti Aggarwal、Zewen Chi、Johan Bjorck、Vishrav Chaudhary、Subhojit Som、Xia Song、Furu Wei\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.14045)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-261-blue.svg?paper=fbfef4723d8c8467d7bd523e1d0b703cce0e0f9c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffbfef4723d8c8467d7bd523e1d0b703cce0e0f9c)\n\n\n\n+ **COYO-700M：图文对数据集**（2022年8月31日）\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkakaobrain\u002Fcoyo-dataset.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fcoyo-dataset)\n\n\n+ **LAION-5B：用于训练下一代图文模型的开放大型数据集**（2022年10月16日）\u003Cdetails>\u003Csummary>[NeurIPS 2022] Christoph Schuhmann、Romain Beaumont、Richard Vencu 等\u003C\u002Fsummary>Christoph Schuhmann、Romain Beaumont、Richard Vencu、Cade Gordon、Ross Wightman、Mehdi Cherti、Theo Coombes、Aarush Katta、Clayton Mullis、Mitchell Wortsman、Patrick Schramowski、Srivatsa Kundurthy、Katherine Crowson、Ludwig Schmidt、Robert Kaczmarczyk、Jenia Jitsev\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.08402)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1288-blue.svg?paper=e5c8960eb2ec034ffbd353ef39fd1cb541d3c7c9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe5c8960eb2ec034ffbd353ef39fd1cb541d3c7c9)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flaion.ai\u002Fblog\u002Flaion-5b\u002F)\n\n+ **LAION 
COCO：来自LAION2B-EN的6亿张合成字幕**（2022年9月15日）\u003Cdetails>\u003Csummary>克里斯托夫·舒曼、安德烈亚斯·科普、西奥·库姆布斯等\u003C\u002Fsummary>克里斯托夫·舒曼、安德烈亚斯·科普、西奥·库姆布斯、理查德·文库、本杰明·特罗姆、罗曼·博蒙\u003C\u002Fdetails>\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flaion.ai\u002Fblog\u002Flaion-coco\u002F)\n\n\n+ **[M3W] Flamingo：用于少样本学习的视觉语言模型**（2022年4月29日）\u003Cdetails>\u003Csummary>[NeurIPS 2022] 让-巴蒂斯特·阿拉伊拉克、杰夫·多纳休、保琳·吕克等\u003C\u002Fsummary>让-巴蒂斯特·阿拉伊拉克、杰夫·多纳休、保琳·吕克、安托万·米埃赫、伊恩·巴尔、雅娜·哈松、卡雷尔·伦茨、阿图尔·门施、凯蒂·米利坎、马尔科姆·雷诺兹、罗曼·林格、伊丽莎·卢瑟福、塞尔坎·卡比、韩腾达、龚志涛、萨纳·萨曼古伊、玛丽安娜·蒙特罗、雅各布·梅尼克、塞巴斯蒂安·博尔戈、安德鲁·布洛克、艾达·内马扎德、萨汉德·沙里夫扎德、米科拉·宾科夫斯基、里卡多·巴雷拉、奥里奥尔·维尼亚尔斯、安德鲁·齐瑟曼、卡伦·西莫尼扬\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.14198)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1483-blue.svg?paper=26218bdcc3945c7edae7aa2adbfba4cd820a2df3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F26218bdcc3945c7edae7aa2adbfba4cd820a2df3)\n\n\n\n\n+ **[LAION-FACE] 基于视觉-语言方式的通用人脸表征学习**（2021年12月6日）\u003Cdetails>\u003Csummary>[NeurIPS 2021] 郑英琳、杨浩、张婷等\u003C\u002Fsummary>郑英琳、杨浩、张婷、鲍建民、陈东东、黄阳宇、袁璐、陈栋、曾明、温芳\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.03109)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-70-blue.svg?paper=037bab9d26ef7da11ee32d7682836604d2cc8a72)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F037bab9d26ef7da11ee32d7682836604d2cc8a72)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFacePerceiver\u002FFaRL.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FFacePerceiver\u002FFaRL)\n\n\n\n\n+ **[LAION-400M] CLIP过滤后的4亿对图文开放数据集**（2021年11月3日）\u003Cdetails>\u003Csummary>[NeurIPS 2021] 克里斯托夫·舒曼、理查德·文库、罗曼·博蒙等\u003C\u002Fsummary>克里斯托夫·舒曼、理查德·文库、罗曼·博蒙、罗伯特·卡奇马尔奇克、克莱顿·穆利斯、阿鲁什·卡塔、西奥·库姆布斯、珍妮娅·吉采夫、阿拉恩·小松崎\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.02114)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-756-blue.svg?paper=b668ce936cff0b0ca8b635cd5f25a62eaf4eb3df)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb668ce936cff0b0ca8b635cd5f25a62eaf4eb3df)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flaion.ai\u002Flaion-400-open-dataset\u002F)\n\n\n\n\n+ **WIT：基于维基百科的多模态多语言机器学习图像文本数据集**（2021年3月2日）\u003Cdetails>\u003Csummary>[SIGIR 2021] 克里希纳·斯里尼瓦桑、卡尔蒂克·拉曼、陈洁超等\u003C\u002Fsummary>克里希纳·斯里尼瓦桑、卡尔蒂克·拉曼、陈洁超、迈克尔·本德斯基、马克·纳约尔克\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.01913)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-186-blue.svg?paper=98e565fa06f6c7bf7c46833b5106b26dc45130c4)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F98e565fa06f6c7bf7c46833b5106b26dc45130c4)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002Fwit)\n\n+ **Conceptual 12M：将网络规模的图文预训练推向长尾视觉概念识别**（2021年2月17日）\u003Cdetails>\u003Csummary>[CVPR 2021] 
索拉维特·昌皮尼奥、皮尤什·夏尔马、丁楠等\u003C\u002Fsummary>索拉维特·昌皮尼奥、皮尤什·夏尔马、丁楠、拉杜·索里库特\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.08981)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-617-blue.svg?paper=394be105b87e9bfe72c20efe6338de10604e1a11)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F394be105b87e9bfe72c20efe6338de10604e1a11)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002Fconceptual-12m)\n\n+ **[ALIGN] 利用噪声文本监督扩展视觉及视觉-语言表征学习**（2021年2月11日）\u003Cdetails>\u003Csummary>[ICML 2021] 贾超、杨音飞、夏叶等\u003C\u002Fsummary>贾超、杨音飞、夏叶、陈怡婷、扎拉娜·帕雷克、辉·范、阮国越、宋云轩、李振、汤姆·杜里格\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.05918)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2120-blue.svg?paper=141a5033d9994242b18bb3b217e79582f1ee9306)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F141a5033d9994242b18bb3b217e79582f1ee9306)\n\n+ **[MS COCO] 微软COCO：上下文中的常见物体**（2014年5月1日）\u003Cdetails>\u003Csummary>[ECCV 2014] 林宗义、迈克尔·梅尔、塞尔日·贝隆吉等\u003C\u002Fsummary>林宗义、迈克尔·梅尔、塞尔日·贝隆吉、卢博米尔·布尔代夫、罗斯·吉尔希克、詹姆斯·海斯、皮耶特罗·佩罗纳、德瓦·拉马南、C·劳伦斯·齐特尼克、皮奥特尔·多拉尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1405.0312)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-33630-blue.svg?paper=71b7178df5d2b112d07e45038cb5637208659ff7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F71b7178df5d2b112d07e45038cb5637208659ff7)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcocodataset.org\u002F#home)\n\n+ **[Im2Text] 使用100万张带字幕的照片描述图像**（2011年12月12日）\\\n[NeurIPS 2011] 维森特·奥尔多涅斯、吉里什·库尔卡尼、塔玛拉·伯格\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fpapers.nips.cc\u002Fpaper_files\u002Fpaper\u002F2011\u002Fhash\u002F5dd9db5e033da9c6fb5ba83c7a7ebea9-Abstract.html)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1192-blue.svg?paper=8e080b98efbe65c02a116439205ca2344b9f7cd4)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8e080b98efbe65c02a116439205ca2344b9f7cd4)\n\n\n\n## 视频生成\n\n### 🔅 基于LLM\n\n+ **Loong：利用自回归语言模型生成分钟级长视频**（2024年10月3日）\u003Cdetails>\u003Csummary>王玉清、熊天伟、周大泉等\u003C\u002Fsummary>王玉清、熊天伟、周大泉、林志杰、赵洋、康炳义、冯佳时、刘熙辉\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.02757)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=1ac7fc5a55ce5843fb8a19d9f62b623e822bb7de)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FLoong%3A-Generating-Minute-level-Long-Videos-with-Wang-Xiong\u002F1ac7fc5a55ce5843fb8a19d9f62b623e822bb7de)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fepiphqny.github.io\u002FLoong-video\u002F)\n\n+ 
**基于LLM导演的组合式3D感知视频生成**（2024年8月31日）\u003Cdetails>\u003Csummary>朱瀚鑫、何天宇、唐安妮等\u003C\u002Fsummary>朱瀚鑫、何天宇、唐安妮、郭俊良、陈志博、卞江\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.00558)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fproject\u002Fcompositional-3d-aware-video-generation\u002F)\n\n\n+ **Anim-Director：用于可控动画视频生成的大规模多模态模型驱动智能体**（2024年8月19日）\u003Cdetails>\u003Csummary>[SIGGRAPH Asia 2024] 李云欣、史浩源、胡宝田等\u003C\u002Fsummary>李云欣、史浩源、胡宝田、王龙跃、朱嘉顺、徐金义、赵振、张敏\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.09787)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHITsz-TMG\u002FAnim-Director.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FHITsz-TMG\u002FAnim-Director?tab=readme-ov-file)\n\n+ **[BSQ-ViT] 基于二进制球面量化进行图像和视频标记化**（2024年6月11日）\\\n[技术报告] 赵越、熊元俊、菲利普·克雷亨布尔\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07548v1)  `tokenizer`\n\n\n+ **DriveDreamer-2：用于多样化驾驶视频生成的LLM增强型世界模型**（2024年3月11日）\u003Cdetails>\u003Csummary>赵国生、王晓峰、朱正等\u003C\u002Fsummary>赵国生、王晓峰、朱正、陈新泽、黄冠、包晓怡、王兴刚\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.06845)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=b34fb645165da381e27077282d69ff224dd2d5f5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FDriveDreamer-2%3A-LLM-Enhanced-World-Models-for-Video-Zhao-Wang\u002Fb34fb645165da381e27077282d69ff224dd2d5f5)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdrivedreamer2.github.io\u002F)\n\n\n+ **[Sora] 视频生成模型作为世界模拟器**（2024年2月15日）\u003Cdetails>\u003Csummary>蒂姆·布鲁克斯、比尔·皮布尔斯、康纳·霍姆斯等\u003C\u002Fsummary>蒂姆·布鲁克斯、比尔·皮布尔斯、康纳·霍姆斯、威尔·德普、郭宇飞、李静、大卫·施努尔、乔·泰勒、特洛伊·卢曼、埃里克·卢曼、克拉伦斯·吴、瑞奇·王、阿迪提亚·拉梅什\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fopenai.com\u002Fresearch\u002Fvideo-generation-models-as-world-simulators)\n\n\n+ **[LWM] 基于分块环形注意力机制的百万级视频与语言世界模型**（2024年2月13日）\u003Cdetails>\u003Csummary>刘浩、颜威尔、扎哈里亚等\u003C\u002Fsummary>刘浩、颜威尔、扎哈里亚、彼得·阿贝尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.08268)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-38-blue.svg?paper=9259b476c31ba52b7e9ed059e5fbce2125092738)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FWorld-Model-on-Million-Length-Video-And-Language-Liu-Yan\u002Fdb7498f569be9852a04b2bb5bd68bd2885820bea)\n\n\n+ **[LGVI] 
通过多模态大型语言模型实现语言驱动的视频修复**（2024年1月18日）\u003Cdetails>\u003Csummary>吴建宗、李向泰、司晨阳等\u003C\u002Fsummary>吴建宗、李向泰、司晨阳、周尚辰、杨景康、张江宁、李依宁、陈凯、童云海、刘子威、陈昌乐\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10226)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=02d96eb0da4a282831f14923d1a65976952b7177)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FTowards-Language-Driven-Video-Inpainting-via-Large-Wu-Li\u002F02d96eb0da4a282831f14923d1a65976952b7177)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjianzongwu.github.io\u002Fprojects\u002Frovi\u002F)\n\n+ **Video-LaVIT：采用解耦视觉-运动标记化的统一视频-语言预训练**（2024年2月）\u003Cdetails>\u003Csummary>金洋、孙志成、许坤等\u003C\u002Fsummary>金洋、孙志成、许坤、许坤、陈立伟、姜浩、黄曲哲、宋承儒、刘宇梁、张迪、宋洋、盖坤、穆亚东\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.03161)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=c1b5195bc09a2232ec2b69e5a2a6bd39b3162c62)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc1b5195bc09a2232ec2b69e5a2a6bd39b3162c62)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvideo-lavit.github.io\u002F)\n`tokenizer`\n\n+ **VideoDrafter：利用LLM实现内容一致的多场景视频生成**（2024年1月2日）\u003Cdetails>\u003Csummary>龙福臣、邱兆凡、姚婷等\u003C\u002Fsummary>龙福臣、邱兆凡、姚婷、梅涛\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.01256)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=fc84fcf269a37ed7ddcb1b0f2d7d1a00f677eaea)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffc84fcf269a37ed7ddcb1b0f2d7d1a00f677eaea)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvideodrafter.github.io\u002F)\n\n+ **[PRO-Motion] 计划、姿态与行动：迈向开放世界文本到动作生成**（2023年12月22日）\u003Cdetails>\u003Csummary>刘金鹏、戴文勋、王春雨等\u003C\u002Fsummary>刘金鹏、戴文勋、王春雨、程一吉、唐燕松、佟欣\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14828)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=4599d5af850da482f591a02a3b17d56e0d358771)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4599d5af850da482f591a02a3b17d56e0d358771)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmoonsliu.github.io\u002FPro-Motion\u002F)\n\n+ **VideoPoet：用于零样本视频生成的大语言模型**（2023年12月21日）\u003Cdetails>\u003Csummary>丹·孔德拉秋克、于丽君、顾秀叶等\u003C\u002Fsummary>丹·孔德拉秋克、于丽君、顾秀叶、何塞·莱萨马、乔纳森·黄、瑞秋·霍尔农、哈特维格·亚当、哈桑·阿克巴里、亚伊尔·阿隆、维格内什·比罗德卡尔、永程、明昌·邱、乔什·迪伦、伊尔凡·埃萨、阿格里姆·古普塔、米拉·哈恩、安雅·豪斯、大卫·亨登、阿隆索·马丁内斯、大卫·米嫩、大卫·罗斯、格兰特·辛德勒、米哈伊尔·西罗滕科、基休克·孙、克里希纳·索曼德帕利、惠生·王、吉米·严、明轩·杨、玄杨、布莱恩·塞博尔德、陆江\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14125)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-18-blue.svg?paper=0c4f46e4dcae5527018e6432fb60cfe8c3354e97)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0c4f46e4dcae5527018e6432fb60cfe8c3354e97)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsites.research.google\u002Fvideopoet\u002F)\n\n+ 
**FlowZero：基于大语言模型驱动的动态场景语法的零样本文生视频合成**（2023年11月27日）\u003Cdetails>\u003Csummary>[arXiv 2023] 陆宇、朱林超、范鹤鹤等\u003C\u002Fsummary>陆宇、朱林超、范鹤鹤、杨毅\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15813)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=9cdb7e415a96795dc6705e66f3b798238b4dec2c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9cdb7e415a96795dc6705e66f3b798238b4dec2c)\n\n+ **InterControl：通过控制每个关节生成人体运动交互**（2023年11月27日）\u003Cdetails>\u003Csummary>王振志、王景博、林达华等\u003C\u002Fsummary>王振志、王景博、林达华、戴波\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15864)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=9cdb7e415a96795dc6705e66f3b798238b4dec2c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9cdb7e415a96795dc6705e66f3b798238b4dec2c)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhenzhiwang\u002Fintercontrol.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzhenzhiwang\u002Fintercontrol)\\\n标签：`人体运动生成`\n+ **MotionLLM：基于大语言模型的多模态运动-语言学习**（2024年5月27日）\u003Cdetails>\u003Csummary>吴琪、赵宇博、王一帆等\u003C\u002Fsummary>吴琪、赵宇博、王一帆、戴宇英、唐志康\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.17013)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=480da1ac2d39b5e036ce786af081366c23f08d1b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F480da1ac2d39b5e036ce786af081366c23f08d1b)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fknoxzhao.github.io\u002FMotionLLM\u002F)\\\n标签：`通用人体运动生成`\n+ **GPT4Motion：通过面向Blender的GPT规划在文生视频中编写物理运动脚本**（2023年11月21日）\u003Cdetails>\u003Csummary>吕嘉熙、黄毅、严明富等\u003C\u002Fsummary>吕嘉熙、黄毅、严明富、黄建诚、刘建庄、刘一凡、温亚飞、陈晓欣、陈世峰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.12631)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=9cdb7e415a96795dc6705e66f3b798238b4dec2c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9cdb7e415a96795dc6705e66f3b798238b4dec2c)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgpt4motion.github.io\u002F)\n\n+ **[LVD] 基于大语言模型的视频扩散模型**（2023年9月29日）\u003Cdetails>\u003Csummary>连龙、史百丰、亚当·亚拉等\u003C\u002Fsummary>连龙、史百丰、亚当·亚拉、特雷弗·达雷尔、李博伊\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17444)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=87bf66eb6d22df17f70170a0e575b4f12c4813ef)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F87bf66eb6d22df17f70170a0e575b4f12c4813ef)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllm-grounded-video-diffusion.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTonyLianLong\u002FLLM-groundedVideoDiffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTonyLianLong\u002FLLM-groundedVideoDiffusion)\n\n+ **VideoDirectorGPT：基于大语言模型引导规划的一致性多场景视频生成**（2023年9月26日）\u003Cdetails>\u003Csummary>[arXiv 2023] 
林涵、阿拜·扎拉、曹载民等\u003C\u002Fsummary>林涵、阿拜·扎拉、曹载民、莫希特·班萨尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15091)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=16753e0317730e8c1b297338300a8c6163dd06f2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F16753e0317730e8c1b297338300a8c6163dd06f2)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvideodirectorgpt.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHL-hanlin\u002FVideoDirectorGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FHL-hanlin\u002FVideoDirectorGPT)\n\n+ **Free-Bloom：具有大语言模型导演和LDM动画师的零样本文生视频生成器**（2023年9月25日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] 黄瀚卓、冯宇凡、施成等\u003C\u002Fsummary>黄瀚卓、冯宇凡、施成、徐兰、俞静怡、杨思贝\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.14494)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-24-blue.svg?paper=120aca3e415b6641a0b0cd20695ab85ed7789612)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F120aca3e415b6641a0b0cd20695ab85ed7789612)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSooLab\u002FFree-Bloom.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FSooLab\u002FFree-Bloom)\n\n+ **[Dysen-VDM] 利用大语言模型赋能动态感知的文生视频扩散模型**（2023年8月26日）\u003Cdetails>\u003Csummary>[CVPR 2024] 费浩、吴圣琼、季伟等\u003C\u002Fsummary>费浩、吴圣琼、季伟、张汉旺、蔡达生\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.13812)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=d0a7f7fe31e0e0c42b471b4c47a313bd8c8e5206)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd0a7f7fe31e0e0c42b471b4c47a313bd8c8e5206)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](http:\u002F\u002Fhaofei.vip\u002FDysen-VDM\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fscofield7419\u002FDysen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fscofield7419\u002FDysen)\n\n+ **[DirecT2V] 大语言模型是零样本文生视频生成的帧级导演**（2023年5月23日）\u003Cdetails>\u003Csummary>[arXiv 2023] 洪秀成、徐俊英、洪承焕等\u003C\u002Fsummary>洪秀成、徐俊英、洪承焕、申熙成、金承龙\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14330)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-15-blue.svg?paper=b1750d2a6e3480e690999916a86c8b3876577b39)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb1750d2a6e3480e690999916a86c8b3876577b39)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKU-CVLAB\u002FDirecT2V.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FKU-CVLAB\u002FDirecT2V)\n\n+ **Text2Motion：从自然语言指令到可行计划**（2023年3月21日）\u003Cdetails>\u003Csummary>[自主机器人2023] 
凯文·林、克里斯托弗·阿吉亚、托基·米吉松等\u003C\u002Fsummary>凯文·林、克里斯托弗·阿吉亚、托基·米吉松、马可·帕沃内、珍妮特·博格\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12153)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-110-blue.svg?paper=8f2d4758e6d525509ae36bb30224dc9259027e6b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8f2d4758e6d525509ae36bb30224dc9259027e6b)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsites.google.com\u002Fstanford.edu\u002Ftext2motion)\n\n\n\n### 非LLM类\n\n+ **OSV：一步即可实现高质量图像到视频生成**（2024年9月17日）\u003Cdetails>\u003Csummary>毛晓峰、蒋正凯、王福云等\u003C\u002Fsummary>毛晓峰、蒋正凯、王福云、朱文兵、张江宁、陈浩、迟明敏、王亚彪\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2409.11367)\n\n+ **[PAB] 基于金字塔注意力广播的实时视频生成**（2024年6月26日）\u003Cdetails>\u003Csummary>赵轩磊、金晓龙、王凯等\u003C\u002Fsummary>赵轩磊、金晓龙、王凯、尤洋\u003C\u002Fdetails>\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Foahzxl.github.io\u002FPAB\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNUS-HPC-AI-Lab\u002FOpenDiT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNUS-HPC-AI-Lab\u002FOpenDiT)\n\n+ **Video-Infinity：分布式长视频生成**（2024年6月24日）\u003Cdetails>\u003Csummary>谭振雄、杨兴义、刘松华等\u003C\u002Fsummary>谭振雄、杨兴义、刘松华、王新超\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16260)\n\n+ **潘多拉：基于自然语言动作与视频的通用世界模型**（2024年6月12日）\u003Cdetails>\u003Csummary>项建南、刘广义、顾毅等\u003C\u002Fsummary>项建南、刘广义、顾毅、高琪悦、宁宇婷、查宇恒、冯泽宇、陶天华、郝世博、史业民、刘正中、埃里克·P·邢、胡志廷\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09455)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fworld-model.maitrix.org\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmaitrix-org\u002FPandora.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmaitrix-org\u002FPandora)\n\n+ **Text-Animator：可控视觉文本视频生成**（2024年6月25日）\u003Cdetails>\u003Csummary>刘琳、刘全德、钱圣居等\u003C\u002Fsummary>刘琳、刘全德、钱圣居、周源、周文刚、李厚强、谢凌溪、田琦\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09455)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flaulampaul.github.io\u002Ftext-animator.html)\n\n+ 
**MotionBooth：运动感知的定制化文本到视频生成**（2024年6月25日）\u003Cdetails>\u003Csummary>吴建宗、李向泰、曾艳红等\u003C\u002Fsummary>吴建宗、李向泰、曾艳红、张江宁、周倩玉、李依宁、童云海、陈凯\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.17758v1)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjianzongwu.github.io\u002Fprojects\u002Fmotionbooth\u002F)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=7178bbc5e8d2d9b11c890c60486ba2cc2b79b784)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FMotionBooth%3A-Motion-Aware-Customized-Text-to-Video-Wu-Li\u002F7178bbc5e8d2d9b11c890c60486ba2cc2b79b784)\n\n+ **FreeTraj：视频扩散模型中的免调参轨迹控制**（2024年6月24日）\u003Cdetails>\u003Csummary>邱浩楠、陈兆熙、王周霞等\u003C\u002Fsummary>邱浩楠、陈兆熙、王周霞、何英青、夏梦涵、刘子威\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16863)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](http:\u002F\u002Fhaonanqiu.com\u002Fprojects\u002FFreeTraj.html)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=1868d2c2f56a92044908a789049fdd44094fc8f2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FFreeTraj%3A-Tuning-Free-Trajectory-Control-in-Video-Qiu-Chen\u002F1868d2c2f56a92044908a789049fdd44094fc8f2)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Farthur-qiu\u002FFreeTraj.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Farthur-qiu\u002FFreeTraj)\n\n+ **识别并解决图像到视频扩散模型中的条件图像泄露问题**（2024年6月22日）\u003Cdetails>\u003Csummary>赵敏、朱洪洲、向晨东等\u003C\u002Fsummary>赵敏、朱洪洲、向晨东、郑凯文、李崇轩、朱军\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.15735v1)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcond-image-leak.github.io\u002F)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=ebf4f746d24d79d61c070f8c354b3371f461aafb)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FIdentifying-and-Solving-Conditional-Image-Leakage-Zhao-Zhu\u002Febf4f746d24d79d61c070f8c354b3371f461aafb)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002Fcond-image-leakage.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002Fcond-image-leakage\u002F)\n\n+ **Image Conductor：交互式视频合成的精准控制**（2024年6月21日）\u003Cdetails>\u003Csummary>李耀伟、王新涛、张兆阳等\u003C\u002Fsummary>李耀伟、王新涛、张兆阳、王周霞、袁子洋、谢良斌、邹月仙、单颖\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.15339)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=b0bd64273dc8075db530fd696ee7eecb179bb908)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FImage-Conductor%3A-Precision-Control-for-Interactive-Li-Wang\u002Fb0bd64273dc8075db530fd696ee7eecb179bb908)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliyaowei-stu\u002FImageConductor.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fliyaowei-stu\u002FImageConductor)\n\n+ 
**VideoScore：构建自动指标以模拟视频生成中的细粒度人类反馈**（2024年6月21日）\u003Cdetails>\u003Csummary>何轩、蒋东富、张戈等\u003C\u002Fsummary>何轩、蒋东富、张戈、Max Ku、Achint Soni、Sherman Siu、陈浩楠、Abhranil Chandra、姜子言、Aaran Arulraj、王凯、杜贵德、倪元生、吕博涵、Yaswanth Narsupalli、范荣奇、吕志恒、林宇辰、陈文虎\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.15252)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ftiger-ai-lab.github.io\u002FVideoScore\u002F)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=1680eedc706ef081c0b103457bb52c071ab924b8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVideoScore%3A-Building-Automatic-Metrics-to-Simulate-He-Jiang\u002F1680eedc706ef081c0b103457bb52c071ab924b8)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTIGER-AI-Lab\u002FVideoScore.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTIGER-AI-Lab\u002FVideoScore)\n\n+ **Dreamitate：通过视频生成实现的真实世界视觉运动策略学习**（2024年6月24日）\u003Cdetails>\u003Csummary>梁俊邦、刘若诗、厄格·奥兹古罗格鲁等\u003C\u002Fsummary>梁俊邦、刘若诗、厄格·奥兹古罗格鲁、Sruthi Sudhakar、Achal Dave、Pavel Tokmakov、宋舒然、Carl Vondrick\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16862)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdreamitate.cs.columbia.edu\u002F)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=b0ac4f62f55bcf0427008e18f1b4b5bf7ee43df2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FDreamitate%3A-Real-World-Visuomotor-Policy-Learning-Liang-Liu\u002Fb0ac4f62f55bcf0427008e18f1b4b5bf7ee43df2)\n\n+ **[MCM] 运动一致性模型：通过解耦的运动-外观蒸馏加速视频扩散**（2024年6月11日）\u003Cdetails>\u003Csummary>翟远豪、林凯文、杨正源等\u003C\u002Fsummary>翟远豪、林凯文、杨正源、李林杰、王建峰、林忠清、戴维·多尔曼、袁军松、王丽娟\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.06890v1)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fyhzhai.github.io\u002Fmcm\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FyhZhai\u002Fmcm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FyhZhai\u002Fmcm)\n\n+ **搜索先验知识使文本到视频合成效果更好**（2024年6月5日）\u003Cdetails>\u003Csummary>程浩然、彭亮、夏林轩等\u003C\u002Fsummary>程浩然、彭亮、夏林轩、胡岳鹏、李恒嘉、陆青林、何晓飞、吴博熙\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.03215)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=0dd0e0bdff37973e102a042f82cd882b890476cc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FSearching-Priors-Makes-Text-to-Video-Synthesis-Cheng-Peng\u002F0dd0e0bdff37973e102a042f82cd882b890476cc)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fhrcheng98.github.io\u002FSearch_T2V\u002F)\n\n+ 
**ZeroSmooth：用于高帧率视频生成的免训练扩散模型适配方法**（2024年6月3日）\u003Cdetails>\u003Csummary>杨绍书、张勇、寸晓东等\u003C\u002Fsummary>杨绍书、张勇、寸晓东、单颖、何冉\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.00908v1)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=2917706886df4e3bf57acd0b41bd4e396be77506)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FZeroSmooth%3A-Training-free-Diffuser-Adaptation-for-Yang-Zhang\u002F2917706886df4e3bf57acd0b41bd4e396be77506#cited-papers)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fssyang2020.github.io\u002Fzerosmooth.github.io\u002F)\n\n+ **CV-VAE：面向潜在视频生成模型的兼容视频VAE**（2024年5月30日）\u003Cdetails>\u003Csummary>赵思杰、张勇、寸晓东等\u003C\u002Fsummary>赵思杰、张勇、寸晓东、杨绍书、牛牧瑶、李晓宇、胡文博、单颖\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.20279)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=70569a07d841f86faf8914aea435a1696f911a32)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FCV-VAE%3A-A-Compatible-Video-VAE-for-Latent-Video-Zhao-Zhang\u002F70569a07d841f86faf8914aea435a1696f911a32)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002Fcvvae\u002Findex.html)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FCV-VAE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FCV-VAE)\n\n+ **[MOFT] 视频扩散模型是免训练的运动解释器和控制器**（2024年5月23日）\u003Cdetails>\u003Csummary>肖泽奇、周一帆、杨帅等\u003C\u002Fsummary>肖泽奇、周一帆、杨帅、潘新钢\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14864)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=4f3e62c0fea3dc43f345e775192c972760b9d113)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVideo-Diffusion-Models-are-Training-free-Motion-and-Xiao-Zhou\u002F4f3e62c0fea3dc43f345e775192c972760b9d113)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fxizaoqu.github.io\u002Fmoft\u002F)\n\n+ **StreamingT2V：从文本生成一致、动态且可扩展的长视频**（2024年3月21日）\u003Cdetails>\u003Csummary>罗伯托·亨舍尔、列翁·哈恰特良、达尼尔·海拉佩蒂扬等\u003C\u002Fsummary>罗伯托·亨舍尔、列翁·哈恰特良、达尼尔·海拉佩蒂扬、海克·波戈相、瓦赫拉姆·塔德沃相、王章阳、尚特·纳瓦萨尔迪扬、洪普里·史\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.14773)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=21a77ed349c8621d0a0ef8407eb744e3de3b13c5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FStreamingT2V%3A-Consistent%2C-Dynamic%2C-and-Extendable-Henschel-Khachatryan\u002F21a77ed349c8621d0a0ef8407eb744e3de3b13c5)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPicsart-AI-Research\u002FStreamingT2V.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPicsart-AI-Research\u002FStreamingT2V)\n\n+ **Snap Video: 
用于文本到视频合成的可扩展时空Transformer**（2024年2月22日）\u003Cdetails>\u003Csummary>威利·梅纳帕切、阿列克桑德尔·西亚罗欣、伊万·斯科罗霍多夫等\u003C\u002Fsummary>威利·梅纳帕切、阿列克桑德尔·西亚罗欣、伊万·斯科罗霍多夫、叶卡捷琳娜·杰涅卡、蔡申·陈、阿尼尔·卡格、方宇伟、阿列克谢·斯托利亚尔、埃丽莎·里奇、任健、谢尔盖·图利亚科夫\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.14797)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=97cb2eb0d0517e34bf4202f0593600bb6fa043cd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FSnap-Video%3A-Scaled-Spatiotemporal-Transformers-for-Menapace-Siarohin\u002F97cb2eb0d0517e34bf4202f0593600bb6fa043cd)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsnap-research.github.io\u002Fsnapvideo\u002F)\n\n+ **VideoCrafter2: 克服高质量视频扩散模型的数据限制**（2024年1月17日）\u003Cdetails>\u003Csummary>陈浩鑫、张勇、孙晓东等\u003C\u002Fsummary>陈浩鑫、张勇、孙晓东、夏梦涵、王新涛、翁超、单颖\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09047)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=492bc8339d8aac442c4ec13f8c1d59e980a3af2f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVideoCrafter2%3A-Overcoming-Data-Limitations-for-Chen-Zhang\u002F492bc8339d8aac442c4ec13f8c1d59e980a3af2f)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002Fvideocrafter2\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FVideoCrafter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FVideoCrafter)\n\n\n+ **稳定视频扩散：将潜在视频扩散模型扩展至大规模数据集**（2023年11月25日）\u003Cdetails>\u003Csummary>安德烈亚斯·布拉特曼、蒂姆·多克霍恩、苏米特·库拉尔等\u003C\u002Fsummary>安德烈亚斯·布拉特曼、蒂姆·多克霍恩、苏米特·库拉尔、丹尼尔·门德列维奇、马切伊·基利安、多米尼克·洛伦茨、亚姆·莱维、锡安·英格利什、维克拉姆·沃莱蒂、亚当·莱茨、瓦伦·詹帕尼、罗宾·隆巴赫\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15127)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-51-blue.svg?paper=1206b05eae5a06ba662ae79fb291b50e359c4f42)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1206b05eae5a06ba662ae79fb291b50e359c4f42)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fstability.ai\u002Fresearch\u002Fstable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FStability-AI\u002Fgenerative-models.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FStability-AI\u002Fgenerative-models)\n\n+ **VideoCrafter1: 
用于高质量视频生成的开放扩散模型**（2023年10月30日）\u003Cdetails>\u003Csummary>陈浩鑫、夏梦涵、何英青等\u003C\u002Fsummary>陈浩鑫、夏梦涵、何英青、张勇、孙晓东、杨绍书、邢金波、刘耀芳、陈启峰、王新涛、翁超、单颖\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.19512)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-38-blue.svg?paper=1891c3756f870d902a0b793a1dcd5cc34c778612)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1891c3756f870d902a0b793a1dcd5cc34c778612)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002Fvideocrafter\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FVideoCrafter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FVideoCrafter)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVideoCrafter\u002FVideoCrafter)\n\n+ **DynamiCrafter: 利用视频扩散先验为开放域图像添加动画效果**（2023年10月18日）\u003Cdetails>\u003Csummary>邢金波、夏梦涵、张勇等\u003C\u002Fsummary>邢金波、夏梦涵、张勇、陈浩鑫、于旺博、刘汉源、王新涛、黄天赐、单颖\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.12190)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-42-blue.svg?paper=083bab4a967c2221d9f4da9110fe37d8ca679078)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FDynamiCrafter%3A-Animating-Open-domain-Images-with-Xing-Xia\u002F083bab4a967c2221d9f4da9110fe37d8ca679078)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdoubiiu.github.io\u002Fprojects\u002FDynamiCrafter\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDoubiiu\u002FDynamiCrafter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FDoubiiu\u002FDynamiCrafter)\n\n+ **FreeNoise: 通过噪声重调度实现无需调优的更长视频扩散**（2023年10月23日）\u003Cdetails>\u003Csummary>邱浩楠、夏梦涵、张勇等\u003C\u002Fsummary>邱浩楠、夏梦涵、张勇、何英青、王新涛、单颖、刘子威\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.15169)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=d831988859f0c077b38094446d8585a8340af223)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd831988859f0c077b38094446d8585a8340af223)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](http:\u002F\u002Fhaonanqiu.com\u002Fprojects\u002FFreeNoise.html)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Farthur-qiu\u002FLongerCrafter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Farthur-qiu\u002FLongerCrafter)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FMoonQiu\u002FLongerCrafter)\n\n+ **Animate-A-Story: 
基于检索增强的视频生成进行故事讲述**（2023年7月13日）\u003Cdetails>\u003Csummary>何英青、夏梦涵、陈浩鑫等\u003C\u002Fsummary>何英青、夏梦涵、陈浩鑫、孙晓东、龚元、邢金波、张勇、王新涛、翁超、单颖、陈启峰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06940)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-18-blue.svg?paper=77040969110fab39a55699cb06f9edf68789445a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F77040969110fab39a55699cb06f9edf68789445a)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002FAnimate-A-Story\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FAnimate-A-Story.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FAnimate-A-Story)\n\n+ **Make-Your-Video：基于文本和结构引导的定制化视频生成**（2023年6月1日）\u003Cdetails>\u003Csummary>邢金波、夏梦涵、刘宇欣等\u003C\u002Fsummary>邢金波、夏梦涵、刘宇欣、张岳辰、张勇、何英青、刘瀚源、陈浩鑫、寸晓东、王新涛、单颖、王天津\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00943)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-31-blue.svg?paper=52b10ae66d025e99fbb602935e155f97f4f0696f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F52b10ae66d025e99fbb602935e155f97f4f0696f)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdoubiiu.github.io\u002Fprojects\u002FMake-Your-Video\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FMake-Your-Video.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FMake-Your-Video)\n\n+ **跟随你的姿态：利用无姿态视频进行姿态引导的文生视频生成**（2023年4月3日）\u003Cdetails>\u003Csummary>马悦、何英青、寸晓东等\u003C\u002Fsummary>马悦、何英青、寸晓东、王新涛、陈思然、单颖、李秀、陈启峰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.01186)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-53-blue.svg?paper=ee73edebd42626d9c2d91e35fd2ed3cdb0fb26d0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fee73edebd42626d9c2d91e35fd2ed3cdb0fb26d0)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ffollow-your-pose.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmayuelala\u002FFollowYourPose.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmayuelala\u002FFollowYourPose)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FYueMafighting\u002FFollowYourPose)\n\n+ **图像与视频的实时可控去噪**（2023年3月29日）\u003Cdetails>\u003Csummary>[CVPR 2023] 张兆阳、蒋一桐、邵文琪等\u003C\u002Fsummary>张兆阳、蒋一桐、邵文琪、王小刚、罗平、林凯莫、顾金伟\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.16425)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-108-blue.svg?paper=3f3746c3c64212e97c877bd3d862b578fa24632c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FReal-Time-Controllable-Denoising-for-Image-and-Zhang-Jiang\u002F3f3746c3c64212e97c877bd3d862b578fa24632c)\n\n+ 
**VideoFusion：用于高质量视频生成的分解扩散模型**（2023年3月15日）\u003Cdetails>\u003Csummary>罗正雄、陈大有、张颖雅等\u003C\u002Fsummary>罗正雄、陈大有、张颖雅、黄燕、王亮、沈宇俊、赵德利、周景仁、谭天牛\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.08320)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-108-blue.svg?paper=26c6090b7e7ba4513f82aa28d41360c60770c618)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F26c6090b7e7ba4513f82aa28d41360c60770c618)\n\n### 视频 VAE\u002F分词器\n\n\n+ **DLFR-VAE：用于视频生成的动态潜在帧率 VAE**（2025年2月17日）\u003Cdetails>\u003Csummary>袁志航、王思源、谢睿等\u003C\u002Fsummary>袁志航、王思源、谢睿、张汉岭、方通成、尚宇章、颜圣根、戴国豪、王宇\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11897)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-nics\u002FDLFR-VAE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-nics\u002FDLFR-VAE)\n\n+ **VideoVAE+：基于跨模态视频 VAE 的大规模运动视频自动编码**（2024年12月23日）\u003Cdetails>\u003Csummary>邢亚周、费阳、何英清等\u003C\u002Fsummary>邢亚周、费阳、何英清、陈景业、谢佳欣、迟晓伟、陈启峰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17805)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fyzxing87.github.io\u002Fvae\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVideoVerses\u002FVideoVAEPlus.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVideoVerses\u002FVideoVAEPlus?tab=readme-ov-file)\n\n+ **VidTwin：具有解耦结构与动力学的视频 VAE**（2024年12月23日）\u003Cdetails>\u003Csummary>王宇驰、郭俊亮、谢心怡等\u003C\u002Fsummary>王宇驰、郭俊亮、谢心怡、何天宇、孙旭、卞江\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17726)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FVidTok.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FVidTok\u002Ftree\u002Fmain\u002Fvidtwin)\n\n\n+ **VidTok：一款多功能且开源的视频分词器**（2024年12月17日）\u003Cdetails>\u003Csummary>唐安妮、何天宇、郭俊亮等\u003C\u002Fsummary>唐安妮、何天宇、郭俊亮、程新乐、宋莉、卞江\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.13061)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FVidTok.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FVidTok)\n\n\n+ **[CVPR 2025] WF-VAE：通过小波驱动的能量流提升潜在视频扩散模型的视频 VAE**（2024年11月26日）\u003Cdetails>\u003Csummary>李宗健、林斌、叶阳等\u003C\u002Fsummary>李宗健、林斌、叶阳、陈刘翰、程新华、袁圣海、袁立\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.17459)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=58d9eaa0868e971687c20d0588de3058b7780b51)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FWF-VAE%3A-Enhancing-Video-VAE-by-Wavelet-Driven-Flow-Li-Lin\u002F58d9eaa0868e971687c20d0588de3058b7780b51)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002Fcvvae\u002Findex.html)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FWF-VAE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FWF-VAE)\n\n\n+ **[CVPR 2025] [IV-VAE] 
针对潜在视频扩散模型的改进型视频 VAE**（2024年11月10日）\u003Cdetails>\u003Csummary>吴平宇、朱凯、刘宇等\u003C\u002Fsummary>吴平宇、朱凯、刘宇、赵利明、翟伟、曹阳、查正军\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.06449)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=4e073da5a37753fba320719baaa17ca593e6a094)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FImproved-Video-VAE-for-Latent-Video-Diffusion-Model-Wu-Zhu\u002F4e073da5a37753fba320719baaa17ca593e6a094)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwpy1999.github.io\u002FIV-VAE\u002F)\n\n\n\n+ **[技术报告] Cosmos 分词器：一套图像和视频神经网络分词器**（2024年11月6日）\u003Cdetails>\u003Csummary>菲茨姆·雷达、顾金伟、刘贤等\u003C\u002Fsummary>菲茨姆·雷达、顾金伟、刘贤、葛松伟、王廷春、王浩翔、刘明宇\u003C\u002Fdetails>\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fdir\u002Fcosmos-tokenizer\u002F)\n[![TokenBench](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTokenBench-00000.svg)](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FTokenBench?tab=readme-ov-file)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FCosmos-Tokenizer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FCosmos-Tokenizer)\n\n\n+ **[NeurIPS 2024] CV-VAE：一种兼容潜在生成式视频模型的视频 VAE**（2024年5月30日）\u003Cdetails>\u003Csummary>赵思杰、张勇、寸晓东等\u003C\u002Fsummary>赵思杰、张勇、寸晓东、杨绍书、牛牧瑶、李晓宇、胡文博、山英\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.20279)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=70569a07d841f86faf8914aea435a1696f911a32)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FCV-VAE%3A-A-Compatible-Video-VAE-for-Latent-Video-Zhao-Zhang\u002F70569a07d841f86faf8914aea435a1696f911a32)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002Fcvvae\u002Findex.html)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FCV-VAE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FCV-VAE)\n\n\n\n+ **[ICLR 2024] [MAGVIT-v2] 语言模型胜过扩散模型——分词器是视觉生成的关键**（2023年10月9日）\u003Cdetails>\u003Csummary>于立军、何塞·莱萨马、尼特什·B·贡达瓦拉普等\u003C\u002Fsummary>于立军、何塞·莱萨马、尼特什·B·贡达瓦拉普、卢卡·韦尔萨里、苏基赫·索恩、大卫·米嫩、程勇、阿格里姆·古普塔、顾秀叶、亚历山大·G·豪普特曼、龚博庆、杨明轩、伊尔凡·埃萨、大卫·A·罗斯、蒋璐\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.05737)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-15-blue.svg?paper=985f0c89c5a607742ec43c1fdc2cbfe54541cbad)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F985f0c89c5a607742ec43c1fdc2cbfe54541cbad)\n\n\n\n### 音频-视频\n****\n\n+ **JavisDiT：具有层次化时空先验同步的联合音频-视频扩散 Transformer**（2025年3月30日）\u003Cdetails>\u003Csummary>刘凯、李伟、陈来等\u003C\u002Fsummary>刘凯、李伟、陈来、吴盛琼、郑彦浩、姬嘉仪、周帆、姜荣鑫、罗杰波、费浩、蔡德成\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23377)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjavisdit.github.io\u002F)\n\n+ **[LVAS-Agent] 多智能体协作的长视频音频合成** (2025年3月13日)\n  \u003Cdetails>\u003Csummary>张业航、徐新力、徐晓杰等\u003C\u002Fsummary>张业航、徐新力、徐晓杰、刘莉、陈英聪\u003C\u002Fdetails>\n  \n  
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10719)\n  [![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=2503.10719)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2503.10719)\n  [![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flvas-agent.github.io\u002F)\n  [![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZYH-Lightyear\u002FLVAS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FZYH-Lightyear\u002FLVAS)\n\n\n+ **UniForm: 面向音视频生成的统一扩散Transformer** (2025年2月6日)\n    \n    \u003Cdetails>\u003Csummary>赵磊、冯林峰、葛东旭等\u003C\u002Fsummary>赵磊、冯林峰、葛东旭、易方秋、张驰、张小雷、李学龙\u003C\u002Fdetails>\n\n    [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03897)\n    [![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=7a4436209d8cf0d868b13000c8abff63d72daf0f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FUniForm%3A-A-Unified-Diffusion-Transformer-for-Zhao-Feng\u002F7a4436209d8cf0d868b13000c8abff63d72daf0f)\n    [![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Funiform-t2av.github.io)\n\n\n+ **TIA2V：基于文本—图像—音频三模态条件的视频生成** (2025年1月4日)\u003Cdetails>\u003Csummary>赵明禄、王文敏、张睿等。\u003C\u002Fsummary>赵明禄、王文敏、张睿、贾浩美、陈琪\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-b31b1b.svg)](https:\u002F\u002Fdoi.org\u002F10.1016\u002Fj.eswa.2024.126278)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMinglu58\u002FTIA2V.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMinglu58\u002FTIA2V)\n\n\n\n+ **SAVGBench：空间对齐音视频生成基准测试** (2024年12月18日)\n\n    \u003Cdetails>\u003Csummary>岛田一树、克里斯蒂安·西蒙、涩谷隆史等。\u003C\u002Fsummary>岛田一树、克里斯蒂安·西蒙、涩谷隆史、高桥修介、光藤由纪\u003C\u002Fdetails>\n\n    [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.13462)\n\n\n+ **AV-Link：用于跨模态音视频生成的时间对齐扩散特征** (2024年12月19日)\n\n    \u003Cdetails>\u003Csummary>莫亚德·哈吉-阿里、威利·梅纳帕切、阿利亚克桑德尔·谢罗欣等。\u003C\u002Fsummary>莫亚德·哈吉-阿里、威利·梅纳帕切、阿利亚克桑德尔·谢罗欣、伊万·斯科罗霍多夫、阿尔珀·坎贝尔克、郭森李、比森特·奥尔多涅斯、谢尔盖·图利亚科夫\u003C\u002Fdetails>\n\n    [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15191)\n    [![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsnap-research.github.io\u002FAVLink\u002F)\n\n\n+ **SyncFlow：基于文本的时序对齐联合音视频生成** (2024年12月3日)\n\n    \u003Cdetails>\u003Csummary>刘浩赫、盖尔·勒朗、梅新浩等。\u003C\u002Fsummary>刘浩赫、盖尔·勒朗、梅新浩、倪兆恒、阿努拉格·库马尔、瓦伦·纳加拉贾、王文武、马克·D·普兰布利、石阳阳、维卡斯·钱德拉\u003C\u002Fdetails>\n\n    [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15220)\n\n\n\n+ **用于声音化视频生成的简单而强大的基线：音频与视频扩散模型在联合生成中的有效适配** (2024年9月26日)\n\n    \u003Cdetails>\u003Csummary>石井正人、早川昭夫、涩谷隆史。\u003C\u002Fsummary>石井正人、早川昭夫、涩谷隆史、光藤由纪\u003C\u002Fdetails>\n\n    [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17550)\n\n\n+ **AV-DiT：用于联合音视频生成的高效视听扩散Transformer** (2024年6月11日)\n\n    \u003Cdetails>\u003Csummary>王凯、邓世健、施静等。\u003C\u002Fsummary>王凯、邓世健、施静、迪米特里奥斯·哈津纳科斯、田亚鹏\u003C\u002Fdetails>\n\n    
[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07686)\n\n\n+ **判别器引导的协同扩散用于联合音视频生成** (2024年5月28日)\u003Cdetails>\u003Csummary>早川昭夫、石井正人、涩谷隆史等。\u003C\u002Fsummary>早川昭夫、石井正人、涩谷隆史、光藤由纪\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.17842)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSonyResearch\u002FMMDisCo.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FSonyResearch\u002FMMDisCo)\n\n\n\n+ **AudioScenic：音频驱动的视频场景编辑** (2024年4月25日)\u003Cdetails>\u003Csummary>沈凯欣、全瑞洁、朱林超等。\u003C\u002Fsummary>沈凯欣、全瑞洁、朱林超、肖俊、杨毅\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.16581)\n\n\n\n+ **具有噪声水平混合的多功能扩散Transformer用于视听生成** (2024年5月22日)\u003Cdetails>\u003Csummary>金光贤、阿隆索·马丁内斯、苏宇川等。\u003C\u002Fsummary>金光贤、阿隆索·马丁内斯、苏宇川、布伦丹·周、何塞·莱萨马、阿格里姆·古普塔、于立军、江璐、阿伦·扬森、雅各布·沃克、克里希纳·索曼德帕利\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.13762)\n\n\n\n\n+ **利用多模态语言大模型进行语义一致的视频转音频生成** (2024年4月25日)\u003Cdetails>\u003Csummary>陈戈辉、王冠安、黄晓文等。\u003C\u002Fsummary>陈戈辉、王冠安、黄晓文、桑继涛\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.16305)\n\n\n+ **TAVGBench：文本到可听视频生成基准测试** (2024年4月22日)\u003Cdetails>\u003Csummary>毛宇鑫、申旭阳、张静等。\u003C\u002Fsummary>毛宇鑫、申旭阳、张静、秦振、周锦星、向墨初、钟怡然、戴宇超\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14381)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FTAVGBench)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenNLPLab\u002FTAVGBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FTAVGBench)\n\n\n+ **[ECCV 2024 口头报告] ASVA：音频同步的视觉动画** (2024年3月8日)\u003Cdetails>\u003Csummary>张琳、莫申通、张艺京等。\u003C\u002Fsummary>张琳、莫申通、张艺京、佩德罗·莫尔加多\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05659)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flzhangbj.github.io\u002Fprojects\u002Fasva\u002Fasva.html)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flzhangbj\u002FASVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Flzhangbj\u002FASVA)\n\n+ **[CVPR 2024] 看与听：基于扩散潜码对齐器的开放域视觉-音频生成**（2024年2月27日）\u003Cdetails>\u003Csummary>邢亚舟、何英青、田泽岳等\u003C\u002Fsummary>邢亚舟、何英青、田泽岳、王新涛、陈启峰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17723)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyzxing87\u002FSeeing-and-Hearing.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyzxing87\u002FSeeing-and-Hearing)\n\n\n+ **TräumerAI：使用StyleGAN创作梦幻音乐**（2021年2月9日）\u003Cdetails>\u003Csummary>郑多暹、都承宪、权泰均（NeurIPS 
2020研讨会）\u003C\u002Fsummary>郑多暹、都承宪、权泰均\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.04680)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-259-blue.svg?paper=8a1384e041cc6ea2735b01c734aeef666dc92884)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8a1384e041cc6ea2735b01c734aeef666dc92884)\n\n\n+ **Sound2Sight：从声音和上下文生成视觉动态**（2020年7月23日）\u003Cdetails>\u003Csummary>阿努普·切里安、莫伊特雷亚·查特吉、纳伦德拉·阿胡贾。（ECCV 2020）\u003C\u002Fsummary>阿努普·切里安、莫伊特雷亚·查特吉、纳伦德拉·阿胡贾\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.12130)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=02f3ced09497c5db59985b2a5db9d3d0aebe5074)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F02f3ced09497c5db59985b2a5db9d3d0aebe5074)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmerlresearch\u002FSound2Sight.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmerlresearch\u002FSound2Sight)\n\n\n\n\n\n\n\n\n\n### 基准测试\n\n\n+ **VBench++：视频生成模型的全面且多功能基准测试套件**（2024年11月20日）\u003Cdetails>\u003Csummary>黄子琪、张帆、徐晓杰等\u003C\u002Fsummary>黄子琪、张帆、徐晓杰、何一楠、于嘉硕、董子悦、马倩莉、纳塔波尔·灿派西特、司晨阳、蒋宇明、王耀辉、陈欣源、陈英聪、王利民、林大华、乔宇、刘子威\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.13503)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvchitect.github.io\u002FVBench-project\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVchitect\u002FVBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVchitect\u002FVBench)\n\n\n+ **[VideoGen-Eval] 视频生成的黎明：基于SORA类模型的初步探索**（2024年10月7日）\u003Cdetails>\u003Csummary>曾爱玲、杨宇航、陈卫东等\u003C\u002Fsummary>曾爱玲、杨宇航、陈卫东、刘伟\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05227)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Failab-cvc.github.io\u002FVideoGen-Eval\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAILab-CVC\u002FVideoGen-Eval.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FVideoGen-Eval)\n\n\n+ **ChronoMagic-Bench：文本到延时视频生成的变形评估基准**（2024年6月26日）\u003Cdetails>\u003Csummary>袁圣海、黄金发、许永奇等\u003C\u002Fsummary>袁圣海、黄金发、许永奇、刘耀扬、张绍峰、史宇君、朱睿杰、程新华、罗杰波、李元\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.18522v1)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fpku-yuangroup.github.io\u002FChronoMagic-Bench\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FChronoMagic-Bench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FChronoMagic-Bench)\n\n\n+ 
**TAVGBench：文本到可听视频生成的基准测试**（2024年4月22日）\u003Cdetails>\u003Csummary>毛宇鑫、沈旭阳、张静等\u003C\u002Fsummary>毛宇鑫、沈旭阳、张静、秦振、周锦星、向墨初、钟怡然、戴宇超\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.14381)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=4ba90678411ddc0a2eb997e1184b059bdc955fd5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FTAVGBench%3A-Benchmarking-Text-to-Audible-Video-Mao-Shen\u002F4ba90678411ddc0a2eb997e1184b059bdc955fd5)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenNLPLab\u002FTAVGBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FTAVGBench)\n\n\n+ **Sora生成具有惊人几何一致性的视频**（2024年2月27日）\u003Cdetails>\u003Csummary>李轩毅、周大泉、张辰旭等\u003C\u002Fsummary>李轩毅、周大泉、张辰旭、魏绍东、侯启斌、程明明\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17403)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsora-geometrical-consistency.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmeteorshowers\u002FSora-Generates-Videos-with-Stunning-Geometrical-Consistency.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmeteorshowers\u002FSora-Generates-Videos-with-Stunning-Geometrical-Consistency)\n\n\n+ **[CVPR 2024亮点] VBench：视频生成模型的综合基准测试套件**（2023年11月29日）\u003Cdetails>\u003Csummary>黄子琪、何一楠、于嘉硕等\u003C\u002Fsummary>黄子琪、何一楠、于嘉硕、张帆、司晨阳、蒋宇明、张远涵、吴天行、金庆阳、纳塔波尔·灿派西特、王耀辉、陈欣源、王利民、林大华、乔宇、刘子威\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17982)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=4e9a8141da2a8c603722b07d096109207f8e0b66)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4e9a8141da2a8c603722b07d096109207f8e0b66)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvchitect.github.io\u002FVBench-project\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVchitect\u002FVBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVchitect\u002FVBench)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVchitect\u002FVBench_Leaderboard)\n\n\n+ **[CVPR 2024] EvalCrafter：大型视频生成模型的基准测试与评估**（2023年10月）\u003Cdetails>\u003Csummary>刘耀芳、寸晓东、刘学博等\u003C\u002Fsummary>刘耀芳、寸晓东、刘学博、王新涛、张勇、陈浩鑫、刘洋、曾铁勇、雷蒙德·陈、山英\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11440)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fevalcrafter.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEvalCrafter\u002FEvalCrafter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FEvalCrafter\u002FEvalCrafter)\n\n\n### 数据集\n\n+ 
**VidGen-1M：用于文本到视频生成的大规模数据集**（2024年8月5日）\u003Cdetails>\u003Csummary>谭志宇、杨晓梦、秦洛铮等\u003C\u002Fsummary>谭志宇、杨晓梦、秦洛铮、李浩\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.02629)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=8af933a6e0d45e041a1ca35d461aad92022aa957)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVidGen-1M%3A-A-Large-Scale-Dataset-for-Text-to-video-Tan-Yang\u002F8af933a6e0d45e041a1ca35d461aad92022aa957)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSAIS-FUXI\u002FVidGen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FSAIS-FUXI\u002FVidGen)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsais-fuxi.github.io\u002Fprojects\u002Fvidgen-1m\u002F)\n\n\n+ **Vript：一段视频胜过千言万语**（2024年6月10日）\u003Cdetails>\u003Csummary>[NeurIPS 2024 数据集与基准赛道] 杨东杰、黄苏源、陆成强等\u003C\u002Fsummary>杨东杰、黄苏源、陆成强、韩晓东、张浩鑫、高岩、胡耀、赵海\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.06040)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=2fb2a76be0f261fb2660457a6cec8a8384b33a19)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVript%3A-A-Video-Is-Worth-Thousands-of-Words-Yang-Huang\u002F2fb2a76be0f261fb2660457a6cec8a8384b33a19)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmutonix\u002FVript.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmutonix\u002FVript)\n\n\n+ **MMTrail：包含语言和音乐描述的多模态预告片视频数据集**（2024年7月30日）\u003Cdetails>\u003Csummary>迟晓伟、王雅婷、程傲松等\u003C\u002Fsummary>迟晓伟、王雅婷、程傲松、方鹏俊、田泽悦、何英青、刘兆阳、齐兴群、潘嘉豪、张荣宇、李孟菲、袁瑞斌、蒋延兵、薛伟、罗文翰、陈启峰、张尚航、刘启峰、郭义克\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.20962)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=0d8ea8cf8fcadc0eb52304258e254e01c62dfe52)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FMMTrail%3A-A-Multimodal-Trailer-Video-Dataset-with-Chi-Wang\u002F0d8ea8cf8fcadc0eb52304258e254e01c62dfe52)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flitwellchi\u002FMMTrail.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Flitwellchi\u002FMMTrail)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmattie-e.github.io\u002FMMTrail\u002F)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flitwell\u002FMMTrail-20M)\n\n\n+ **InternVid：用于多模态理解和生成的大规模视频-文本数据集**（2023年7月13日）\u003Cdetails>\u003Csummary>[ICLR 2024 Spotlight] 
王毅、何一楠、李一卓等\u003C\u002Fsummary>王毅、何一楠、李一卓、李坤昌、于家硕、马欣、李新浩、陈国、陈新元、王耀辉、何聪慧、罗平、刘子威、王亚丽、王利民、乔宇\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06942)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-36-blue.svg?paper=369b449415d50387fba048bbd4d26ee890df84b5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F369b449415d50387fba048bbd4d26ee890df84b5)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVideo.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVideo)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenGVLab\u002FInternVid)\n\n+ **[HD-VG-130M] VideoFactory：在时空扩散模型中引入交换注意力以实现文本到视频生成**（2023年5月18日）\u003Cdetails>\u003Csummary>王文静、杨欢、拓子熙等\u003C\u002Fsummary>王文静、杨欢、拓子熙、何会国、朱俊辰、傅建龙、刘佳颖\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10874)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-43-blue.svg?paper=50bbf2c11984d18aa14f964a4909ac25f07e50ea)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F50bbf2c11984d18aa14f964a4909ac25f07e50ea)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdaooshee\u002FHD-VG-130M.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdaooshee\u002FHD-VG-130M?tab=readme-ov-file)\n\n+ **[VideoCC3M] 从图像字幕中学习音频-视频模态**（2022年4月）\u003Cdetails>\u003Csummary>[ECCV 2022] 阿尔沙·纳格拉尼、保罗·洪淑·徐、布莱恩·塞博尔德等\u003C\u002Fsummary>阿尔沙·纳格拉尼、保罗·洪淑·徐、布莱恩·塞博尔德、安雅·豪斯、圣地亚哥·马嫩、孙晨、科黛莉亚·施密德\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.00679)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-48-blue.svg?paper=aa1b722485106c84e52c5e35b2d4b2f8c7fb3135)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Faa1b722485106c84e52c5e35b2d4b2f8c7fb3135)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle-research-datasets\u002FvideoCC-data.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002FvideoCC-data)\n\n+ **CelebV-Text：大规模人脸文本-视频数据集**（2023年3月26日）\u003Cdetails>\u003Csummary>[CVPR 2023] 俞建辉、朱浩、姜立明等\u003C\u002Fsummary>俞建辉、朱浩、姜立明、吕健勤、蔡卫东、吴伟\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.14717)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=484d2194ce8459bfa9da906e556f63812c6ca999)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F484d2194ce8459bfa9da906e556f63812c6ca999)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcelebv-text.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCelebV-Text\u002FCelebV-Text.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FCelebV-Text\u002FCelebV-Text)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=0TS1hQwjNWw)\n\n+ **[HD-VILA-100M] 利用大规模视频转录推进高分辨率视频-语言表示**（2021年11月19日）\u003Cdetails>\u003Csummary>[CVPR 2022] 
薛宏伟、杭天凯、曾艳红等\u003C\u002Fsummary>薛宏伟、杭天凯、曾艳红、孙玉冲、刘贝、杨欢、傅建龙、郭百宁\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.10337)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-84-blue.svg?paper=e1a3e6856b6ac6af3600b5954392e5368603fd1b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe1a3e6856b6ac6af3600b5954392e5368603fd1b)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FXPretrain.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FXPretrain\u002Fblob\u002Fmain\u002Fhd-vila-100m\u002FREADME.md)\n\n+ **[YT-Temporal-180M] MERLOT：多模态神经脚本知识模型**（2021年6月4日）\u003Cdetails>\u003Csummary>[NeurIPS 2021] 罗温·泽勒斯、卢希明、杰克·赫塞尔等。\u003C\u002Fsummary>罗温·泽勒斯、卢希明、杰克·赫塞尔、柳英宰、朴在成、曹继泽、阿里·法拉哈迪、崔艺珍\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.02636)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-277-blue.svg?paper=90357a6dc817e2f7cec477a51156675fbf545cf1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F90357a6dc817e2f7cec477a51156675fbf545cf1)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frowanz\u002Fmerlot.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Frowanz\u002Fmerlot)\n\n+ **[WebVid-10M] 凝固时间：用于端到端检索的视频与图像联合编码器**（2021年4月1日）\u003Cdetails>\u003Csummary>[ICCV 2021] 马克斯·贝恩、阿尔莎·纳格拉尼、居尔·瓦罗尔等。\u003C\u002Fsummary>马克斯·贝恩、阿尔莎·纳格拉尼、居尔·瓦罗尔、安德鲁·齐塞曼\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.00650)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-585-blue.svg?paper=bac87bdb1cabc35fafb8176a234d332ebcc02864)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbac87bdb1cabc35fafb8176a234d332ebcc02864)\n\n+ **[WTS70M] 基于文本网络监督学习视频表示**（2020年7月29日）\u003Cdetails>\u003Csummary>乔纳森·C·斯特劳德、陆志超、陈孙等。\u003C\u002Fsummary>乔纳森·C·斯特劳德、陆志超、陈孙、邓佳、拉胡尔·苏坎塔尔、科黛莉亚·施密德、戴维·A·罗斯\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.14937)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-42-blue.svg?paper=da55208bc9b56b5f394c242239d8cd0734bd5a87)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fda55208bc9b56b5f394c242239d8cd0734bd5a87)\n\n+ **HowTo100M：通过观看一亿个带旁白的视频片段学习文本-视频嵌入**（2019年6月7日）\u003Cdetails>\u003Csummary>[ICCV 2019] 安托万·米埃什、季米特里·朱科夫、让-巴蒂斯特·阿拉伊拉克等。\u003C\u002Fsummary>安托万·米埃什、季米特里·朱科夫、让-巴蒂斯特·阿拉伊拉克、马卡兰德·塔帕斯维、伊万·拉普捷夫、约瑟夫·西维奇\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.03327)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-847-blue.svg?paper=9311779489e597315488749ee6c386bfa3f3512e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9311779489e597315488749ee6c386bfa3f3512e)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.di.ens.fr\u002Fwillow\u002Fresearch\u002Fhowto100m\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fantoine77340\u002Fhowto100m.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fantoine77340\u002Fhowto100m?tab=readme-ov-file)\n\n+ **VATEX：面向视频与语言研究的大规模高质量多语言数据集**（2019年4月6日）\u003Cdetails>\u003Csummary>[ICCV 2019 口头报告] 
王欣、吴嘉伟、陈俊坤等。\u003C\u002Fsummary>王欣、吴嘉伟、陈俊坤、李磊、王元芳、威廉·杨·王\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.03493)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-373-blue.svg?paper=28b74bb7c8b08cceb2430ec2d54dfa0f3225d796)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F28b74bb7c8b08cceb2430ec2d54dfa0f3225d796)\n\n+ **How2：大规模多模态语言理解数据集**（2018年11月）\u003Cdetails>\u003Csummary>[NeurIPS 2018] 拉蒙·萨纳布里亚、奥赞·卡格拉扬、舒鲁蒂·帕拉斯卡尔等。\u003C\u002Fsummary>拉蒙·萨纳布里亚、奥赞·卡格拉扬、舒鲁蒂·帕拉斯卡尔、德斯蒙德·埃利奥特、洛伊克·巴拉尔、露西亚·斯佩夏、弗洛里安·梅策\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.00347)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-223-blue.svg?paper=f56cb5dc32b5b280546998418fda7769d0858629)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff56cb5dc32b5b280546998418fda7769d0858629)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsrvk.github.io\u002Fhow2-dataset\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsrvk\u002Fhow2-dataset.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsrvk\u002Fhow2-dataset)\n\n+ **[ActivityNet Captions] 视频中事件的密集描述**（2017年5月2日）\u003Cdetails>\u003Csummary>[ICCV 2017] 兰贾伊·克里希纳、健二·畑、弗雷德里克·任等。\u003C\u002Fsummary>兰贾伊·克里希纳、健二·畑、弗雷德里克·任、李飞飞、胡安·卡洛斯·涅布雷斯\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.00754)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-934-blue.svg?paper=96dd1fc39a368d23291816d57763bc6eb4f7b8d6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F96dd1fc39a368d23291816d57763bc6eb4f7b8d6)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Franjaykrishna\u002Fdensevid\u002F)\n\n+ **[LSMDC] 电影描述**（2016年5月12日）\u003Cdetails>\u003Csummary>[IJCV 2017] 安娜·罗尔巴赫、阿图萨·托拉比、马库斯·罗尔巴赫等。\u003C\u002Fsummary>安娜·罗尔巴赫、阿图萨·托拉比、马库斯·罗尔巴赫、尼凯特·坦东、克里斯托弗·帕尔、于戈·拉罗谢尔、阿伦·库维尔、伯恩特·席勒\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1605.03705)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-278-blue.svg?paper=154c22ca5eef149aedc8a986fa684ca1fd14e7dc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F154c22ca5eef149aedc8a986fa684ca1fd14e7dc)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.mpi-inf.mpg.de\u002Fdepartments\u002Fcomputer-vision-and-machine-learning\u002Fresearch\u002Fvision-and-language\u002Fmpii-movie-description-dataset)\n\n\n+ **MSR-VTT：用于连接视频与语言的大型视频描述数据集**（2016年6月）\u003Cdetails>\u003Csummary>[CVPR 2016] 
徐军、梅涛、姚婷等。\u003C\u002Fsummary>徐军、梅涛、姚婷、芮勇\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_cvpr_2016\u002Fpapers\u002FXu_MSR-VTT_A_Large_CVPR_2016_paper.pdf)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1395-blue.svg?paper=b8e2e9f3ba008e28257195ec69a00e07f260131d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb8e2e9f3ba008e28257195ec69a00e07f260131d)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcrux82\u002Fmsr-vtt-it.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fcrux82\u002Fmsr-vtt-it)\n\n\n\n\n\n\n\n## 3D生成\n\n### 🔅 基于大语言模型\n+ **SceneCraft：用于以Blender代码合成3D场景的大语言模型代理**（2024年3月2日）\u003Cdetails>\u003Csummary>胡子宇、艾哈迈特·伊斯琴、阿希·贾因等。\u003C\u002Fsummary>胡子宇、艾哈迈特·伊斯琴、阿希·贾因、托马斯·基普夫、伊桑·岳、大卫·A·罗斯、科黛莉亚·施密德、阿里雷扎·法蒂\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.01248v1)\n\n\n+ **MotionScript：用于富有表现力的3D人体动作的自然语言描述**（2023年12月19日）\u003Cdetails>\u003Csummary>帕亚姆·乔梅·亚兹迪安、埃里克·刘、李成等。\u003C\u002Fsummary>帕亚姆·乔梅·亚兹迪安、埃里克·刘、李成、安杰莉卡·林\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.12634)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=816792e66f463be2aa1888e4ecb51f8fb2b4dd79)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F816792e66f463be2aa1888e4ecb51f8fb2b4dd79)\n\n+ **HOLODECK：语言引导的3D具身AI环境生成**（2023年12月19日）\u003Cdetails>\u003Csummary>[CVPR 2024] 杨悦、孙凡云、卢卡·魏斯等。\u003C\u002Fsummary>杨悦、孙凡云、卢卡·魏斯、伊莱·范德比尔特、阿尔瓦罗·埃拉斯特、温森·韩、吴嘉俊、尼克·哈伯、兰贾伊·克里希纳、刘凌洁、克里斯·卡利森-伯奇、马克·亚茨卡尔、阿尼鲁达·肯布哈维、克里斯托弗·克拉克\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.09067)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=1dbc2cdcae3e17c3d721d12a5a2d98ced727681a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1dbc2cdcae3e17c3d721d12a5a2d98ced727681a)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002FHolodeck.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fallenai\u002FHolodeck)\n\n+ **PoseGPT：关于3D人体姿态的对话**（2023年11月30日）\u003Cdetails>\u003Csummary>[CVPR 2024] 冯耀、林静、萨伊·库马尔·德维迪等。\u003C\u002Fsummary>冯耀、林静、萨伊·库马尔·德维迪、孙宇、普里扬卡·帕特尔、迈克尔·J·布莱克\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.18836)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=4673c2ac4abb4b055da87171231acb60801ffe74)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4673c2ac4abb4b055da87171231acb60801ffe74)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfeng95\u002FPoseGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyfeng95\u002FPoseGPT)\n\n\n+ 
**3D-GPT：利用大型语言模型进行程序化3D建模**（2023年10月19日）\u003Cdetails>\u003Csummary>孙春毅*、韩俊林*、邓伟健等。\u003C\u002Fsummary>孙春毅、韩俊林、邓伟健、王新龙、秦志山、斯蒂芬·古尔德\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.12945)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=588930cdd801f335b5e524d13f99aa94136a20a0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F588930cdd801f335b5e524d13f99aa94136a20a0)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChuny1\u002F3DGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FChuny1\u002F3DGPT)\n\n\n\n\n### 非基于大语言模型（CLIP\u002FT5）\n+ **DreamPolisher：通过几何扩散实现高质量文本到3D生成**（2024年3月）\u003Cdetails>\u003Csummary>林元泽、罗纳德·克拉克、菲利普·托尔。\u003C\u002Fsummary>林元泽、罗纳德·克拉克、菲利普·托尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.17237)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=72e54db6eebb99d4039cd66cb5dad6a40b31cf87)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F72e54db6eebb99d4039cd66cb5dad6a40b31cf87)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyuanze-lin\u002FDreamPolisher.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyuanze-lin\u002FDreamPolisher)\n\n+ **Consistent3D：基于确定性采样先验的一致性高保真文本到3D生成**（2024年1月）\u003Cdetails>\u003Csummary>[CVPR 2024] 吴子科、周攀、易轩宇等。\u003C\u002Fsummary>吴子科、周攀、易轩宇、袁晓东、张汉旺\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09050)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=834f595fb25b8306b46f9e744ef8150f4971322f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F834f595fb25b8306b46f9e744ef8150f4971322f)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsail-sg\u002FConsistent3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002FConsistent3D)\n\n+ **AToM：使用2D扩散的摊销式文本到网格生成**（2024年2月1日）\u003Cdetails>\u003Csummary>钱国成、曹俊立、阿利亚克桑德尔·西亚罗欣等。\u003C\u002Fsummary>钱国成、曹俊立、阿利亚克桑德尔·西亚罗欣、雅什·坎特、王超阳、米哈伊尔·瓦西利科夫斯基、李欣莹、方宇威、伊万·斯科罗霍多夫、庄培业、伊戈尔·吉利琴斯基、任坚、伯纳德·加内姆、克菲尔·阿伯曼、谢尔盖·图利亚科夫\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.00867)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=ce2edcc6a0ef7592c385c5d8fd0924f79707e223)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fce2edcc6a0ef7592c385c5d8fd0924f79707e223)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsnap-research\u002FAToM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsnap-research\u002FAToM)\n\n\n+ **DreamControl：基于3D自先验的可控文本到3D生成**（2023年12月）\u003Cdetails>\u003Csummary>黄天宇、曾义涵、张志陆等。\u003C\u002Fsummary>[CVPR 
2024]黄天宇、曾义涵、张志陆、许婉、许航、许松岑、劳仁森·W·H、左望盟\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06439)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=c15380dcda5a010827e3b014dcebe95b1218c680)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc15380dcda5a010827e3b014dcebe95b1218c680)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftyhuang0428\u002FDreamControl.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ftyhuang0428\u002FDreamControl)\n\n\n+ **UniDream：用于可重光照文本到3D生成的统一扩散先验**（2023年12月14日）\u003Cdetails>\u003Csummary>刘泽翔、李阳光、林友田等。\u003C\u002Fsummary>刘泽翔、李阳光、林友田、辛宇、彭思达、曹燕佩、齐小娟、黄晓水、梁丁、欧阳万里\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08754)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=210e63599d49abdb848a4440d4244cdcdedeadff)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F210e63599d49abdb848a4440d4244cdcdedeadff)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYG256Li\u002FUniDream.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYG256Li\u002FUniDream)\n\n+ **Sherpa3D：通过粗略的3D先验增强高保真文本到3D生成**（2023年12月11日）\u003Cdetails>\u003Csummary>[CVPR 2024] 刘方富、吴典坤、魏毅等 \u003C\u002Fsummary>刘方富、吴典坤、魏毅、饶勇明、段悦琪\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06655)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=fe1b3f0d074974ce946f10f3bbf52e8351bc0156)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffe1b3f0d074974ce946f10f3bbf52e8351bc0156)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fliuff19\u002FSherpa3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fliuff19\u002FSherpa3D)\n\n+ **学习优化去噪分数以用于3D生成：NeRF与3D高斯溅射上的统一且改进的扩散先验**（2023年12月8日）\u003Cdetails>\u003Csummary>杨晓峰、陈怡文、陈诚等 \u003C\u002Fsummary>杨晓峰、陈怡文、陈诚、张驰、许毅、杨旭雷、刘法耀、林国生\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04820)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=6d10f9b0e0a579a1359df7dfbdef00bc798d5714)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6d10f9b0e0a579a1359df7dfbdef00bc798d5714)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangxiaofeng\u002FLODS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyangxiaofeng\u002FLODS)\n\n+ **DreamPropeller：通过并行采样加速文本到3D生成**（2023年11月28日）\u003Cdetails>\u003Csummary>周林奇、Andy Shih、孟晨琳等 \u003C\u002Fsummary>周林奇、Andy Shih、孟晨琳、斯特凡诺·埃尔蒙\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16918)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=e88d5399956c9d9519a5cfd49308b7d439167543)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe88d5399956c9d9519a5cfd49308b7d439167543)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falexzhou907\u002FDreamPropeller.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Falexzhou907\u002FDreamPropeller)\n\n+ **RichDreamer：一种可泛化的法线-深度扩散模型，用于提升文本到3D生成中的细节丰富度**（2023年11月28日）\u003Cdetails>\u003Csummary>[CVPR 2024] 
邱凌腾、陈冠英、顾晓东等 \u003C\u002Fsummary>邱凌腾、陈冠英、顾晓东、左琦、徐牧天、吴雨霜、袁伟浩、董子龙、薄立峰、韩晓光\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17082)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=cf60a639add75cdea4273697269ee463024b7926)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcf60a639add75cdea4273697269ee463024b7926)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmodelscope\u002Frichdreamer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Frichdreamer)\n\n+ **DreamAvatar：基于扩散模型的文本与形状引导的3D人体化身生成**（2023年11月30日）\u003Cdetails>\u003Csummary>[CVPR 2024] 曹宇康、曹燕佩、韩凯等 \u003C\u002Fsummary>曹宇康、曹燕佩、韩凯、山英、王宽义\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.00916)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-69-blue.svg?paper=0fa1501c7378a0dca2ac913fce9dcdcc2b1958a7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0fa1501c7378a0dca2ac913fce9dcdcc2b1958a7)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyukangcao\u002FDreamAvatar.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyukangcao\u002FDreamAvatar)\n\n+ **LucidDreamer：通过区间分数匹配实现高保真文本到3D生成**（2023年12月2日）\u003Cdetails>\u003Csummary>[CVPR 2024] 梁一迅、杨欣、林建涛等 \u003C\u002Fsummary>梁一迅、杨欣、林建涛、李浩东、徐晓刚、陈颖聪\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.11284)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=6f709278506813d04a074e6fa20188cce9bb927b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6f709278506813d04a074e6fa20188cce9bb927b)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEnVision-Research\u002FLucidDreamer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FEnVision-Research\u002FLucidDreamer)\n\n\n+ **GaussianDreamer：通过桥接2D与3D扩散模型，实现从文本到3D高斯点云的快速生成**（2023年10月12日）\u003Cdetails>\u003Csummary>[CVPR 2024] 易涛然、方继民、王俊杰等 \u003C\u002Fsummary>易涛然、方继民、王俊杰、吴冠军、谢凌熙、张小鹏、刘文宇、田齐、王兴刚\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08529)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-230-blue.svg?paper=c5e9fd131cde68c218d0ea69cd617a67c7f35d42)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc5e9fd131cde68c218d0ea69cd617a67c7f35d42)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhustvl\u002FGaussianDreamer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FGaussianDreamer)\n\n+ **使用高斯溅射进行文本到3D生成**（2023年9月28日）\u003Cdetails>\u003Csummary>[CVPR 2024] 陈子龙、王峰、刘华萍 \u003C\u002Fsummary>陈子龙、王峰、刘华萍\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.16585)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-57-blue.svg?paper=86b5318b0a69ccdeec17abb0120e4bd7688a4b59)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F86b5318b0a69ccdeec17abb0120e4bd7688a4b59)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgsgen3d\u002Fgsgen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgsgen3d\u002Fgsgen)\n\n+ 
**EfficientDreamer：通过正交视图扩散先验实现高保真且鲁棒的3D创作**（2023年9月10日）\u003Cdetails>\u003Csummary>[CVPR 2024] 胡志鹏、赵敏达、赵超逸等 \u003C\u002Fsummary>胡志鹏、赵敏达、赵超逸、梁新悦、李林成、赵增、范昌杰、周晓伟、于鑫\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.13223)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-23-blue.svg?paper=fec17239569efd6914f0df9e25b66b310969d3c5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffec17239569efd6914f0df9e25b66b310969d3c5)\n\n+ **TADA！文本到可动画数字化身**（2023年8月21日）\u003Cdetails>\u003Csummary>[3DV 2024] 廖婷婷、易洪伟、修玉良等 \u003C\u002Fsummary>廖婷婷、易洪伟、修玉良、唐家兴、黄阳毅、Justus Thies、迈克尔·J·布莱克\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.10899)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-36-blue.svg?paper=303f466fb823112f79a9f36637c7084dd8363fc5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F303f466fb823112f79a9f36637c7084dd8363fc5)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTingtingLiao\u002FTADA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTingtingLiao\u002FTADA)\n\n+ **SweetDreamer：在 2D 扩散模型中对齐几何先验以实现一致的文本到 3D 生成**（2023 年 10 月 20 日）\u003Cdetails>\u003Csummary>[ICLR 2024] 李伟宇、陈睿、陈雪琳等\u003C\u002Fsummary>李伟宇、陈睿、陈雪琳、谭平\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02596)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-29-blue.svg?paper=438e9fb79c9e37d43223e61bb575ebd2dae0b0a7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F438e9fb79c9e37d43223e61bb575ebd2dae0b0a7)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwyysf-98\u002FSweetDreamer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fwyysf-98\u002FSweetDreamer)\n\n+ **无噪声分数蒸馏**（2023 年 10 月 26 日）\u003Cdetails>\u003Csummary>[ICLR 2024] 奥伦·卡齐尔、奥尔·帕塔什尼克、丹尼尔·科恩-奥尔等\u003C\u002Fsummary>奥伦·卡齐尔、奥尔·帕塔什尼克、丹尼尔·科恩-奥尔、达尼·利希金斯基\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.17590)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-14-blue.svg?paper=85a70c0a048cba4f53dcf332ee73f6032a2e53bc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F85a70c0a048cba4f53dcf332ee73f6032a2e53bc)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Forenkatzir\u002Fnfsd.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Forenkatzir\u002Fnfsd)\n\n+ **基于分类器分数蒸馏的文本到 3D 生成**（2023 年 10 月 26 日）\u003Cdetails>\u003Csummary>[ICLR 2024] 辛宇、郭元辰、李阳光等\u003C\u002Fsummary>辛宇、郭元辰、李阳光、丁亮、张松海、戚小娟\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.19415)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=4e21879b564cc2e803b16edf0dda9f1edb91b497)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4e21879b564cc2e803b16edf0dda9f1edb91b497)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCVMI-Lab\u002FClassifier-Score-Distillation.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FCVMI-Lab\u002FClassifier-Score-Distillation)\n\n+ **HiFA：基于高级扩散引导的高保真文本到 3D 生成**（2023 年 11 月 28 日）\u003Cdetails>\u003Csummary>[ICLR 2024] 
朱俊哲、庄培业\u003C\u002Fsummary>朱俊哲、庄培业\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18766)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-17-blue.svg?paper=daf3b117f789b2b95223e58592979fb57627515e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdaf3b117f789b2b95223e58592979fb57627515e)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJunzheJosephZhu\u002FHiFA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FJunzheJosephZhu\u002FHiFA)\n\n+ **MVDream：用于 3D 生成的多视角扩散模型**（2023 年 8 月 31 日）\u003Cdetails>\u003Csummary>[ICLR 2024] 史一春、王鹏、叶江龙等\u003C\u002Fsummary>史一春、王鹏、叶江龙、麦龙、李克杰、肖洋\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.16512)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-154-blue.svg?paper=9aa01997226b5c4d705ae2e2f52c32681006654b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9aa01997226b5c4d705ae2e2f52c32681006654b)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FMVDream.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FMVDream)\n\n+ **DreamGaussian：用于高效 3D 内容创作的生成式高斯泼溅**（2023 年 9 月 28 日）\u003Cdetails>\u003Csummary>[ICLR 2024] 唐家祥、任嘉伟、周航等\u003C\u002Fsummary>唐家祥、任嘉伟、周航、刘子威、曾刚\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.16653)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-99-blue.svg?paper=cc1a674bb164d09a060cf5b26fe518c02fae0ddc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcc1a674bb164d09a060cf5b26fe518c02fae0ddc)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdreamgaussian\u002Fdreamgaussian.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdreamgaussian\u002Fdreamgaussian)\n\n+ **让 2D 扩散模型了解 3D 一致性，以实现稳健的文本到 3D 生成**（2023 年 4 月 11 日）\u003Cdetails>\u003Csummary>[ICLR 2024] 徐俊英、张宇锡、郭敏燮等\u003C\u002Fsummary>徐俊英、张宇锡、郭敏燮、金贤洙、高在勋、金俊浩、金镇华、李智英、金承龙\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.07937)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-81-blue.svg?paper=5356c3dac654854a0842753bcc2e3433dc4a2afd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5356c3dac654854a0842753bcc2e3433dc4a2afd)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Feladrich\u002Flatent-nerf.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Feladrich\u002Flatent-nerf)\n\n\n+ **IT3D：通过显式视图合成改进的文本到 3D 生成**（2023 年 8 月 22 日）\u003Cdetails>\u003Csummary>[AAAI 2024] 陈义文、张驰、杨晓峰等\u003C\u002Fsummary>陈义文、张驰、杨晓峰、蔡中刚、于刚、杨磊、林国胜\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.11473)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-30-blue.svg?paper=2b94785cbfd865a01cc68d7d4c7500b710e5e2fb)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2b94785cbfd865a01cc68d7d4c7500b710e5e2fb)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbuaacyw\u002FIT3D-text-to-3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fbuaacyw\u002FIT3D-text-to-3D)\n\n\n+ **HD-Fusion：利用多重噪声估计实现细节丰富的文本到 3D 生成**（2023 年 7 月 30 
日）\u003Cdetails>\u003Csummary>[WACV 2024] 吴金波、高晓波、刘星等\u003C\u002Fsummary>吴金波、高晓波、刘星、沈正阳、赵晨、冯浩成、刘景涛、丁尔瑞\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16183)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=d8aaed01dffc621488aecbb0ef01b50f86e44bc1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd8aaed01dffc621488aecbb0ef01b50f86e44bc1)\n\n\n+ **重新构想负提示算法：将 2D 扩散模型转化为 3D，缓解雅努斯问题及更多**（2023 年 4 月 11 日）\u003Cdetails>\u003Csummary>穆罕默德雷扎·阿曼德普尔、阿里·萨德吉安、郑黄杰等\u003C\u002Fsummary>穆罕默德雷扎·阿曼德普尔、阿里·萨德吉安、郑黄杰、阿米尔·萨德吉安、周明远\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.04968)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-50-blue.svg?paper=b19ca192a5bebbc3473be61989baf085ff21daa5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb19ca192a5bebbc3473be61989baf085ff21daa5)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPerp-Neg\u002FPerp-Neg-stablediffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPerp-Neg\u002FPerp-Neg-stablediffusion)\n\n+ **基于潜在NeRF的形状引导三维形状与纹理生成**（2022年11月14日）\u003Cdetails>\u003Csummary>[CVPR 2023] Gal Metzer、Elad Richardson、Or Patashnik等\u003C\u002Fsummary>Gal Metzer、Elad Richardson、Or Patashnik、Raja Giryes、Daniel Cohen-Or\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.07600)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-212-blue.svg?paper=793939b83e10903f58d8edbb7534963df627a1fe)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F793939b83e10903f58d8edbb7534963df627a1fe)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Feladrich\u002Flatent-nerf.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Feladrich\u002Flatent-nerf)\n\n+ **Magic3D：高分辨率文本到3D内容创作**（2022年11月18日）\u003Cdetails>\u003Csummary>[CVPR 2023亮点] Chen-Hsuan Lin、Jun Gao、Luming Tang等\u003C\u002Fsummary>Chen-Hsuan Lin、Jun Gao、Luming Tang、Towaki Takikawa、Xiaohui Zeng、Xun Huang、Karsten Kreis、Sanja Fidler、Ming-Yu Liu、Tsung-Yi Lin\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FLin_Magic3D_High-Resolution_Text-to-3D_Content_Creation_CVPR_2023_paper.pdf)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-466-blue.svg?paper=bdf4af8311637c681904e71cf50f96fd0026f578)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbdf4af8311637c681904e71cf50f96fd0026f578)\n\n+ **分数雅可比链：将预训练的2D扩散模型扩展用于3D生成**（2022年12月1日）\u003Cdetails>\u003Csummary>[CVPR 2023] Haochen Wang、Xiaodan Du、Jiahao Li等\u003C\u002Fsummary>Haochen Wang、Xiaodan Du、Jiahao Li、Raymond A. 
Yeh、Greg Shakhnarovich\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.00774)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-238-blue.svg?paper=fc011ed5ee986332523a62d2783adee1179dc1ed)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffc011ed5ee986332523a62d2783adee1179dc1ed)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpals-ttic\u002Fsjc.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fpals-ttic\u002Fsjc)\n\n+ **基于自然语言描述的高保真3D人脸生成**（2023年5月5日）\u003Cdetails>\u003Csummary>[CVPR 2023] Menghua Wu、Hao Zhu、Linjia Huang等\u003C\u002Fsummary>Menghua Wu、Hao Zhu、Linjia Huang、Yiyu Zhuang、Yuanxun Lu、Xun Cao\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.03302)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=012d7d3ee690e5acadf416787651a8fe425e8eb3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F012d7d3ee690e5acadf416787651a8fe425e8eb3)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhuhao-nju\u002Fdescribe3d.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzhuhao-nju\u002Fdescribe3d)\n\n+ **RODIN：利用扩散技术雕刻3D数字化身的生成模型**（2022年12月12日）\u003Cdetails>\u003Csummary>[CVPR 2023亮点] Tengfei Wang、Bo Zhang、Ting Zhang等\u003C\u002Fsummary>Tengfei Wang、Bo Zhang、Ting Zhang、Shuyang Gu、Jianmin Bao、Tadas Baltrusaitis、Jingjing Shen、Dong Chen、Fang Wen、Qifeng Chen、Baining Guo\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.06135)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-158-blue.svg?paper=7e993a9ca01dcd4538362454aaac29a18a63c000)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7e993a9ca01dcd4538362454aaac29a18a63c000)\n\n+ **ClipFace：基于文本指导的带纹理3D可变形模型编辑**（2023年4月24日）\u003Cdetails>\u003Csummary>[SIGGRAPH 2023] Shivangi Aneja、Justus Thies、Angela Dai等\u003C\u002Fsummary>Shivangi Aneja、Justus Thies、Angela Dai、Matthias Nießner\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.01406)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-40-blue.svg?paper=f21e8eddf42580d1f38a11ec5acd8891c0454a1f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff21e8eddf42580d1f38a11ec5acd8891c0454a1f)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshivangi-aneja\u002FClipFace.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshivangi-aneja\u002FClipFace)\n\n+ **DreamFusion：使用2D扩散实现文本到3D**（2022年9月29日）\u003Cdetails>\u003Csummary>[ICLR 2023口头报告] Ben Poole、Ajay Jain、Jonathan T. Barron等\u003C\u002Fsummary>Ben Poole、Ajay Jain、Jonathan T. 
Barron、Ben Mildenhall\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.14988)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-908-blue.svg?paper=4c94d04afa4309ec2f06bdd0fe3781f91461b362)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4c94d04afa4309ec2f06bdd0fe3781f91461b362)\n\n+ **ProlificDreamer：通过变分分数蒸馏实现高保真且多样化的文本到3D生成**（2023年5月25日）\u003Cdetails>\u003Csummary>[NeurIPS 2023 Spotlight] Zhengyi Wang、Cheng Lu、Yikai Wang等\u003C\u002Fsummary>Zhengyi Wang、Cheng Lu、Yikai Wang、Fan Bao、Chongxuan Li、Hang Su、Jun Zhu\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.16213)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-230-blue.svg?paper=c5e9fd131cde68c218d0ea69cd617a67c7f35d42)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc5e9fd131cde68c218d0ea69cd617a67c7f35d42)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002Fprolificdreamer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002Fprolificdreamer)\n\n+ **HeadSculpt：用文本打造3D头部化身**（2023年5月25日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] Xiao Han、Yukang Cao、Kai Han等\u003C\u002Fsummary>Xiao Han、Yukang Cao、Kai Han、Xiatian Zhu、Jiankang Deng、Yi-Zhe Song、Tao Xiang、Kwan-Yee K. Wong\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.03038)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-230-blue.svg?paper=4e8cf9602d4ef714dcdb8580de40e1a2a717ab11)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4e8cf9602d4ef714dcdb8580de40e1a2a717ab11)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBrandonHanx\u002FHeadSculpt.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FBrandonHanx\u002FHeadSculpt)\n\n+ **ATT3D：摊销式文本到3D物体合成**（2023年6月6日）\u003Cdetails>\u003Csummary>[ICCV 2023] Jonathan Lorraine、Kevin Xie、Xiaohui Zeng等\u003C\u002Fsummary>Jonathan Lorraine、Kevin Xie、Xiaohui Zeng、Chen-Hsuan Lin、Towaki Takikawa、Nicholas Sharp、Tsung-Yi Lin、Ming-Yu Liu、Sanja Fidler、James Lucas\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.07349)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-25-blue.svg?paper=1e8403af2e1e7a8f803d8df9e8daac584f99c2a0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1e8403af2e1e7a8f803d8df9e8daac584f99c2a0)\n\n+ **Fantasia3D：解耦几何与外观以实现高质量文本到3D内容生成**（2023年3月24日）\u003Cdetails>\u003Csummary>[ICCV 2023] 陈睿、陈勇伟、焦宁欣等\u003C\u002Fsummary>陈睿、陈勇伟、焦宁欣、贾奎\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.13873)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-202-blue.svg?paper=0cbb518c364067200476a51e5ce7476a4f582770)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0cbb518c364067200476a51e5ce7476a4f582770)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGorilla-Lab-SCUT\u002FFantasia3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FGorilla-Lab-SCUT\u002FFantasia3D)\n\n+ **Text2Room：从2D文本到图像模型中提取带纹理的3D网格**（2023年9月10日）\u003Cdetails>\u003Csummary>[ICCV 2023] 
卢卡斯·霍莱因、曹昂、安德鲁·欧文斯等\u003C\u002Fsummary>卢卡斯·霍莱因、曹昂、安德鲁·欧文斯、贾斯汀·约翰逊、马蒂亚斯·尼斯纳尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.11989)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-55-blue.svg?paper=95aa6fa4e42387561cff22378348d528adea37f2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F95aa6fa4e42387561cff22378348d528adea37f2)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FlukasHoel\u002Ftext2room.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FlukasHoel\u002Ftext2room)\n\n+ **X-Mesh：通过动态文本引导实现快速且准确的文本驱动3D风格化**（2023年3月28日）\u003Cdetails>\u003Csummary>[ICCV 2023] 马毅伟、张晓青、孙晓帅等\u003C\u002Fsummary>马毅伟、张晓青、孙晓帅、季佳怡、王浩伟、蒋冠楠、庄伟林、季荣荣\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.15764)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-18-blue.svg?paper=f8bf2225a2993e3ead73d886b5797378d6e53186)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff8bf2225a2993e3ead73d886b5797378d6e53186)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxmu-xiaoma666\u002FX-Mesh.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fxmu-xiaoma666\u002FX-Mesh)\n\n+ **StyleAvatar3D：利用图文扩散模型生成高保真3D虚拟形象**（2023年5月31日）\u003Cdetails>\u003Csummary>张驰、陈艺文、傅一军等\u003C\u002Fsummary>张驰、陈艺文、傅一军、周正林、于刚、Billzb Wang、傅斌、陈涛、林国生、沈春华\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.19012)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=b980d98c81252dfbed334728c46625e58f54dd9d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb980d98c81252dfbed334728c46625e58f54dd9d)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ficoz69\u002FStyleAvatar3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ficoz69\u002FStyleAvatar3D)\n\n+ **TextMesh：根据文本提示生成逼真的3D网格**（2023年4月24日）\u003Cdetails>\u003Csummary>[3DV 2023] 克里斯蒂娜·察利科格鲁、法比安·曼哈特、阿莱西奥·托尼奥尼等\u003C\u002Fsummary>克里斯蒂娜·察利科格鲁、法比安·曼哈特、阿莱西奥·托尼奥尼、迈克尔·尼迈耶、费德里科·汤巴里\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.12439)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-67-blue.svg?paper=2c6392491b6a942e08db46c8fff0ef5ba1fd9de8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2c6392491b6a942e08db46c8fff0ef5ba1fd9de8)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthreestudio-project\u002Fthreestudio.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthreestudio-project\u002Fthreestudio)\n\n+ **Clip-forge：迈向零样本文本到形状生成**（2022年4月28日）\u003Cdetails>\u003Csummary>[CVPR 2022] 
阿迪提亚·桑吉、朱航、约瑟夫·G·兰伯恩等\u003C\u002Fsummary>阿迪提亚·桑吉、朱航、约瑟夫·G·兰伯恩、王晔、程锦义、马可·富梅罗、卡马尔·拉希米·马莱克尚\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.02624)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-197-blue.svg?paper=738e3e0623054da29dc57fc6aee5e6711867c4e8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F738e3e0623054da29dc57fc6aee5e6711867c4e8)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAutodeskAILab\u002FClip-Forge.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAutodeskAILab\u002FClip-Forge)\n\n+ **基于Dream Fields的零样本文本引导对象生成**（2021年12月2日）\u003Cdetails>\u003Csummary>[CVPR 2022] 阿杰·贾因、本·米尔登霍尔、乔纳森·T·巴伦等\u003C\u002Fsummary>阿杰·贾因、本·米尔登霍尔、乔纳森·T·巴伦、皮特·阿贝尔、本·普尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.01455)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-345-blue.svg?paper=03e1c3b5fdad9b21bbed3d13af7e8d6c73cbcfa6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F03e1c3b5fdad9b21bbed3d13af7e8d6c73cbcfa6)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fajayj.com\u002Fdreamfields)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle-research\u002Fgoogle-research.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002F)\n\n+ **Text2Mesh：面向网格的文本驱动神经风格化**（2021年12月6日）\u003Cdetails>\u003Csummary>[CVPR 2022] 奥斯卡·米歇尔、罗伊·巴尔-翁、理查德·刘等\u003C\u002Fsummary>奥斯卡·米歇尔、罗伊·巴尔-翁、理查德·刘、萨吉·贝奈姆、拉娜·哈诺卡\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.03221)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-222-blue.svg?paper=d15b27edf3630728cdb40f49946365d9011641cf)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd15b27edf3630728cdb40f49946365d9011641cf)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthreedle\u002Ftext2mesh.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthreedle\u002Ftext2mesh)\n\n+ **TANGO：通过光照分解实现文本驱动的写实且稳健的3D风格化**（2022年10月20日）\u003Cdetails>\u003Csummary>[NeurIPS 2022 Spotlight] 陈勇伟、陈睿、雷家宝等\u003C\u002Fsummary>陈勇伟、陈睿、雷家宝、张亚彬、贾奎\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.11277)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-47-blue.svg?paper=44e49f72fb6b97f52c25a30f0adc68c2384430ba)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F44e49f72fb6b97f52c25a30f0adc68c2384430ba)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGorilla-Lab-SCUT\u002Ftango.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FGorilla-Lab-SCUT\u002Ftango)\n\n+ **CLIP-Mesh：利用预训练的图文模型从文本生成带纹理的网格模型**（2022年3月24日） \u003Cdetails>\u003Csummary>[SIGGRAPH ASIA 2022] 纳西尔·穆罕默德·哈立德、谢天浩、尤金·贝利洛夫斯基等 
\u003C\u002Fsummary>纳西尔·穆罕默德·哈立德、谢天浩、尤金·贝利洛夫斯基、蒂贝里乌·波帕\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.13333)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-167-blue.svg?paper=8941e477b2f39eb92712f04400412da60d349ec1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8941e477b2f39eb92712f04400412da60d349ec1)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNasirKhalid24\u002FCLIP-Mesh.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNasirKhalid24\u002FCLIP-Mesh)\n\n+ **MotionCLIP：将人体运动生成引入CLIP空间**（2022年3月15日） \u003Cdetails>\u003Csummary>[ECCV 2022] 盖伊·特韦特、布赖恩·戈登、阿米尔·赫兹等 \u003C\u002Fsummary>盖伊·特韦特、布赖恩·戈登、阿米尔·赫兹、阿米特·H·伯曼诺、丹尼尔·科恩-奥尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.08063)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-140-blue.svg?paper=e82df4b6a3628501fce67835ad8316d6525ad133)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe82df4b6a3628501fce67835ad8316d6525ad133)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGuyTevet\u002FMotionCLIP.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FGuyTevet\u002FMotionCLIP)\n\n  \n\n\n### 数据集\n\n+ **Objaverse-XL：包含1000多万个3D对象的宇宙**（2023年7月11日） \u003Cdetails>\u003Csummary>马特·戴特克、刘若诗、马修·沃灵福德等 \u003C\u002Fsummary>马特·戴特克、刘若诗、马修·沃灵福德、黄芳、奥斯卡·米歇尔、阿迪提亚·库苏帕蒂、艾伦·范、克里斯蒂安·拉福尔特、维克拉姆·沃莱蒂、萨米尔·伊扎克·加德雷、埃利·范德比尔特、阿尼鲁达·肯巴维、卡尔·冯德里克、乔治娅·吉奥克萨里、基安娜·埃赫萨尼、路德维希·施密特、阿里·法尔哈迪\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.05663)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-70-blue.svg?paper=1b90e9e9734bed6b379ae87d688cb3b887baf597)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1b90e9e9734bed6b379ae87d688cb3b887baf597)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Fobjaverse-xl.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fobjaverse-xl)\n\n+ **Objaverse：一个包含标注3D对象的宇宙**（2022年12月15日） \u003Cdetails>\u003Csummary>[CVPR 2023] 马特·戴特克、达斯汀·施文克、乔迪·萨尔瓦多等 \u003C\u002Fsummary>马特·戴特克、达斯汀·施文克、乔迪·萨尔瓦多、卢卡·魏斯、奥斯卡·米歇尔、埃利·范德比尔特、路德维希·施密特、基安娜·埃赫萨尼、阿尼鲁达·肯巴维、阿里·法尔哈迪\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.08051)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-230-blue.svg?paper=1b31dbf44e68b698120552366df03e6e35a1e428)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1b31dbf44e68b698120552366df03e6e35a1e428)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Fobjaverse-xl.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fobjaverse-xl)\n\n \n## 音频生成\n\n### 🔅 基于大语言模型\n\n+ **SongComposer：用于歌曲创作中歌词与旋律编写的大型语言模型**（2024年2月27日）\u003Cdetails>\u003Csummary>丁双锐、刘子涵、董晓艺等 
\u003C\u002Fsummary>丁双锐、刘子涵、董晓艺、张攀、钱睿、何聪辉、林大华、王佳琪\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17645)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=e7c8a74423a5811a3aac5f33001fce32d2e2386c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe7c8a74423a5811a3aac5f33001fce32d2e2386c)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fpjlab-songcomposer.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpjlab-songcomposer\u002Fsongcomposer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fpjlab-songcomposer\u002Fsongcomposer)\n\n+ **ChatMusician：利用大语言模型内在地理解并生成音乐**（2024年2月25日）\u003Cdetails>\u003Csummary>袁瑞斌、林汉峰、王毅等 \u003C\u002Fsummary>袁瑞斌、林汉峰、王毅、田泽悦、吴尚达、沈天浩、张戈、吴宇航、刘聪、周子雅、马子洋、薛柳萌、王子宇、刘秦、郑天宇、李一志、马英豪、梁一鸣、迟晓伟、刘瑞博、王子力、李鹏飞、吴景成、林成华、刘启峰、蒋涛、黄文浩、陈文虎、埃马努埃尔·贝内托斯、傅杰、夏古斯、罗杰·丹嫩伯格、薛伟、康世银、郭义克\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.16153)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=48494aa30f35a64858644aba839c8cba38c0cf2a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F48494aa30f35a64858644aba839c8cba38c0cf2a)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fshanghaicannon.github.io\u002FChatMusician\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhf-lin\u002FChatMusician.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhf-lin\u002FChatMusician)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fm-a-p\u002FChatMusician)\n\n+ **AnyGPT：具有离散序列建模能力的统一多模态大语言模型**（2024年2月19日）\u003Cdetails>\u003Csummary>詹俊、戴俊奇、叶嘉盛等 \u003C\u002Fsummary>詹俊、戴俊奇、叶嘉盛、周云华、张东、刘志庚、张欣、袁瑞斌、张戈、李林阳、严航、傅杰、桂涛、孙天翔、姜宇刚、邱锡鹏\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12226)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=14191e9f12913ad8c7ac6e1188682afac04aad09)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F14191e9f12913ad8c7ac6e1188682afac04aad09)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjunzhan2000.github.io\u002FAnyGPT.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenMOSS\u002FAnyGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FAnyGPT)\n\n+ **提升大语言模型在语音合成中的性能：一项实证研究**（2023年12月30日）\u003Cdetails>\u003Csummary>郝洪坤、周龙、刘淑洁等 \u003C\u002Fsummary>郝洪坤、周龙、刘淑洁、李金宇、胡淑洁、王睿、魏富儒\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.00246)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=c1dd77e48dd615ee6881b2cc876a00a92cae6eac)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc1dd77e48dd615ee6881b2cc876a00a92cae6eac)\n\n+ **Unified-IO 2: 
使用视觉、语言、音频和动作扩展自回归多模态模型**（2023年12月28日）\u003Cdetails>\u003Csummary>卢嘉森、克里斯托弗·克拉克、李尚浩等\u003C\u002Fsummary>卢嘉森、克里斯托弗·克拉克、李尚浩、张子辰、萨维亚·科斯拉、瑞安·马滕、德里克·霍伊姆、阿尼鲁达·肯布哈维\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17172)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=6c64ddd2190909de2c680dd18abc9b92e80c39f9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6c64ddd2190909de2c680dd18abc9b92e80c39f9)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Funified-io-2.allenai.org\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Funified-io-2.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fallenai\u002Funified-io-2)\n\n+ **M2UGen: 利用大型语言模型的力量进行多模态音乐理解与生成**（2023年11月19日）\u003Cdetails>\u003Csummary>阿廷·萨基尔·侯赛因、刘善松、孙晨硕等\u003C\u002Fsummary>阿廷·萨基尔·侯赛因、刘善松、孙晨硕、Ying Shan\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.11255)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=1e84d7c45f70038574fcdb7bc1b20da9b348a092)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1e84d7c45f70038574fcdb7bc1b20da9b348a092)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcrypto-code.github.io\u002FM2UGen-Demo\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshansongliu\u002FM2UGen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshansongliu\u002FM2UGen)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FM2UGen\u002FM2UGen-Demo)\n\n+ **LauraGPT: 使用GPT聆听、关注、理解并再生音频**（2023年10月7日）\u003Cdetails>\u003Csummary>王佳明、杜志豪、陈倩等\u003C\u002Fsummary>王佳明、杜志豪、陈倩、褚云飞、高志福、李泽睿、胡凯、周晓欢、徐进、马子洋、王文、郑思琪、周昌、严志杰、张士良\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.04673)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=ffa05cb5504ba08254f498223f613b3ebcf87692)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fffa05cb5504ba08254f498223f613b3ebcf87692)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flauragpt.github.io\u002F)\n\n+ **LLaSM: 大型语言与语音模型**（2023年8月30日）\u003Cdetails>\u003Csummary>舒宇、董思伟、陈光耀等\u003C\u002Fsummary>舒宇、董思伟、陈光耀、黄文浩、张瑞华、石道臣、向奇奇、史业民\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.15930)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=7b22ecd9f1ced58c1704ac6191e029b98054e330)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7b22ecd9f1ced58c1704ac6191e029b98054e330)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLinkSoul\u002FLLaSM)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLinkSoul-AI\u002FLLaSM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLinkSoul-AI\u002FLLaSM)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLinkSoul\u002FLLaSM)\n\n+ **AudioPaLM: 
一款能说会听的大型语言模型**（2023年6月22日）\u003Cdetails>\u003Csummary>保罗·K·鲁本斯坦、楚拉育·阿萨沃荣猜、杜克·邓·阮等\u003C\u002Fsummary>保罗·K·鲁本斯坦、楚拉育·阿萨沃荣猜、杜克·邓·阮、安库尔·巴普纳、扎兰·博尔索斯、费利克斯·德·绍蒙特·奎特里、彼得·陈、达莉娅·埃尔·巴达维、魏汉、尤金·哈里托诺夫、汉娜·穆肯希尔恩、迪尔克·帕德菲尔德、詹姆斯·秦、丹尼·罗森伯格、塔拉·赛纳特、约翰·沙尔克维克、马特·沙里菲、米歇尔·塔德莫尔·拉马诺维奇、马可·塔利亚萨奇、亚历山德鲁·图多尔、米哈伊洛·韦利米罗维奇、达米安·文森特、于佳辉、王永强、维姬·扎亚茨、尼尔·泽吉杜尔、张宇、张志帅、卢卡斯·齐尔卡、克里斯蒂安·弗兰克\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.12925)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-69-blue.svg?paper=3efb81de24eb88017d6dbcf22cb4215084223fd8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3efb81de24eb88017d6dbcf22cb4215084223fd8)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgoogle-research.github.io\u002Fseanet\u002Faudiopalm\u002Fexamples\u002F)\n\n+ **Pengi: 用于音频任务的音频语言模型**（2023年5月19日）\u003Cdetails>\u003Csummary>索哈姆·德什穆克、本杰明·埃利萨尔德、丽塔·辛格等\u003C\u002Fsummary>索哈姆·德什穆克、本杰明·埃利萨尔德、丽塔·辛格、王华明\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11834)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-35-blue.svg?paper=ad22af138fa1d1490cda0301abf8159a7c30c5a2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fad22af138fa1d1490cda0301abf8159a7c30c5a2)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPengi)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FPengi.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPengi)\n\n+ **Speechgpt: 赋予大型语言模型内在的跨模态对话能力**（2023年5月18日）\u003Cdetails>\u003Csummary>张东、李世敏、张欣等\u003C\u002Fsummary>张东、李世敏、张欣、詹军、王鹏宇、周雅倩、邱希鹏\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11000)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-76-blue.svg?paper=5cac6430bd379c9d2fe13137dfd6ae7721a2679f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5cac6430bd379c9d2fe13137dfd6ae7721a2679f)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002F0nutation.github.io\u002FSpeechGPT.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F0nutation\u002FSpeechGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002F0nutation\u002FSpeechGPT)\n\n+ **通用人工智能的火花：GPT-4的早期实验**（2023年3月22日）\u003Cdetails>\u003Csummary>塞巴斯蒂安·布贝克、瓦伦·钱德拉塞卡兰、罗嫩·埃尔丹等\u003C\u002Fsummary>塞巴斯蒂安·布贝克、瓦伦·钱德拉塞卡兰、罗嫩·埃尔丹、约翰内斯·格尔克、埃里克·霍维茨、埃杰·卡马尔、彼得·李、李银达、李元智、斯科特·伦德伯格、哈尔沙·诺里、哈米德·帕兰吉、马尔科·图利奥·里贝罗、张毅\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12712)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1407-blue.svg?paper=574beee702be3856d60aa482ec725168fe64fc99)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F574beee702be3856d60aa482ec725168fe64fc99)\n\n### 非LLM相关\n+ 
**Audiobox：基于自然语言提示的统一音频生成**（2023年12月25日）\\\n阿普尔·维亚斯、鲍文·施、马修·勒\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.15821)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=f124ae1e4663359193be32adb37b07b3252d5329)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff124ae1e4663359193be32adb37b07b3252d5329)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Faudiobox.metademolab.com\u002F)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Faudiobox.metademolab.com\u002Fcapabilities)\n\n+ **Music ControlNet：用于音乐生成的多时变控制**（2023年11月13日）\u003Cdetails>\u003Csummary>吴世伦、克里斯·多纳休、渡边真司等\u003C\u002Fsummary>吴世伦、克里斯·多纳休、渡边真司、尼古拉斯·J·布莱恩\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.07069)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=42239e71a712d70cd24e06ffc0cf0d22fc628a36)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F42239e71a712d70cd24e06ffc0cf0d22fc628a36)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmusiccontrolnet.github.io\u002Fweb\u002F)\n\n+ **Loop Copilot：用于音乐生成与迭代编辑的AI合奏指挥系统**（2023年10月19日）\u003Cdetails>\u003Csummary>张一骁、前泽明、Gus Xia等\u003C\u002Fsummary>张一骁、前泽明、Gus Xia、山本和彦、西蒙·迪克森\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.12404)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=cca4218dd7c10c1614bbd84aa7cd7e00027bdc7c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcca4218dd7c10c1614bbd84aa7cd7e00027bdc7c)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsites.google.com\u002Fview\u002Floop-copilot)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fldzhangyx\u002Floop-copilot.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fldzhangyx\u002Floop-copilot\u002F)\n\n+ **MusicAgent：基于大型语言模型的音乐理解与生成AI代理**（2023年10月18日）\u003Cdetails>\u003Csummary>于丁瑶、宋凯涛、陆培玲等\u003C\u002Fsummary>于丁瑶、宋凯涛、陆培玲、何天宇、谭旭、叶伟、张士坤、卞江\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11954)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=beaf64df85f8204b8cd89a7f46827608e6d16922)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbeaf64df85f8204b8cd89a7f46827608e6d16922)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fmuzic\u002Ftree\u002Fmain\u002Fmusicagent.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmuzic\u002Ftree\u002Fmain\u002Fmusicagent)\n\n+ 
**UniAudio：面向通用音频生成的音频基础模型**（2023年10月1日）\\\n杨东超、田锦川、谭旭\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00704)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-15-blue.svg?paper=74bfbbb7307a7af2686043ea97ab8b34cb062ba8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F74bfbbb7307a7af2686043ea97ab8b34cb062ba8)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdongchaoyang.top\u002FUniAudio_demo\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangdongchao\u002FUniAudio.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyangdongchao\u002FUniAudio)\n\n+ **AudioLM：一种基于语言模型的音频生成方法**（2022年9月7日）\u003Cdetails>\u003Csummary>扎兰·博尔索斯、拉斐尔·马里尼耶、达米安·文森特等（IEEE\u002FACM音频、语音与语言处理期刊）\u003C\u002Fsummary>扎兰·博尔索斯、拉斐尔·马里尼耶、达米安·文森特、尤金·哈里托诺夫、奥利维埃·皮特坎、马特·沙里菲、多米尼克·罗布莱克、奥利维埃·特布尔、大卫·格兰吉耶、马可·塔利亚萨奇、尼尔·泽吉杜尔\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.03143)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-232-blue.svg?paper=8c870bef01a4fbb20f60722ffc2f6bee3870b18b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8c870bef01a4fbb20f60722ffc2f6bee3870b18b)\n\n+ **Wavjourney：利用大型语言模型进行组合式音频创作**（2023年7月26日）\u003Cdetails>\u003Csummary>刘旭波、朱中凯、刘浩赫等\u003C\u002Fsummary>刘旭波、朱中凯、刘浩赫、袁毅、崔萌、黄秋实、梁金华、曹寅、孔秋强、马克·D·普伦布利、王文武\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.14335)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=aa7bcd1f9453c9096ec78900a7b94e816ed0e1c5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Faa7bcd1f9453c9096ec78900a7b94e816ed0e1c5)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Faudio-agi.github.io\u002FWavJourney_demopage\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAudio-AGI\u002FWavJourney.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAudio-AGI\u002FWavJourney)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAudio-AGI\u002FWavJourney)\n\n+ **探究大型语言模型中的意外性在语音合成韵律中的应用价值**（2023年6月16日）\u003Cdetails>\u003Csummary>索福克里斯·卡库罗斯、尤拉伊·希姆科、马尔蒂·韦尼奥等（2023年SSW会议）\u003C\u002Fsummary>索福克里斯·卡库罗斯、尤拉伊·希姆科、马尔蒂·韦尼奥、安蒂·苏尼\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.09814)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=63aad36dc981348493be6743292a04234b29ba4e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F63aad36dc981348493be6743292a04234b29ba4e)\n\n+ 
**简单可控的音乐生成**（2023年6月8日）\u003Cdetails>\u003Csummary>贾德·科佩特、费利克斯·克鲁克、伊泰·加特等\u003C\u002Fsummary>贾德·科佩特、费利克斯·克鲁克、伊泰·加特、塔尔·雷梅兹、大卫·坎特、加布里埃尔·辛纳耶夫、约西·阿迪、亚历山大·德福塞\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05284)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-71-blue.svg?paper=4cc8e18f5eece0b0d8e1abcb8ee10fb33680fbb2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4cc8e18f5eece0b0d8e1abcb8ee10fb33680fbb2)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fai.honu.io\u002Fpapers\u002Fmusicgen\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Faudiocraft.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Faudiocraft)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ffacebook\u002FMusicGen)\n\n+ **Make-An-Audio 2: 时序增强的文本到音频生成**（2023年5月29日）\u003Cdetails>\u003Csummary>黄嘉伟、任毅、黄荣杰等\u003C\u002Fsummary>黄嘉伟、任毅、黄荣杰、杨东超、叶振辉、张晨、刘景林、尹翔、马泽军、赵周\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18474)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=83d4b22d803ae856cf6b308482bd504fa151d39e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F83d4b22d803ae856cf6b308482bd504fa151d39e)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmake-an-audio-2.github.io\u002F)\n\n+ **Jukebox: 音乐生成模型**（2020年4月30日）\u003Cdetails>\u003Csummary>普拉富拉·达里瓦尔、俊熙宇、克里斯汀·佩恩等\u003C\u002Fsummary>普拉富拉·达里瓦尔、俊熙宇、克里斯汀·佩恩、金钟旭、亚历克·拉德福德、伊利亚·萨茨克维尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.00341)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-477-blue.svg?paper=67dea28495cab71703993d0d52ca4733b9a66077)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F67dea28495cab71703993d0d52ca4733b9a66077)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fopenai.com\u002Fresearch\u002Fjukebox)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopenai\u002Fjukebox.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fjukebox)\n\n+ **Audiogpt: 理解与生成语音、音乐、声音及说话人头像**（2023年4月25日）\u003Cdetails>\u003Csummary>黄荣杰、李明泽、杨东超等\u003C\u002Fsummary>黄荣杰、李明泽、杨东超、施家彤、常轩凯、叶振辉、吴雨宁、洪志清、黄嘉伟、刘景林、任毅、赵周、渡边真司\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.12995)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-83-blue.svg?paper=8bc617c9139648d7a92991d70c671230bac7b2e2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8bc617c9139648d7a92991d70c671230bac7b2e2)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIGC-Audio\u002FAudioGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAIGC-Audio\u002FAudioGPT)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAIGC-Audio\u002FAudioGPT)\n\n+ **TANGO: 
基于指令微调的大语言模型与潜在扩散模型的文本到音频生成**（2023年4月24日）\u003Cdetails>\u003Csummary>迪潘韦·戈沙尔、纳沃尼尔·马朱姆达尔、安布吉·梅里什等\u003C\u002Fsummary>迪潘韦·戈沙尔、纳沃尼尔·马朱姆达尔、安布吉·梅里什、苏贾尼亚·波里亚\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.13731)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-51-blue.svg?paper=f51bc74814a3452009ea5ca262d9768d08149ee6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff51bc74814a3452009ea5ca262d9768d08149ee6)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ftango-web.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeclare-lab\u002Ftango.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdeclare-lab\u002Ftango)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fdeclare-lab\u002Ftango)\n\n+ **Hugginggpt: 利用ChatGPT及其在Hugging Face中的伙伴解决AI任务**（2023年3月30日）\u003Cdetails>\u003Csummary>沈永亮、宋凯涛、谭旭等\u003C\u002Fsummary>沈永亮、宋凯涛、谭旭、李东升、陆卫明、庄玉婷\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.17580)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-413-blue.svg?paper=d1120d67b700e4dfe8b39eb1e48fbdea4e1a0c43)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd1120d67b700e4dfe8b39eb1e48fbdea4e1a0c43)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FJARVIS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FJARVIS)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FHuggingGPT)\n\n+ **神经编解码语言模型是零样本文本到语音合成器**（2023年1月5日）\u003Cdetails>\u003Csummary>王成义、陈三元、吴宇等\u003C\u002Fsummary>王成义、陈三元、吴宇、张子强、周龙、刘淑洁、陈卓、刘艳青、王华明、李金宇、何磊、赵胜、魏福如\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.02111)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-203-blue.svg?paper=c2f91f35df893714418cc29096083dce0b441229)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc2f91f35df893714418cc29096083dce0b441229)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fproject\u002Fvall-e-x\u002F)\n\n+ **MusicLM: 从文本生成音乐**（2023年1月26日）\u003Cdetails>\u003Csummary>安德烈亚·阿戈斯蒂内利、蒂莫·I·登克、扎兰·博尔索斯等\u003C\u002Fsummary>安德烈亚·阿戈斯蒂内利、蒂莫·I·登克、扎兰·博尔索斯、杰西·恩格尔、毛罗·韦尔泽蒂、安托万·凯永、黄庆庆、阿伦·扬森、亚当·罗伯茨、马可·塔格利亚萨奇、马特·沙里菲、尼尔·泽吉杜尔、克里斯蒂安·弗兰克\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.11325)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-171-blue.svg?paper=428854d9e75f94f0e61f37c6887c77800437d516)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F428854d9e75f94f0e61f37c6887c77800437d516)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgoogle-research.github.io\u002Fseanet\u002Fmusiclm\u002Fexamples\u002F)\n\n\n### 数据集\n+ **Libriheavy: 
包含标点符号、大小写及上下文的5万小时ASR语料库**（2023年9月15日）\u003Cdetails>\u003Csummary>康伟、杨晓宇、姚增伟等\u003C\u002Fsummary>康伟、杨晓宇、姚增伟、匡方俊、杨一凡、郭立勇、林龙、丹尼尔·波维\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.08105)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=e99b45179686982401d2d6ec919e42b327f04c0b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe99b45179686982401d2d6ec919e42b327f04c0b)\n\n+ **WenetSpeech: 用于语音识别的超过1万小时多领域普通话语料库**（2021年10月7日）\u003Cdetails>\u003Csummary>张斌斌、吕航、郭鹏程等\u003C\u002Fsummary>张斌斌、吕航、郭鹏程、邵启杰、杨超、谢磊、许欣、步辉、陈晓宇、曾晨晨、吴迪、彭振东\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.03370)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-103-blue.svg?paper=9de3ac21af795dac56f6031e73db8198716bb352)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9de3ac21af795dac56f6031e73db8198716bb352)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwenet.org.cn\u002FWenetSpeech\u002F)\n\n+ **Vggsound：一个大规模的视听数据集**（2020年4月29日）\u003Cdetails>\u003Csummary>陈洪烈、谢伟迪、安德烈亚·韦达尔迪等（ICASSP）\u003C\u002Fsummary>陈洪烈、谢伟迪、安德烈亚·韦达尔迪、安德鲁·齐瑟曼\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.14368)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-316-blue.svg?paper=66831f683141c11ed7e20b0f2e8b40700740c164)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F66831f683141c11ed7e20b0f2e8b40700740c164)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fdata\u002Fvggsound\u002F)\n\n+ **Libri-Light：一个用于有限或无监督ASR的基准数据集**（2019年12月17日）\u003Cdetails>\u003Csummary>雅各布·卡恩、摩根·里维埃尔、魏毅郑等（ICASSP）\u003C\u002Fsummary>雅各布·卡恩、摩根·里维埃尔、魏毅郑、叶夫根尼·哈里托诺夫、钱通徐、皮埃尔-埃马纽埃尔·马扎雷、朱利安·卡拉达伊、维塔利·利普钦斯基、罗南·科洛贝尔、克里斯蒂安·富根、塔季亚娜·利霍马年科、加布里埃尔·辛纳耶夫、阿芒·朱兰、阿卜杜勒-拉赫曼·穆罕默德、埃马纽埃尔·杜普克斯\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9052942?casa_token=lrlqiDak4dkAAAAA:VfCRwwWhLiJyb61NkesOfzpobk4zjac1boi4PoJ7llh1SKSi5YJDt4DaozUQw_o8X4LvO1bK)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-449-blue.svg?paper=f59c038dee828e0a8c2fc28130d12e39ee4952d6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff59c038dee828e0a8c2fc28130d12e39ee4952d6)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flibri-light)\n\n+ 
**mtg-jamendo音乐自动标签数据集**（2019年6月15日）\u003Cdetails>\u003Csummary>德米特里·博格达诺夫、闵兹·温、菲利普·托夫斯托甘等（ICML）\u003C\u002Fsummary>德米特里·博格达诺夫、闵兹·温、菲利普·托夫斯托甘、阿拉斯泰尔·波特、哈维尔·塞拉\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Frepositori.upf.edu\u002Fbitstream\u002Fhandle\u002F10230\u002F42015\u002Fbogdanov_ICML2019__Jamendo.pdf?sequence=1&isAllowed=y)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-102-blue.svg?paper=23037085b0815455e6d47333089b925c8c0e21d5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F23037085b0815455e6d47333089b925c8c0e21d5)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmtg.github.io\u002Fmtg-jamendo-dataset\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMTG\u002Fmtg-jamendo-dataset.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMTG\u002Fmtg-jamendo-dataset)\n\n+ **LibriTTS：基于LibriSpeech的文本到语音语料库**（2019年4月5日）\u003Cdetails>\u003Csummary>禅平贺、越邓、罗布·克拉克等\u003C\u002Fsummary>禅平贺、越邓、罗布·克拉克、张宇、罗恩·J·魏斯、叶佳、陈志峰、吴永辉\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.02882)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-555-blue.svg?paper=2789b6c84ba1422746246685001accba5563e7c1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2789b6c84ba1422746246685001accba5563e7c1)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.openslr.org\u002F60\u002F)\n\n+ **借助MAESTRO数据集实现钢琴音乐的因子化建模与生成**（2018年10月29日）\u003Cdetails>\u003Csummary>柯蒂斯·霍桑、安德烈·斯塔修克、亚当·罗伯茨等\u003C\u002Fsummary>柯蒂斯·霍桑、安德烈·斯塔修克、亚当·罗伯茨、伊恩·西蒙、程智安娜·黄、桑德·迪勒曼、埃里希·埃尔森、杰西·恩格尔、道格拉斯·埃克\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.12247)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-331-blue.svg?paper=2603a68b4503ba949c91c7e00cd342624b4aae2f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2603a68b4503ba949c91c7e00cd342624b4aae2f)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmagenta.tensorflow.org\u002Fdatasets\u002Fmaestro)\n\n+ **Audio Set：音频事件的本体论及人工标注数据集**（2017年3月5日）\u003Cdetails>\u003Csummary>约尔特·F·盖梅克、丹尼尔·P·W·埃利斯、迪伦·弗里德曼等（TASLP）\u003C\u002Fsummary>约尔特·F·盖梅克、丹尼尔·P·W·埃利斯、迪伦·弗里德曼、阿伦·扬森、韦德·劳伦斯、R·钱宁·摩尔、马诺杰·普拉卡尔、马文·里特尔\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fstatic.googleusercontent.com\u002Fmedia\u002Fresearch.google.com\u002Fzh-CN\u002F\u002Fpubs\u002Farchive\u002F45857.pdf)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2338-blue.svg?paper=5ba2218b708ca64ab556e39d5997202e012717d5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5ba2218b708ca64ab556e39d5997202e012717d5)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fresearch.google.com\u002Faudioset\u002Findex.html)\n\n+ 
**Librispeech：基于公共领域有声读物的ASR语料库**（2015年4月19日）\u003Cdetails>\u003Csummary>瓦西尔·帕纳约托夫、陈果果、丹尼尔·波维等（ICASSP）\u003C\u002Fsummary>瓦西尔·帕纳约托夫、陈果果、丹尼尔·波维、桑吉夫·库丹普尔\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7178964)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4815-blue.svg?paper=34038d9424ce602d7ac917a4e582d977725d4393)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F34038d9424ce602d7ac917a4e582d977725d4393)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.openslr.org\u002F12)\n\n+ **利用游戏评估算法：以音乐标签为例**（2009年10月26日）\u003Cdetails>\u003Csummary>艾迪丝·劳、克里斯·韦斯特、迈克尔·曼德尔等（ISMIR）\u003C\u002Fsummary>艾迪丝·劳、克里斯·韦斯特、迈克尔·曼德尔、梅尔特·贝伊、J·史蒂芬·道尼\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Fismir2009.ismir.net\u002Fproceedings\u002FOS5-5.pdf)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-214-blue.svg?paper=8a1384e041cc6ea2735b01c734aeef666dc92884)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8a1384e041cc6ea2735b01c734aeef666dc92884)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmirg.city.ac.uk\u002Fcodeapps\u002Fthe-magnatagatune-dataset)\n\n\n\n## 多模态生成\n\n### 🔅 基于大语言模型\n\n\n\n+ **C3LLM：基于大型语言模型的条件多模态内容生成**（2024年5月25日） \u003Cdetails>\u003Csummary>王梓轩、段钦凯、戴宇荣等\u003C\u002Fsummary>王梓轩、段钦凯、戴宇荣、唐志刚\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16136)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=78582ad19779a69d97b797a3c6eb2397f99398b6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FC3LLM%3A-Conditional-Multimodal-Content-Generation-Wang-Duan\u002Ff9151b94ff5476cf155c085cf4c3280715cf9bde)\n\n\n\n+ **CoDi-2：上下文内、交错式、交互式的任意到任意生成**（2023年11月30日） \u003Cdetails>\u003Csummary>汤子宁、杨子怡、马哈茂德·卡德米等\u003C\u002Fsummary>汤子宁、杨子怡、马哈茂德·卡德米、刘洋、朱成光、莫希特·班萨尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.18775)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=78582ad19779a69d97b797a3c6eb2397f99398b6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F78582ad19779a69d97b797a3c6eb2397f99398b6)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcodi-2.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fi-Code.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fi-Code\u002Ftree\u002Fmain\u002FCoDi-2)\n\n\n\n+ **TEAL：为多模态大型语言模型对所有内容进行分词与嵌入**（2023年11月8日）\u003Cdetails>\u003Csummary>杨振、张英雪、孟凡东等\u003C\u002Fsummary>杨振、张英雪、孟凡东、周杰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.04589)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=9f411fda2ad5b141a3115f707bcf5ee865b3fb94)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FTEAL%3A-Tokenize-and-Embed-ALL-for-Multi-modal-Large-Yang-Zhang\u002F59d716b442ab760a78f58de6748c0fa1d507bfc1)\n`tokenizer`\n\n\n\n+ 
**NExT-GPT：任意到任意的多模态大语言模型**（2023年9月11日）\u003Cdetails>\u003Csummary>吴圣琼、费浩、屈雷刚等\u003C\u002Fsummary>吴圣琼、费浩、屈雷刚、季伟、蔡添顺\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05519)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-94-blue.svg?paper=fa75a55760e6ea49b39b83cb85c99a22e1088254)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffa75a55760e6ea49b39b83cb85c99a22e1088254)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fnext-gpt.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNExT-GPT\u002FNExT-GPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FNExT-GPT\u002FNExT-GPT)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002F1ca8b1601858a12830.gradio.live\u002F)\n\n\n+ **CoDi：通过可组合扩散实现任意到任意生成**（2023年5月19日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] 汤子宁、杨子怡、朱成光等\u003C\u002Fsummary>汤子宁、杨子怡、朱成光、Michael Zeng、莫希特·班萨尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11846)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-41-blue.svg?paper=9f411fda2ad5b141a3115f707bcf5ee865b3fb94)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9f411fda2ad5b141a3115f707bcf5ee865b3fb94)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fi-Code.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fi-Code\u002Ftree\u002Fmain\u002Fi-Code-V3)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcodi-gen.github.io\u002F)\n\n\n\n### 非基于大语言模型\n\n\n+ **DiffSHEG：一种基于扩散模型的实时语音驱动整体3D表情与手势生成方法**（2024年1月9日）\u003Cdetails>\u003Csummary>[CVPR 2024] 陈俊明等\u003C\u002Fsummary>陈俊明、刘云飞、王佳楠、曾爱玲、李宇、陈启峰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.04747)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=1370d74ff6f7857a84da952e2f4cb6f42da40615)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1370d74ff6f7857a84da952e2f4cb6f42da40615)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fjeremycjm.github.io\u002Fproj\u002FDiffSHEG\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJeremyCJM\u002FDiffSHEG.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FJeremyCJM\u002FDiffSHEG) \n\n\n+ **看见与听见：利用扩散潜变量对齐器进行开放域视觉-音频生成**（2024年2月27日）\u003Cdetails>\u003Csummary>[CVPR 2024] 邢亚舟、何颖青、田泽悦等\u003C\u002Fsummary>邢亚舟、何颖青、田泽悦、王新涛、陈启峰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17723)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=21a77ed349c8621d0a0ef8407eb744e3de3b13c5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FSeeing-and-Hearing%3A-Open-domain-Visual-Audio-with-Xing-He\u002Fd9822d11ae4ead1f1d32c43124a6a0eb80ea4f0c)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyzxing87\u002FSeeing-and-Hearing.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyzxing87\u002FSeeing-and-Hearing)\n\n\n\n# 📍 多模态编辑\n\n## 图像编辑\n\n### 🔅 基于大语言模型\n\n\n\n+ 
**UltraEdit：基于指令的大规模细粒度图像编辑**（2024年7月7日）\u003Cdetails>\u003Csummary>赵浩哲、马晓健、陈亮等\u003C\u002Fsummary>赵浩哲、马晓健、陈亮、司树正、吴如洁、安凯凯、余培宇、张敏嘉、李庆、常宝宝\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.05282)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-16-blue.svg?paper=388b0f44faf0a14cc402c2554ec36a868cf59129)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FSmartEdit%3A-Exploring-Complex-Instruction-Based-with-Huang-Xie\u002F388b0f44faf0a14cc402c2554ec36a868cf59129)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fultra-editing.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHaozheZhao\u002FUltraEdit.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FHaozheZhao\u002FUltraEdit)\n\n\n+ **TIE：革新文本驱动的图像编辑，实现复杂提示遵循与高保真编辑**（2024年5月27日）\u003Cdetails>\u003Csummary>张欣宇、康梦雪、魏飞等\u003C\u002Fsummary>张欣宇、康梦雪、魏飞、徐爽、刘宇和、马林\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16803)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=8342dfd84eb91cab27404497ef0570b8d9ec55d5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FTIE%3A-Revolutionizing-Text-based-Image-Editing-for-Zhang-Kang\u002F8342dfd84eb91cab27404497ef0570b8d9ec55d5)\n\n+ **SmartEdit：探索基于复杂指令的多模态大语言模型图像编辑**（2023年12月11日）\u003Cdetails>\u003Csummary>[CVPR 2024] 黄宇舟、谢良斌、王新涛等\u003C\u002Fsummary> 黄宇舟、谢良斌、王新涛、袁子洋、寸晓东、葛一骁、周建涛、董超、黄睿、张瑞茂、单颖\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06739)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=388b0f44faf0a14cc402c2554ec36a868cf59129)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F388b0f44faf0a14cc402c2554ec36a868cf59129)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fyuzhou914.github.io\u002FSmartEdit\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencentARC\u002FSmartEdit.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTencentARC\u002FSmartEdit)\n\n\n+ **自纠正LLM控制的扩散模型**（2023年11月27日）\u003Cdetails>\u003Csummary>[CVPR 2024] 吴宗翰、连龙、约瑟夫·E·冈萨雷斯等\u003C\u002Fsummary> 吴宗翰、连龙、约瑟夫·E·冈萨雷斯、李博毅、特雷弗·达雷尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16090)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=42c4315b5d2e33d7d9a0afdf84e6a47ccd7a700e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F42c4315b5d2e33d7d9a0afdf84e6a47ccd7a700e)\n\n\n+ **Emu Edit：通过识别与生成任务实现精确图像编辑**（2023年11月16日）\u003Cdetails>\u003Csummary>[ArXiv 2023] 谢莉·谢因因、亚当·波利亚克、乌里埃尔·辛格等\u003C\u002Fsummary> 
谢莉·谢因因、亚当·波利亚克、乌里埃尔·辛格、尤瓦尔·基尔斯泰因、阿米特·佐哈尔、奥伦·阿舒阿尔、黛薇·帕里克、亚尼夫·泰格曼\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10089)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=5bcb0153dd0840113eb27d4d6f753414ef656a03)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5bcb0153dd0840113eb27d4d6f753414ef656a03)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Femu-edit.metademolab.com\u002F)\n\n\n+ **通过多模态大语言模型引导基于指令的图像编辑** \u003Cdetails>\u003Csummary>[ICLR 2024（亮点论文）] 傅祖杰、胡文泽、杜贤志等\u003C\u002Fsummary> 傅祖杰、胡文泽、杜贤志、威廉·杨·王、殷飞·杨、甘哲\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17102v1)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=092245d86b77181c36f972b1b7a17a59cd989c4a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F092245d86b77181c36f972b1b7a17a59cd989c4a)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmllm-ie.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftsujuifu\u002Fpytorch_mgie.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ftsujuifu\u002Fpytorch_mgie)\n\n\n\n\n+ **CHATEDIT：迈向基于对话的多轮交互式人脸图像编辑**（2023年3月20日）\u003Cdetails>\u003Csummary>[EMNLP 2023] 崔星、李泽坤、李佩佩等\u003C\u002Fsummary> 崔星、李泽坤、李佩佩、胡一博、史海林、何兆峰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.11108)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=5a185965ad1e87367d044b47043706d00b85b007)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5a185965ad1e87367d044b47043706d00b85b007)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcuixing100876\u002FChatEdit.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fcuixing100876\u002FChatEdit)\n\n\n\n\n+ **HIVE：利用人类反馈进行指导性视觉编辑**（2023年3月16日）\u003Cdetails>\u003Csummary>张姝、杨欣怡、冯义浩等\u003C\u002Fsummary> 张姝、杨欣怡、冯义浩、秦灿、陈嘉志、于宁、陈泽远、王欢、西尔维奥·萨瓦雷斯、斯特凡诺·埃尔蒙、熊才明、徐然。\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.09618)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-28-blue.svg?paper=372bc41602bbd21f192305775f0a58de9880e454)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F372bc41602bbd21f192305775f0a58de9880e454)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fshugerdou.github.io\u002Fhive\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsalesforce\u002FHIVE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FHIVE)\n\n\n+ **Visual ChatGPT：与视觉基础模型对话、绘图和编辑**（2023年3月8日） \u003Cdetails>\u003Csummary>吴晨菲、尹圣明、齐伟珍等\u003C\u002Fsummary> 
吴晨菲、尹圣明、齐伟珍、王晓东、唐泽成、段楠\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.04671)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-337-blue.svg?paper=af997821231898a5f8d0fd78dad4eec526acabe5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Faf997821231898a5f8d0fd78dad4eec526acabe5)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FJARVIS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmoymix\u002FTaskMatrix)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002Fvisual_chatgpt)\n\n+ **InstructPix2Pix：学习遵循图像编辑指令**（2022年11月17日）\\\n[CVPR 2023（亮点论文）] 布鲁克斯、蒂姆、亚历山大·霍林斯基和阿列克谢·A·叶夫罗斯。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.09800)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-582-blue.svg?paper=a2d2bbe4c542173662a444b33b76c66992697830)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fa2d2bbe4c542173662a444b33b76c66992697830)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.timothybrooks.com\u002Finstruct-pix2pix)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftimothybrooks\u002Finstruct-pix2pix.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ftimothybrooks\u002Finstruct-pix2pix)\n\n\n\n\n\n### 非LLM类（Clip\u002FT5）\n\n+ **SeedEdit：将图像重生成与图像编辑对齐**（2024年11月11日）\\\n施一春、王鹏、黄伟林 \\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.06686)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fteam.doubao.com\u002Fen\u002Fspecial\u002Fseededit)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FByteDance\u002FSeedEdit-APP)\n\n+ **DiffEditor：提升基于扩散模型的图像编辑精度与灵活性**（2024年2月4日）\u003Cdetails>\u003Csummary>[CVPR 2024] 邵蒙、王新涛、宋杰冲等\u003C\u002Fsummary>邵蒙、王新涛、宋杰冲、Ying Shan、Jian Zhang。\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.02583)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=198b3d809594a76bc473927af37b858132ac7fdd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F198b3d809594a76bc473927af37b858132ac7fdd)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMC-E\u002FDragonDiffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMC-E\u002FDragonDiffusion)\n\n\n\n+ **ZONE：零样本指令引导的局部编辑**（2023年12月28日）\u003Cdetails>\u003Csummary>李尚林、曾博涵、冯宇唐等\u003C\u002Fsummary>李尚林、曾博涵、冯宇唐、高思成、刘旭辉、刘嘉铭、林立、唐旭、胡耀、刘建庄、张宝昌。\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.16794)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=05eb2ad3af471c05a24abbf70258688e579cdf22)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F05eb2ad3af471c05a24abbf70258688e579cdf22)\n\n\n\n\n+ **留意每一步：基于文本指令的局部图像与场景编辑**（2023年8月17日）\u003Cdetails>\u003Csummary>Ashkan Mirzaei、Tristan Aumentado-Armstrong、Marcus A. Brubaker等\u003C\u002Fsummary>Ashkan Mirzaei、Tristan Aumentado-Armstrong、Marcus A. 
Brubaker、Jonathan Kelly、Alex Levinshtein、Konstantinos G. Derpanis、Igor Gilitschenski。\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.08947)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=737ad8905228cd410e3342b5cceefd4feb57d166)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F737ad8905228cd410e3342b5cceefd4feb57d166)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fashmrz.github.io\u002FWatchYourSteps\u002F)\n\n\n+ **Dragondiffusion：在扩散模型上实现拖拽式操控**（2023年7月5日）\u003Cdetails>\u003Csummary>[ICLR 2024] 邵蒙、王新涛、宋杰冲等\u003C\u002Fsummary>邵蒙、王新涛、宋杰冲、Ying Shan、Jian Zhang。\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.14331)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-36-blue.svg?paper=2cfaa5b3571d3b75f040f6d639359a3c673f5561)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2cfaa5b3571d3b75f040f6d639359a3c673f5561)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmc-e.github.io\u002Fproject\u002FDragonDiffusion\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMC-E\u002FDragonDiffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMC-E\u002FDragonDiffusion)\n\n\n+ **差异扩散：让每个像素都发挥其优势**（2023年6月1日）\u003Cdetails>\u003Csummary>[Arxiv 2023] Thao Nguyen、Yuheng Li、Utkarsh Ojha等\u003C\u002Fsummary>Thao Nguyen、Yuheng Li、Utkarsh Ojha、Yong Jae Lee。\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.14331)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=6e5760e5d4b468bbf01a95a6f64bd65c3aa3d798)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6e5760e5d4b468bbf01a95a6f64bd65c3aa3d798)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdifferential-diffusion.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fexx8\u002Fdifferential-diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fexx8\u002Fdifferential-diffusion)\n\n\n+ **视觉指令反演：通过视觉提示进行图像编辑**（2023年7月26日）\u003Cdetails>\u003Csummary>[ArXiv 2023] Thao Nguyen、Yuheng Li、Utkarsh Ojha等\u003C\u002Fsummary>Thao Nguyen、Yuheng Li、Utkarsh Ojha、Yong Jae Lee。\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.14331)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=f4c62aa336de45273e0fdfcfbd65b3c2e552ad56)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff4c62aa336de45273e0fdfcfbd65b3c2e552ad56)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fthaoshibe.github.io\u002Fvisii\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthaoshibe\u002Fvisii.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthaoshibe\u002Fvisii)\n\n+ **MasaCtrl：无需调优的一致性图像合成与编辑中的互斥自注意力控制**（2023年4月17日）\u003Cdetails>\u003Csummary>[ICCV 2023] 曹明登、王新涛、齐中刚等\u003C\u002Fsummary>曹明登、王新涛、齐中刚、Ying Shan、Xiaohu Qie、Yinqiang 
Zheng。\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.08947)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-102-blue.svg?paper=85963807c11abe38e9a2797d9860e012238607ef)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F85963807c11abe38e9a2797d9860e012238607ef)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fljzycmd.github.io\u002Fprojects\u002FMasaCtrl\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencentARC\u002FMasaCtrl.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTencentARC\u002FMasaCtrl)\n\n\n+ **PAIR-Diffusion：全面的多模态对象级图像编辑器**（2023年3月30日）\u003Cdetails>\u003Csummary>[ArXiv 2023] Vidit Goel、Elia Peruzzo、Yifan Jiang等\u003C\u002Fsummary>Vidit Goel、Elia Peruzzo、Yifan Jiang、Dejia Xu、Xingqian Xu、Nicu Sebe、Trevor Darrell、Zhangyang Wang、Humphrey Shi。\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.17546)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=c614a4da924466f62ca39002af425c9d14d240a3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc614a4da924466f62ca39002af425c9d14d240a3)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvidit98.github.io\u002Fpublication\u002Fconference-paper\u002Fpair_diff.html)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpix2pixzero\u002Fpix2pix-zero.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPicsart-AI-Research\u002FPAIR-Diffusion)\n\n\n\n+ **零样本图像到图像的转换**（2023年2月6日）\u003Cdetails>\u003Csummary>[SIGGRAPH 2023] Gaurav Parmar、Krishna Kumar Singh、Richard Zhang等\u003C\u002Fsummary>Gaurav Parmar、Krishna Kumar Singh、Richard Zhang、Yijun Li、Jingwan Lu、Jun-Yan Zhu。\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.03027)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-159-blue.svg?paper=daf61010eee0fbf6f9bab7db71c395ffca6f3ff3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdaf61010eee0fbf6f9bab7db71c395ffca6f3ff3)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fpix2pixzero.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpix2pixzero\u002Fpix2pix-zero.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fpix2pixzero\u002Fpix2pix-zero)\n\n+ **SINE：基于文本到图像扩散模型的单张图像编辑**（2022年12月8日）\u003Cdetails>\u003Csummary>[CVPR 2023] 张志行、韩立功、Arnab Ghosh 等\u003C\u002Fsummary> 张志行、韩立功、Arnab Ghosh、Dimitris Metaxas、任健。 \u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.04489)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-69-blue.svg?paper=a6ad30123bef4b19ee40c3d63cfabf00d211f0ef)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fa6ad30123bef4b19ee40c3d63cfabf00d211f0ef)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fzhang-zx.github.io\u002FSINE\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhang-zx\u002FSINE.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzhang-zx\u002FSINE)\n\n+ 
**基于复杂文本指令的交互式图像操控**（2022年11月25日）\u003Cdetails>\u003Csummary>[WACV 2023] 森田隆吾、张志强、何文敏 等\u003C\u002Fsummary> 森田隆吾、张志强、何文敏、周金佳。 \u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.15352)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=387144d293567408c363313aac971294e7ec8547)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F387144d293567408c363313aac971294e7ec8547)\n\n\n+ **用于文本驱动图像到图像转换的即插即用扩散特征**（2022年11月22日）\u003Cdetails>\u003Csummary>[CVPR 2023] 纳雷克·图马尼扬、米哈尔·盖耶尔、沙伊·巴贡 等\u003C\u002Fsummary> 纳雷克·图马尼扬、米哈尔·盖耶尔、沙伊·巴贡、塔莉·德凯尔。 \u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.12572)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-224-blue.svg?paper=b000d6865db824af1563708fb7a545ddd65c6b3a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb000d6865db824af1563708fb7a545ddd65c6b3a)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fpnp-diffusion.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMichalGeyer\u002Fplug-and-play.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMichalGeyer\u002Fplug-and-play)\n\n+ **Imagic：基于扩散模型的文本驱动真实图像编辑**（2022年10月17日）\u003Cdetails>\u003Csummary>[CVPR 2023] 巴哈贾特·卡瓦尔、希兰·扎达、奥兰·朗格 等\u003C\u002Fsummary> 巴哈贾特·卡瓦尔、希兰·扎达、奥兰·朗格、奥默·托夫、常慧雯、塔莉·德凯尔、因巴尔·莫塞里、米哈尔·伊拉尼。 \u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.09276)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-496-blue.svg?paper=23e261a20a315059b4de5492ed071c97a20c12e7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F23e261a20a315059b4de5492ed071c97a20c12e7)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fimagic-editing.github.io\u002F)\n\u003C!-- [![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpix2pixzero\u002Fpix2pix-zero.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fpix2pixzero\u002Fpix2pix-zero) -->\n\n\n\n+ **利用引导扩散模型编辑真实图像的空文本反演**\u003Cdetails>\u003Csummary>[ICLR 2023] 罗恩·莫卡迪、阿米尔·赫兹、基菲尔·阿伯曼 等\u003C\u002Fsummary> 罗恩·莫卡迪、阿米尔·赫兹、基菲尔·阿伯曼、雅埃尔·普里奇、丹尼尔·科恩-奥尔。 \u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.09794)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=4de94949daf9bc8dd0e5161d20dfe83198d20ec1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4de94949daf9bc8dd0e5161d20dfe83198d20ec1)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fnull-text-inversion.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle\u002Fprompt-to-prompt.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fprompt-to-prompt\u002F#null-text-inversion-for-editing-real-images)\n\n\n+ **基于交叉注意力控制的提示到提示图像编辑** \u003Cdetails>\u003Csummary>[ICLR 2023] 阿米尔·赫兹、罗恩·莫卡迪、杰伊·特南鲍姆 等\u003C\u002Fsummary> 阿米尔·赫兹、罗恩·莫卡迪、杰伊·特南鲍姆、基菲尔·阿伯曼、雅埃尔·普里奇、丹尼尔·科恩-奥尔。 
\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.01626)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-717-blue.svg?paper=04e541391e8dce14d099d00fb2c21dbbd8afe87f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F04e541391e8dce14d099d00fb2c21dbbd8afe87f)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fprompt-to-prompt.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle\u002Fprompt-to-prompt.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fprompt-to-prompt)\n\n\n+ **DiffEdit：基于扩散的语义图像编辑，带掩码指导**（2022年10月20日） \u003Cdetails>\u003Csummary>[ICLR 2023] 吉约姆·库瓦龙、雅各布·韦尔贝克、霍尔格·施文克 等\u003C\u002Fsummary> 吉约姆·库瓦龙、雅各布·韦尔贝克、霍尔格·施文克、马蒂厄·科尔德。 \u003C\u002Fdetails> \n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.11427)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-208-blue.svg?paper=064ccebc03d3afabaae30fe29a457c1cfcdff7e3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F064ccebc03d3afabaae30fe29a457c1cfcdff7e3)\n\u003C!-- [![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fashmrz.github.io\u002FWatchYourSteps\u002F) -->\n\u003C!-- [![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FXiang-cd\u002FDiffEdit-stable-diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FXiang-cd\u002FDiffEdit-stable-diffusion) -->\n\n\n\n+ **DiffusionCLIP：用于鲁棒图像操控的文本引导扩散模型**（2021年10月6日)\\\n[CVPR 2022] 金光贤、权泰成、叶宗哲。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.02711)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-318-blue.svg?paper=8f8dedb511c0324d1cb7f9750560109ca9290b5f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8f8dedb511c0324d1cb7f9750560109ca9290b5f)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgwang-kim\u002FDiffusionCLIP.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fgwang-kim\u002FDiffusionCLIP)\n\n\n+ **SDEdit：基于随机微分方程的引导图像合成与编辑**（2021年8月2日）\u003Cdetails>\u003Csummary>[ICLR 2022] 孟晨琳、何宇彤、宋阳 等\u003C\u002Fsummary> 孟晨琳、何宇彤、宋阳、宋嘉明、吴家俊、朱俊彦、斯特凡诺·埃尔蒙。 \u003C\u002Fdetails> \n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.01073)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-604-blue.svg?paper=f671a09e3e5922e6d38cb77dda8d76d5ceac2a27)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff671a09e3e5922e6d38cb77dda8d76d5ceac2a27)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsde-image-editing.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fermongroup\u002FSDEdit.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fermongroup\u002FSDEdit)\n\n\n\n## 视频编辑\n\n### 🔅 基于大语言模型\n\n\n+ 
**使用合成数据集进行一致的视频到视频迁移**（2023年11月1日）\\\n程佳欣、肖天俊、何通。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00213)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=e8bbffb8413cb1f88e99a7ecbabd21a6eac82271)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe8bbffb8413cb1f88e99a7ecbabd21a6eac82271)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Famazon-science\u002Finstruct-video-to-video.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Famazon-science\u002Finstruct-video-to-video)\n\n+ **InstructVid2Vid：基于自然语言指令的可控视频编辑**（2023年5月21日）\u003Cdetails>\u003Csummary>秦博生、李俊成、唐思亮等。\u003C\u002Fsummary>秦博生、李俊成、唐思亮、蔡德成、庄宇婷。\u003C\u002Fdetails> [![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.12328)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=205d2ed0906440f07a0275d7d6a63bced60951fc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F205d2ed0906440f07a0275d7d6a63bced60951fc)\n\u003C!-- [![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fduyguceylan\u002Fpix2video.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fduyguceylan\u002Fpix2video) -->\n\n### 非基于大语言模型（Clip\u002FT5）\n\n\n\n+ **AudioScenic：音频驱动的视频场景编辑**（2024年4月25日）\u003Cdetails>\u003Csummary>沈凯欣、全瑞杰、朱林超等。\u003C\u002Fsummary>沈凯欣、全瑞杰、朱林超、肖军、杨毅。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.16581)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=6fa898ed5e58ade17a020e3251687b811ff1d023)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FAudioScenic%3A-Audio-Driven-Video-Scene-Editing-Shen-Quan\u002F6fa898ed5e58ade17a020e3251687b811ff1d023)\n\n\n+ **LATENTWARP：用于零样本视频到视频转换的一致性扩散潜在空间**（2023年11月1日）\u003Cdetails>\u003Csummary>鲍宇翔、邱迪、康国梁等。\u003C\u002Fsummary>鲍宇翔、邱迪、康国梁、张宝昌、金波、王凯业、闫鹏飞。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00353)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=1b4323a5324ee20fe9b2ff2a65ec26550a51ec2c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1b4323a5324ee20fe9b2ff2a65ec26550a51ec2c)\n\u003C!-- [![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdiffusion-tokenflow.github.io\u002F) -->\n\u003C!-- [![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fomerbt\u002FTokenFlow.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fomerbt\u002FTokenFlow) -->\n\n+ 
**MagicStick：通过控制手柄变换实现的可控视频编辑**（2023年11月1日）\u003Cdetails>\u003Csummary>马悦、寸晓东、何英青等。\u003C\u002Fsummary>马悦、寸晓东、何英青、齐晨阳、王新涛、单颖、李秀、陈启峰。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03047)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=xxx)\n)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fxxx)\n)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmagic-stick-edit.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmayuelala\u002FMagicStick.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmayuelala\u002FMagicStick)\n\n\n+ **MagicEdit：高保真度的时间一致性视频编辑**（2023年8月28日） \u003Cdetails>\u003Csummary>刘俊豪、严汉书、张建峰等。\u003C\u002Fsummary>刘俊豪、严汉书、张建峰、徐忠聪、冯家世。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.14749)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-20-blue.svg?paper=8819777e104f8c4197c262e11a01b070b50007aa)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8819777e104f8c4197c262e11a01b070b50007aa)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmagic-edit.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmagic-research\u002Fmagic-edit.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmagic-research\u002Fmagic-edit)\n\n\n+ **StableVideo：文本驱动的一致性感知扩散视频编辑**（2023年8月18日）\u003Cdetails>\u003Csummary>[ICCV 2023] 柴文浩、郭迅、王高昂等。\u003C\u002Fsummary>柴文浩、郭迅、王高昂、陆燕。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.09592)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-41-blue.svg?paper=05cbac9a5101f47a6fabad72398616506572c9fa)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F05cbac9a5101f47a6fabad72398616506572c9fa)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frese1f\u002FStableVideo.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Frese1f\u002FStableVideo)\n\n\n+ **CoDeF：用于时间一致性视频处理的内容变形场**（2023年8月15日） \u003Cdetails>\u003Csummary>欧阳浩、王秋雨、肖宇曦等。\u003C\u002Fsummary>欧阳浩、王秋雨、肖宇曦、白庆彦、张俊涛、郑克诚、周晓伟、陈启峰。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.07926)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-28-blue.svg?paper=c2d65fc3a7fde3f7662c6ef9448e5737d7e5551f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc2d65fc3a7fde3f7662c6ef9448e5737d7e5551f)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fqiuyu96.github.io\u002FCoDeF\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fqiuyu96\u002FCoDeF.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fqiuyu96\u002FCoDeF)\n\n\n+ **TokenFlow：用于一致性视频编辑的一致性扩散特征**（2023年7月19日） 
\u003Cdetails>\u003Csummary>米哈尔·盖耶、奥默·巴尔-塔尔、沙伊·巴贡等。\u003C\u002Fsummary>米哈尔·盖耶、奥默·巴尔-塔尔、沙伊·巴贡、塔莉·德克尔。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.10373)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-80-blue.svg?paper=4761f173965195798cd3046ef4af608a83504e4d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4761f173965195798cd3046ef4af608a83504e4d)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdiffusion-tokenflow.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fomerbt\u002FTokenFlow.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fomerbt\u002FTokenFlow)\n\n+ **重新渲染一段视频：零样本文本引导的视频到视频转换**（2023年6月13日）\u003Cdetails>\u003Csummary>杨帅、周一帆、刘子威等\u003C\u002Fsummary>杨帅、周一帆、刘子威、陈昌 Loy。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.07954)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-74-blue.svg?paper=1e09b83fe064826a9a1ac61a7bdc00f26be41aee)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1e09b83fe064826a9a1ac61a7bdc00f26be41aee)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fwww.mmlab-ntu.com\u002Fproject\u002Frerender\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwilliamyang1991\u002FRerender_A_Video.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fwilliamyang1991\u002FRerender_A_Video)\n\n+ **ControlVideo：为单次文本到视频编辑添加条件控制**（2023年5月26日）\u003Cdetails>\u003Csummary>赵敏、王荣振、鲍凡等\u003C\u002Fsummary>赵敏、王荣振、鲍凡、李崇轩、朱俊。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17098)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=14acc36d8c87f31f8dcbbf8433b91af70a2a516a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F14acc36d8c87f31f8dcbbf8433b91af70a2a516a)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fml.cs.tsinghua.edu.cn\u002Fcontrolvideo\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-ml\u002Fcontrolvideo.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002Fcontrolvideo)\n\n\n\n+ **打造主角：基于专家集成的通用视频编辑**（2023年5月15日）\n米哈尔·盖耶、奥默·巴尔-塔尔、沙伊·巴贡、塔莉·德克尔。\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.08850)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=5f51eda9f7abddca027941d50fb0b6bf6f508eff)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5f51eda9f7abddca027941d50fb0b6bf6f508eff)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmake-a-protagonist.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHeliosZhao\u002FMake-A-Protagonist.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FHeliosZhao\u002FMake-A-Protagonist)\n\n\n+ **Pix2Video：基于图像扩散的视频编辑**（2023年3月22日）\\\n[ICCV 2023] 塞兰、杜伊古、黄春豪 P. 和尼洛伊 J. 
米特拉。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12688)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-92-blue.svg?paper=32a3c2fbd3e733bd0eea938517fec2ff8dc7c701)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F32a3c2fbd3e733bd0eea938517fec2ff8dc7c701)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fduyguceylan.github.io\u002Fpix2video.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fduyguceylan\u002Fpix2video.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fduyguceylan\u002Fpix2video)\n\n\n+ **FateZero：融合注意力机制实现零样本文本驱动的视频编辑**（2023年3月16日）\u003Cdetails>\u003Csummary>[ICCV 2023] 齐晨阳、孙晓东、张勇等\u003C\u002Fsummary>齐晨阳、孙晓东、张勇、雷晨阳、王新涛、应珊、陈启峰。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.09535)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-133-blue.svg?paper=14ccb8bcceb6de10eda6ad08bec242a4f2946497)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F14ccb8bcceb6de10eda6ad08bec242a4f2946497)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ffate-zero-edit.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChenyangQiQi\u002FFateZero.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FChenyangQiQi\u002FFateZero)\n\n+ **Video-P2P：基于交叉注意力控制的视频编辑**（2023年3月8日）\u003Cdetails>\u003Csummary>刘绍腾、张悦辰、李文博等\u003C\u002Fsummary>刘绍腾、张悦辰、李文博、林哲、贾佳亚。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.04761)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-81-blue.svg?paper=6283502d6900a0b403e2454b1cb1cf16ddefd5a7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6283502d6900a0b403e2454b1cb1cf16ddefd5a7)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvideo-p2p.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FShaoTengLiu\u002FVideo-P2P.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FShaoTengLiu\u002FVideo-P2P)\n\n+ **Dreamix：视频扩散模型是通用视频编辑器**（2023年2月2日）\u003Cdetails>\u003Csummary>埃亚尔·莫拉德、以利亚胡·霍维茨、丹尼·瓦列夫斯基等\u003C\u002Fsummary>埃亚尔·莫拉德、以利亚胡·霍维茨、丹尼·瓦列夫斯基、亚历克斯·拉夫·阿查、约西·马蒂亚斯、雅埃尔·普里奇、亚尼夫·莱维坦、耶迪德·霍申。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.01329)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-107-blue.svg?paper=9758ddd6ffbaac75aa0447a9664e6989811a05e2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9758ddd6ffbaac75aa0447a9664e6989811a05e2)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdreamix-video-editing.github.io\u002F)\n\n\n+ **调优一段视频：用于文本到视频生成的图像扩散模型的一次性调优**（2022年12月22日）\u003Cdetails>\u003Csummary>[ICCV 2023] 
吴章杰、葛益骁、王新涛等\u003C\u002Fsummary>吴章杰、葛益骁、王新涛、雷伟贤、顾宇超、史宇飞、许咏恩、应珊、谢小虎、郑守迈。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.11565)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-275-blue.svg?paper=1367dcff4ccb927a5e95c452041288b3f0dd0eff)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1367dcff4ccb927a5e95c452041288b3f0dd0eff)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Ffate-zero-edit.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002FTune-A-Video.svg?style=social&label=Star)](https:\u002F\u002Ftuneavideo.github.io\u002F)\n\n\n+ **M3L：基于多模态多层级Transformer的语言驱动视频编辑**（2021年4月2日）\u003Cdetails>\u003Csummary>[CVPR 2022] 傅子睿、王欣艾瑞克、斯科特 T. 格拉夫顿等\u003C\u002Fsummary>傅子睿、王欣艾瑞克、斯科特 T. 格拉夫顿、米格尔 P. 埃克施泰因、威廉·杨·王。\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.01122)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=81349524489f8ba0812ac2529eac92ec45959782)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F81349524489f8ba0812ac2529eac92ec45959782)\n\n\n\n## 3D编辑\n\n###基於LLM\n+ **SceneCraft：用Blender代碼合成3D場景的LLM智能體**（2024年3月2日）\u003Cdetails>\u003Csummary>胡子宇、Ahmet Iscen、Aashi Jain等\u003C\u002Fsummary>胡子宇、Ahmet Iscen、Aashi Jain、Thomas Kipf、Yisong Yue、David A. Ross、Cordelia Schmid、Alireza Fathi\u003C\u002Fdetails>\n[![論文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.01248v1)\n\n+ **3D-GPT：利用大型語言模型進行程序化3D建模**（2023年10月19日）\u003Cdetails>\u003Csummary>孫淳毅*、韓俊林*、鄧偉健等\u003C\u002Fsummary>孫淳毅、韓俊林、鄧偉健、王鑫龍、秦子山、Stephen Gould\u003C\u002Fdetails>\n[![論文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.12945)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=588930cdd801f335b5e524d13f99aa94136a20a0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F588930cdd801f335b5e524d13f99aa94136a20a0)\n[![代碼](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChuny1\u002F3DGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FChuny1\u002F3DGPT)\n\n###非LLM基於（Clip\u002FT5）\n+ **Paint3D：無光照紋理擴散模型繪製任意3D圖像**（2023年11月16日）\u003Cdetails>\u003Csummary>曾賢芳、陳欣、齊中奇等\u003C\u002Fsummary>曾賢芳、陳欣、齊中奇、劉文、趙子博、王志斌、傅彬、劉勇、于剛\u003C\u002Fdetails>\n[![論文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13913)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=e90883da4ee8c947a8b97422c95bde905a257a74)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe90883da4ee8c947a8b97422c95bde905a257a74)\n[![代碼](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenTexture\u002FPaint3D?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenTexture\u002FPaint3D)\n\n+ 
**3D畫筆：利用級聯分數蒸餾對3D形狀進行局部風格化**（2023年11月16日）\u003Cdetails>\u003Csummary>戴爾·迪卡圖爾、伊泰·朗、克菲爾·阿伯曼等\u003C\u002Fsummary>戴爾·迪卡圖爾、伊泰·朗、克菲爾·阿伯曼、拉娜·哈諾卡\u003C\u002Fdetails>\n[![論文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.09571)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=496bdd2804a231a3336463fca8e0a4c6a46f0304)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F496bdd2804a231a3336463fca8e0a4c6a46f0304)\n[![代碼](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthreedle\u002F3d-paintbrush?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthreedle\u002F3d-paintbrush)\n\n+ **Blending-NeRF：神經輻射場中的文本驅動局部編輯**（2023年8月23日）\u003Cdetails>\u003Csummary>宋賢燮、崔錫勳、都浩植等\u003C\u002Fsummary>宋賢燮、崔錫勳、都浩植、李哲、金泰亨\u003C\u002Fdetails>\n[![論文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.11974)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=bf7f31e07d9b128a0f555c275bc3fdb851f725b8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbf7f31e07d9b128a0f555c275bc3fdb851f725b8)\n\n+ **SINE：語義驅動的基於圖像的NeRF編輯，帶有先驗指導的編輯場**（2023年3月23日）\u003Cdetails>\u003Csummary>[CVPR 2023] 包沖、張銀達、楊邦邦等\u003C\u002Fsummary>包沖、張銀達、楊邦邦、范天興、楊澤松、鮑虎軍、張國峰、崔兆鵬\u003C\u002Fdetails>\n[![論文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.13277)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-44-blue.svg?paper=222c47b81fe04598fd84fe8b9a43f694415ec7e9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F222c47b81fe04598fd84fe8b9a43f694415ec7e9)\n[![代碼](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzju3dv\u002FSINE?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzju3dv\u002FSINE)\n\n+ **TextDeformer：利用文本引導進行幾何變換**（2023年4月26日）\u003Cdetails>\u003Csummary>[TVCG 2022] 威廉·高、諾姆·艾格曼、蒂博·格魯埃等\u003C\u002Fsummary>威廉·高、諾姆·艾格曼、蒂博·格魯埃、弗拉基米爾·G·金、拉娜·哈諾卡\u003C\u002Fdetails>\n[![論文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.13348)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-21-blue.svg?paper=4974186c3b5b50112cfd909de115d5fbe25411fd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4974186c3b5b50112cfd909de115d5fbe25411fd)\n[![代碼](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthreedle\u002FTextDeformer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthreedle\u002FTextDeformer)\n\n+ **Instruct-NeRF2NeRF：用指令編輯3D場景**（2023年3月22日）\u003Cdetails>\u003Csummary>[SIGGRAPH Asia 2023] 阿揚·哈克、馬修·坦西克、阿列克謝·A·埃夫羅斯等\u003C\u002Fsummary>阿揚·哈克、馬修·坦西克、阿列克謝·A·埃夫羅斯、亞歷山大·霍倫斯基、安朱·卡納扎瓦\u003C\u002Fdetails>\n[![論文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12789)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-131-blue.svg?paper=26c22380282a00166273038bc5ba785d845d61ad)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F26c22380282a00166273038bc5ba785d845d61ad)\n[![代碼](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fayaanzhaque\u002Finstruct-nerf2nerf.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fayaanzhaque\u002Finstruct-nerf2nerf)\n\n+ **DreamEditor：神經場驅動的文本編輯3D場景**（2023年6月23日）\u003Cdetails>\u003Csummary>[SIGGRAPH Asia 2023] 
莊靜宇、王晨、劉凌潔等\u003C\u002Fsummary>莊靜宇、王晨、劉凌潔、林亮、李冠斌\u003C\u002Fdetails>\n[![論文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.13455)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-39-blue.svg?paper=029f3e2c215edac138be26ade67b3d70b8f74dd7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F029f3e2c215edac138be26ade67b3d70b8f74dd7)\n[![代碼](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzjy526223908\u002FDreamEditor.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fzjy526223908\u002FDreamEditor)\n\n+ **SKED：草圖引導的文本驅動3D編輯**（2023年3月19日）\u003Cdetails>\u003Csummary>[ICCV 2023] 阿里安·米凱伊利、奧爾·佩雷爾、梅赫迪·薩法伊等\u003C\u002Fsummary>阿里安·米凱伊利、奧爾·佩雷爾、梅赫迪·薩法伊、丹尼爾·科恩-奧爾、阿里·馬赫達維-阿米里\u003C\u002Fdetails>\n[![論文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.10735)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-27-blue.svg?paper=6ebec1ece44daa090158ff2531d6fabb94a4e683)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6ebec1ece44daa090158ff2531d6fabb94a4e683)\n[![代碼](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Faryanmikaeili\u002FSKED.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Faryanmikaeili\u002FSKED)\n\n+ **混合NeRF：在現有神經輻射場中進行零樣本對象生成與融合**（2023年6月22日）\u003Cdetails>\u003Csummary>[ICCVW 2023] 奧里·戈登、歐姆里·阿夫拉哈米、丹尼·利希金斯基\u003C\u002Fsummary>奧里·戈登、歐姆里·阿夫拉哈米、丹尼·利希金斯基\u003C\u002Fdetails>\n[![論文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.12760)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12-blue.svg?paper=3a5d4352d3dd53148a9544233bb59f88d2504910)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3a5d4352d3dd53148a9544233bb59f88d2504910)\n\n+ **ClipFace：基于文本指导的带纹理3D可变形模型与神经辐射场编辑**（2022年12月2日）\u003Cdetails>\u003Csummary>[SIGGRAPH 2023] 希万吉·阿内贾、尤斯图斯·蒂斯、安吉拉·戴等\u003C\u002Fsummary>希万吉·阿内贾、尤斯图斯·蒂斯、安吉拉·戴、马蒂亚斯·尼斯纳\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.01406)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-31-blue.svg?paper=f21e8eddf42580d1f38a11ec5acd8891c0454a1f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff21e8eddf42580d1f38a11ec5acd8891c0454a1f)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FcassiePython\u002FCLIPNeRF.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FcassiePython\u002FCLIPNeRF)\n\n+ **CLIP-NeRF：基于文本和图像驱动的神经辐射场操控**（2021年12月9日）\u003Cdetails>\u003Csummary>[CVPR 2022] 曹旺、柴孟磊、何明明等\u003C\u002Fsummary>曹旺、柴孟磊、何明明、陈冬冬、廖静\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.05139)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-244-blue.svg?paper=0483be6c3ec6cd41ffe248f86effc7468d3ac7be)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0483be6c3ec6cd41ffe248f86effc7468d3ac7be)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshivangi-aneja\u002FClipFace.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshivangi-aneja\u002FClipFace)\n\n\n\n\n\n## 音频编辑\n\n### 🔅 基于大语言模型\n\n+ **Loop Copilot：用于音乐生成与迭代编辑的AI合奏指挥系统**（2023年10月19日）\u003Cdetails>\u003Csummary>张一骁、前泽晶、Gus Xia等\u003C\u002Fsummary>张一骁、前泽晶、Gus 
Xia、山本和彦、西蒙·迪克森\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.12404)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-3-blue.svg?paper=cca4218dd7c10c1614bbd84aa7cd7e00027bdc7c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcca4218dd7c10c1614bbd84aa7cd7e00027bdc7c)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fsites.google.com\u002Fview\u002Floop-copilot)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fldzhangyx\u002Floop-copilot\u002F.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fldzhangyx\u002Floop-copilot)\n\n+ **UniAudio：迈向通用音频生成的音频基础模型**（2023年10月1日）\\\n杨东超、田锦川、谭旭\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00704)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-15-blue.svg?paper=74bfbbb7307a7af2686043ea97ab8b34cb062ba8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F74bfbbb7307a7af2686043ea97ab8b34cb062ba8)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fdongchaoyang.top\u002FUniAudio_demo\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangdongchao\u002FUniAudio.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyangdongchao\u002FUniAudio)\n\n### 非大语言模型（Clip\u002FT5）\n\n# 📍 多模态智能体\n\n+ **LLaVA-Interactive：图像聊天、分割、生成与编辑的一体化演示**（2023年11月1日） \u003Cdetails>\u003Csummary>陈伟格、伊琳娜·斯皮里多诺娃、杨建伟等\u003C\u002Fsummary>陈伟格、伊琳娜·斯皮里多诺娃、杨建伟、高剑锋、李春元\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00571)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=c020f15be1dee20f9e2e0c5a6f05f272b5508325)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc020f15be1dee20f9e2e0c5a6f05f272b5508325)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllava-vl.github.io\u002Fllava-interactive)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLaVA-VL\u002FLLaVA-Interactive-Demo.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Interactive-Demo)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fllavainteractive.ngrok.app\u002F)\\\n**标签:** `图像聊天` `图像分割`、`图像生成` `图像编辑`\n\n+ **ControlLLM：通过图搜索为语言模型添加工具**（2023年10月26日） \u003Cdetails>\u003Csummary>刘兆阳、赖泽强、高章伟等\u003C\u002Fsummary>刘兆阳、赖泽强、高章伟、崔尔飞、李子恒、朱锡洲、陆乐威、陈启峰、乔宇、戴继峰、王文海\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.17796)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=288e7224d53d68669eb67f2496e068dc965c639e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F288e7224d53d68669eb67f2496e068dc965c639e)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcontrolllm.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FControlLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FControlLLM)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fcllm.opengvlab.com\u002F)\\\n**标签:** `图像理解` `图像生成` `图像编辑` 
`视频理解` `视频生成` `视频编辑` `音频理解` `音频生成`\n\n+ **ImageBind-LLM：多模态指令微调**（2023年9月7日） \u003Cdetails>\u003Csummary>韩嘉明、张仁睿、邵文琪等\u003C\u002Fsummary>韩嘉明、张仁睿、邵文琪、高鹏、徐鹏、肖汉、张凯鹏、刘克里斯、温松、郭子宇、卢旭东、任帅、温亚飞、陈晓欣、岳向宇、李洪生、乔宇\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.03905)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-33-blue.svg?paper=54c68b8623505dc6bf7a0b08aaa77ca9165f2d7f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F54c68b8623505dc6bf7a0b08aaa77ca9165f2d7f)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FLLaMA-Adapter.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FLLaMA-Adapter)\\\n**模态:** `文本` `图像` `视频` `音频` `点云`\n\n+ **ModelScope-Agent：使用开源大型语言模型构建可定制的智能体系统**（2023年9月2日） \u003Cdetails>\u003Csummary>李晨亮、陈鹤红、严明等\u003C\u002Fsummary>李晨亮、陈鹤红、严明、沈伟周、许海洋、吴志凯、张志成、周文猛、陈英达、程晨、施洪柱、张继、黄飞、周景仁\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.00986)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=e2f1f04f648a8863d11439aa4c80ee65d6caccda)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe2f1f04f648a8863d11439aa4c80ee65d6caccda)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmodelscope\u002Fmodelscope-agent.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fmodelscope-agent)\n\n+ **InternGPT：通过与ChatGPT交互解决以视觉为中心的任务，超越语言限制**（2023年5月9日） \u003Cdetails>\u003Csummary>刘兆阳、何一楠、王文海等\u003C\u002Fsummary>刘兆阳、何一楠、王文海、王伟云、王毅、陈寿发、张庆龙、赖泽强、杨阳、李青云、于家硕、李坤昌、陈哲、杨雪、朱锡洲、王雅丽、王利民、罗平、戴继峰、乔宇\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.05662)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-40-blue.svg?paper=54a8b153ed04a872da878d695239bdc413dc782c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F54a8b153ed04a872da878d695239bdc413dc782c)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternGPT)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Figpt.opengvlab.com\u002F)\\\n**条件模态:** `文本` `图像` `视频` `音频`\n\n+ **HuggingGPT：借助ChatGPT及其在Hugging Face中的伙伴解决AI任务**（2023年3月30日） \u003Cdetails>\u003Csummary>沈永亮、宋凯涛、谭旭等\u003C\u002Fsummary>沈永亮、宋凯涛、谭旭、李东升、陆卫明、庄玉婷\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.17580)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-413-blue.svg?paper=d1120d67b700e4dfe8b39eb1e48fbdea4e1a0c43)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd1120d67b700e4dfe8b39eb1e48fbdea4e1a0c43)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FJARVIS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FJARVIS)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FHuggingGPT)\n\n+ **Visual ChatGPT：与视觉基础模型对话、绘图和编辑**（2023年3月8日） 
\u003Cdetails>\u003Csummary>吴晨菲、尹圣明、齐维珍等\u003C\u002Fsummary>吴晨菲、尹圣明、齐维珍、王晓东、唐泽成、段楠\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.04671)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-337-blue.svg?paper=af997821231898a5f8d0fd78dad4eec526acabe5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Faf997821231898a5f8d0fd78dad4eec526acabe5)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FJARVIS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmoymix\u002FTaskMatrix)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002Fvisual_chatgpt)\n\n+ **AutoGPT：构建与使用AI智能体**\\\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fnews.agpt.co\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSignificant-Gravitas\u002FAutoGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FSignificant-Gravitas\u002FAutoGPT)\n\n\n# 📍 基于LLM的多模态理解\n\n## 多模态\n+ **Mirasol3B：一种用于时间对齐和上下文相关模态的多模态自回归模型**（2023年11月9日）\u003Cdetails>\u003Csummary>[CVPR 2024] AJ Piergiovanni、Isaac Noble、Dahun Kim 等\u003C\u002Fsummary>AJ Piergiovanni、Isaac Noble、Dahun Kim、Michael S. Ryoo、Victor Gomes、Anelia Angelova\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.05698)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=a4e7199e725b34ae5ddd574057f60ebb1a2011b7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FMirasol3B%3A-A-Multimodal-Autoregressive-Model-for-Piergiovanni-Noble\u002Fa4e7199e725b34ae5ddd574057f60ebb1a2011b7)\n`文本、视频、音频`\n\n\n## 图像理解\n\n+ **图像文本化：一种自动生成准确且详细图像描述的框架**（2024年6月11日）\u003Cdetails>\u003Csummary>Renjie Pi、Jianshu Zhang、Jipeng Zhang 等\u003C\u002Fsummary> Renjie Pi、Jianshu Zhang、Jipeng Zhang、Rui Pan、Zhekai Chen、Tong Zhang\u003C\u002Fdetails>\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07502)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F91b4f447bb06d081a7947b42df57491a04fa46f9)\n\n\n\n+ **T2S-GPT：基于文本的自动手语生成中的动态向量量化**（2024年6月11日）\u003Cdetails>\u003Csummary>[ACL 2024] Aoxiong Yin、Haoyuan Li、Kai Shen 等\u003C\u002Fsummary> Aoxiong Yin、Haoyuan Li、Kai Shen、Siliang Tang、Yueting Zhuang\u003C\u002Fdetails>\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07119)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F186910d697bf7eb605aa055aee78fd91ce3ce9fe)\n\n\n+ **基于多模态提示的开放世界人-物交互检测**（2024年6月11日）\u003Cdetails>\u003Csummary>Jie Yang、Bingliang Li、Ailing Zeng 等\u003C\u002Fsummary>Jie Yang、Bingliang Li、Ailing Zeng、Lei Zhang、Ruimao 
Zhang\u003C\u002Fdetails>\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07119)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F186910d697bf7eb605aa055aee78fd91ce3ce9fe)\n\n\n+ **常识-T2I挑战：文本到图像生成模型能否理解常识？**（2024年6月11日）\u003Cdetails>\u003Csummary>Xingyu Fu、Muyu He、Yujie Lu 等\u003C\u002Fsummary>Xingyu Fu、Muyu He、Yujie Lu、William Yang Wang、Dan Roth\u003C\u002Fdetails>\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07221v1)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=da0d382c7fa981ba185ca633868442b75cb76de6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff0acf2a2293d963c3786e83bb198c75612adc446)\n\n+ **InternVL：扩展视觉基础模型并针对通用视觉-语言任务进行对齐**（2023年12月21日）\u003Cdetails>\u003Csummary>Zhe Chen、Jiannan Wu、Wenhai Wang 等\u003C\u002Fsummary>Zhe Chen、Jiannan Wu、Wenhai Wang、Weijie Su、Guo Chen、Sen Xing、Muyan Zhong、Qinglong Zhang、Xizhou Zhu、Lewei Lu、Bin Li、Ping Luo、Tong Lu、Yu Qiao、Jifeng Dai\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14238)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-10-blue.svg?paper=6a33e58ef961a3a0a5657518b2be86395eb7c8d0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6a33e58ef961a3a0a5657518b2be86395eb7c8d0)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Finternvl.opengvlab.com\u002F)\n\n+ **LLaMA-VID：在大型语言模型中，一张图片胜过2个token**（2023年11月28日）\nYanwei Li、Chengyao Wang、Jiaya Jia\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17043)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=486c2df78cbb770a90a55f7fa3fe19102fba2c24)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F486c2df78cbb770a90a55f7fa3fe19102fba2c24)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllama-vid.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLaMA-VID.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](http:\u002F\u002F103.170.5.190:7864\u002F)\n\n+ **CogVLM：预训练语言模型的视觉专家**（2023年11月6日）\u003Cdetails>\u003Csummary>Weihan Wang、Qingsong Lv、Wenmeng Yu 等\u003C\u002Fsummary>Weihan Wang、Qingsong Lv、Wenmeng Yu、Wenyi Hong、Ji Qi、Yan Wang、Junhui Ji、Zhuoyi Yang、Lei Zhao、Xixuan Song、Jiazheng Xu、Bin Xu、Juanzi Li、Yuxiao Dong、Ming Ding、Jie 
Tang\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.03079)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=3bf842dec99016da2d309ea8cbd7e25343032317)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3bf842dec99016da2d309ea8cbd7e25343032317)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FCogVLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVLM)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](http:\u002F\u002F36.103.203.44:7861\u002F)\n\n+ **MiniGPT-v2：大型语言模型作为视觉-语言多任务学习的统一接口**（2023年10月14日）\u003Cdetails>\u003Csummary>Jun Chen、Deyao Zhu、Xiaoqian Shen 等\u003C\u002Fsummary>Jun Chen、Deyao Zhu、Xiaoqian Shen、Xiang Li、Zechun Liu、Pengchuan Zhang、Raghuraman Krishnamoorthi、Vikas Chandra、Yunyang Xiong、Mohamed Elhoseiny\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.09478)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-97-blue.svg?paper=1ddbd08ad8cf22a5c66c4242194c4286328533bf)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1ddbd08ad8cf22a5c66c4242194c4286328533bf)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fminigpt-v2.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FMiniGPT-4.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVision-CAIR\u002FMiniGPT-v2)\n\n+ **OphGLM：基于指令和对话的眼科多模态大型语言模型训练**（2023年6月21日）\u003Cdetails>\u003Csummary>高伟豪、邓卓、牛志远等\u003C\u002Fsummary>高伟豪、邓卓、牛志远、荣福居、陈楚成、龚政、张文泽、肖代敏、李芳、曹振杰、马兆义、魏文斌、马兰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.12174)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=0f8d12775a4685575f1489796b5dee9e11fbdfb5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0f8d12775a4685575f1489796b5dee9e11fbdfb5)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fminigpt-v2.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FML-AILab\u002FOphGLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FML-AILab\u002FOphGLM)\n\n\n+ **InternLM-XComposer：用于高级文本-图像理解与生成的视觉-语言大模型**（2023年9月26日）\u003Cdetails>\u003Csummary>张攀、董晓艺、王斌等\u003C\u002Fsummary> 张攀、董晓艺、王斌、曹宇航、徐超、欧阳林科、赵志远、段浩东、张松阳、丁双瑞、张文伟、严航、张欣悦、李伟、李静雯、陈凯、何聪辉、张兴成、乔宇、林大华、王佳琪\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15112)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-48-blue.svg?paper=c1e450284e7d6cac1855330a1197df8537df653f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc1e450284e7d6cac1855330a1197df8537df653f)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer)\n\n+ **[LaVIT] 
基于动态离散视觉分词的统一语言-视觉预训练LLM**（2023年9月9日）\u003Cdetails>\u003Csummary>金杨、许坤、许坤等\u003C\u002Fsummary>金杨、许坤、许坤、陈立伟、廖超、谭建超、黄曲哲、陈彬、雷晨毅、刘安、宋承儒、雷小强、张迪、欧文武、盖坤、穆亚东\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04669)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=bcac614f9774488447221ebb4f16f05e3975ec1e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbcac614f9774488447221ebb4f16f05e3975ec1e)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjy0205\u002FLaVIT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fjy0205\u002FLaVIT)\n`tokenizer`\n\n+ **Qwen-VL：用于理解、定位、文本识别等任务的多功能视觉-语言模型**（2023年8月24日）\u003Cdetails>\u003Csummary>白金泽、白帅、杨树生等\u003C\u002Fsummary>白金泽、白帅、杨树生、王世杰、谭思南、王鹏、林俊洋、周畅、周靖人\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.12966)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-69-blue.svg?paper=fc6a2f7478f68adefd69e2071f27e38aa1647f2f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffc6a2f7478f68adefd69e2071f27e38aa1647f2f)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-VL\u002Fblob\u002Fmaster\u002FTUTORIAL.md)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen-VL.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-VL)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fqwen\u002FQwen-VL-Chat-Demo\u002Fsummary)\n\n+ **VisionLLM：大型语言模型也是面向视觉任务的开放式解码器**（2023年5月18日）\u003Cdetails>\u003Csummary>[NeurIPS 2023] 王文海、陈哲、陈孝康等\u003C\u002Fsummary>王文海、陈哲、陈孝康、吴建楠、朱锡洲、曾刚、罗平、陆通、周杰、乔宇、戴继峰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11175)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-134-blue.svg?paper=42a30dc5470f54ec249f25d3c31e05d7c376c8e3)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F42a30dc5470f54ec249f25d3c31e05d7c376c8e3)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FVisionLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FVisionLLM)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternGPT)\n\n+ **InstructBLIP：通过指令微调迈向通用型视觉-语言模型**（2023年5月11日）\u003Cdetails>\u003Csummary>戴文亮、李俊楠、李东旭等\u003C\u002Fsummary>戴文亮、李俊楠、李东旭、安东尼·孟华特·童、赵俊奇、王伟胜、李博阳、冯雁、史蒂文·霍伊\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06500)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-474-blue.svg?paper=8bd6a2a89503be083176f2cc26fabedb79238cbd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8bd6a2a89503be083176f2cc26fabedb79238cbd)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsalesforce\u002FLAVIS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS)\n\n+ 
**MiniGPT-4：利用先进大型语言模型提升视觉-语言理解能力**（2023年4月20日）\u003Cdetails>\u003Csummary>朱德耀、陈军、沈晓倩等\u003C\u002Fsummary>朱德耀、陈军、沈晓倩、李翔、穆罕默德·埃尔霍西尼\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.10592)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-649-blue.svg?paper=ca6a2bc279be5a3349a22bfd6866ed633d18734b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fca6a2bc279be5a3349a22bfd6866ed633d18734b)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fminigpt-4.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FMiniGPT-4.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVision-CAIR\u002FMiniGPT-v2)\n\n+ **视觉指令微调**（2023年4月17日）\u003Cdetails>\u003Csummary>[NeurIPS 2023（口头报告）] 刘浩天等\u003C\u002Fsummary>刘浩天、李春元、吴庆阳、李勇宰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08485)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=1a8eb2cae1833df3bf12fe3b41b03d60b4a4a98d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1a8eb2cae1833df3bf12fe3b41b03d60b4a4a98d)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fllava-vl.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaotian-liu\u002FLLaVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fllava.hliu.cc\u002F)\n\n\n\n\n## 视频理解\n\n+ **StoryTeller：通过全局音视频角色识别改进长视频描述**（2024年11月11日）\u003Cdetails>\u003Csummary>何一晨、林源、吴建超等\u003C\u002Fsummary>何一晨、林源、吴建超、张汉冲、张宇辰、乐瑞成\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.07076)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhyc2026\u002FStoryTeller.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fhyc2026\u002FStoryTeller)\n\n\n\n+ **Video-XL：用于小时级视频理解的超长视觉语言模型**（2024年9月22日）\u003Cdetails>\u003Csummary>舒岩、张培田、刘征等\u003C\u002Fsummary>舒岩、张培田、刘征、秦明浩、周俊杰、黄铁军、赵博\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.14485)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=ad6f68db45aaebc0e61b342d03da4c2702ce5697)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVideo-XL%3A-Extra-Long-Vision-Language-Model-for-Shu-Zhang\u002Fad6f68db45aaebc0e61b342d03da4c2702ce5697)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVectorSpaceLab\u002FVideo-XL.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FVectorSpaceLab\u002FVideo-XL\u002Ftree\u002Fmain)\n\n\n\n+ **Oryx MLLM：任意分辨率下的按需时空理解**（2024年9月19日）
\u003Cdetails>\u003Csummary>刘祖言、董宇豪、刘子威等\u003C\u002Fsummary>刘祖言、董宇豪、刘子威、胡文森、陆继文、饶永明\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12961)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=85c514db4e90e1fd4200d858353f27a3cc2c29ad)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FOryx-MLLM%3A-On-Demand-Spatial-Temporal-Understanding-Liu-Dong\u002F85c514db4e90e1fd4200d858353f27a3cc2c29ad)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Foryx-mllm.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOryx-mllm\u002FOryx.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOryx-mllm\u002FOryx?tab=readme-ov-file)\n\n\n\n+ **VideoLLaMA 2：推进视频大语言模型中的时空建模与音频理解**（2024年4月25日）\u003Cdetails>\u003Csummary>程泽森、冷思聪、张航等\u003C\u002Fsummary>程泽森、冷思聪、张航、辛一飞、李欣、陈冠政、朱永新、张文琪、罗子阳、赵德利、邴立东\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07476)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=7115e1f6cdd91ef09737d5a13664d9489fe27e08)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FVideoLLaMA-2%3A-Advancing-Spatial-Temporal-Modeling-Cheng-Leng\u002F7115e1f6cdd91ef09737d5a13664d9489fe27e08)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideoLLaMA2.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2)\n\n+ **PLLaVA：从图像到视频的无参数LLaVA扩展，用于视频密集字幕生成**（2024年4月25日）\u003Cdetails>\u003Csummary>徐林、赵怡琳、周大泉等\u003C\u002Fsummary>徐林、赵怡琳、周大泉、林志杰、吴锡强、冯嘉实\u003C\u002Fdetails>\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fpllava.github.io\u002F)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=9d29da83aba362c728c36f4dea9dde678ae3e2b2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FPLLaVA-%3A-Parameter-free-LLaVA-Extension-from-Images-Xu-Zhao\u002F9d29da83aba362c728c36f4dea9dde678ae3e2b2)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmagic-research\u002FPLLaVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmagic-research\u002FPLLaVA?tab=readme-ov-file)\n\n+ **MovieChat：从密集标记到稀疏记忆，用于长视频理解**（2023年12月3日） \\\n宋恩鑫等人。 \\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16449)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-16-blue.svg?paper=6f9b7c8cde1be2e62a503c31cac883c6d44c9d0d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6f9b7c8cde1be2e62a503c31cac883c6d44c9d0d)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frese1f\u002FMovieChat.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat)\n\n+ **LLaMA-VID：在大型语言模型中，一张图片胜过两个标记**（2023年11月28日） \\\nYanwei Li 等人。 
\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17043)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=486c2df78cbb770a90a55f7fa3fe19102fba2c24)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F486c2df78cbb770a90a55f7fa3fe19102fba2c24)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLaMA-VID.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID)\n\n+ **Video-Bench：评估基于视频的大语言模型的综合基准和工具包**（2023年11月27日）\\\n宁木楠等人。 \\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16103)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=b037bb09aa162d8a543e64ec777ca0edc732d2af)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb037bb09aa162d8a543e64ec777ca0edc732d2af)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FVideo-Bench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-Bench)\n\n+ **PG-Video-LLaVA：像素对齐的大视频-语言模型**（2023年11月22日）\\\n谢汉·穆纳辛格等人。 \\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.13435)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=4edbb942c2d20a6f5a4e3caa763a9761be953231)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4edbb942c2d20a6f5a4e3caa763a9761be953231)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FVideo-LLaVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-LLaVA)\n[![项目页](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-LLaVA\u002F)\n\n+ **Video-LLaVA：通过对齐后再投影学习统一的视觉表征**（2023年11月16日）\\\n林斌等人。 \\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10122)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-19-blue.svg?paper=107fb6eec2febbae12db29bf3e311aaf5680027c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F107fb6eec2febbae12db29bf3e311aaf5680027c)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FVideo-LLaVA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FVideo-LLaVA)\n\n+ **Chat-UniVi：统一的视觉表示赋能大型语言模型实现图像与视频理解**（2023年11月14日）\\\n金鹏等人。 \\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.08046)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=aad3d2e690f6c73f04a14622ceff51464bbc560e)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Faad3d2e690f6c73f04a14622ceff51464bbc560e)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FChat-UniVi.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FChat-UniVi)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FChat-UniVi\u002FChat-UniVi)\n\n\n+ **Video-LLaMA：面向视频理解的指令微调音视频语言模型**（2023年6月5日）\\\n张航、李欣、邴立东（EMNLP 2023 演示赛道）。
 \\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.02858)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-176-blue.svg?paper=5d321194696f1f75cf9da045e6022b2f20ba5b9c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5d321194696f1f75cf9da045e6022b2f20ba5b9c)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideo-LLaMA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideo-LLaMA)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FDAMO-NLP-SG\u002FVideo-LLaMA)\n\n+ **AntGPT：大型语言模型能否助力从视频中进行长期动作预测？**（2023年7月31日）\\\n赵琪等人。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16368)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=6024f320e0a5b9b8fc29b86903aa9a96956b26dd)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6024f320e0a5b9b8fc29b86903aa9a96956b26dd)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fbrown-palm.github.io\u002FAntGPT\u002F)\n\n+ **Valley：具备大型语言模型增强能力的视频助手**（2023年6月12日）\\\n罗睿璞等人。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.07207)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-42-blue.svg?paper=4c4d176c6e28f48041f215d563f6ee8633534cff)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4c4d176c6e28f48041f215d563f6ee8633534cff)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fvalley-vl.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRupertLuo\u002FValley.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FRupertLuo\u002FValley)\n\n+ **Video-ChatGPT：通过大型视觉与语言模型实现精细化视频理解**（2023年6月8日）\\\n穆罕默德·马兹、哈努娜·拉希德、萨尔曼·汗等人。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05424)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-107-blue.svg?paper=bf7025a2e5dbb3c09deae02a1aa98a256ca559e2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbf7025a2e5dbb3c09deae02a1aa98a256ca559e2)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FVideo-ChatGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT)\n\n+ **VideoChat：以聊天为中心的视频理解**（2023年5月10日）\\\n李坤昌等人。 \\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06355)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-133-blue.svg?paper=d48cb91b9e555194f7494c4d4bb9815021d3ee45)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd48cb91b9e555194f7494c4d4bb9815021d3ee45)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FAsk-Anything.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything)\n\n+ 
**VideoLLM：利用大型语言模型建模视频序列**（2023年5月22日）\\\n陈国等人。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13292)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-33-blue.svg?paper=f9bfc6d9ba1665b73af3323d46c7642b852759ef)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff9bfc6d9ba1665b73af3323d46c7642b852759ef)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcg1177\u002FVideoLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fcg1177\u002FVideoLLM)\n\n+ **在自然语言监督下学习视频嵌入空间**（2023年3月25日）\\\n法尼·克里希纳·乌帕拉、施丽蒂·普里亚和瓦伊黛希·乔希。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.14584)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=4e54a45d2118b61ae1baec07308af3fdd2c48759)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4e54a45d2118b61ae1baec07308af3fdd2c48759)\n\n## 3D理解\n+ **Lexicon3D：探究用于复杂3D场景理解的视觉基础模型**（2024年10月12日）\u003Cdetails>\u003Csummary>[NeurIPS 2024] 云泽·曼、郑淑红、鲍志鹏等\u003C\u002Fsummary>云泽·曼、郑淑红、鲍志鹏、马蒂尔·赫贝尔、桂良燕、王宇雄\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03757)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=10213cac3343964f746e99e223818b052d07b775)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F10213cac3343964f746e99e223818b052d07b775)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fyunzeman.github.io\u002Flexicon3d\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYunzeMan\u002FLexicon3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYunzeMan\u002FLexicon3D)\n\n\n+ **Situation3D：情境感知在3D视觉语言推理中的重要性**（2024年10月12日）\\\n[CVPR 2024] 云泽·曼、桂良燕、王宇雄 \\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07544)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=c31970f0b5a3ffa41ab604ae29ffd323af277538)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc31970f0b5a3ffa41ab604ae29ffd323af277538)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fyunzeman.github.io\u002Fsituation3d\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYunzeMan\u002FSituation3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYunzeMan\u002FSituation3D)\n\n\n\n+ **LL3DA：面向全维度3D理解、推理与规划的视觉交互式指令微调**（2023年11月30日）\u003Cdetails>\u003Csummary>[CVPR 2024] 陈思进、陈鑫、张驰等\u003C\u002Fsummary>陈思进、陈鑫、张驰、李明胜、于刚、费浩、朱宏远、范家源、陈涛\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.18651)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=fc53f8f3a84f1fc4993689d8f98cf6551d07a22d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ffc53f8f3a84f1fc4993689d8f98cf6551d07a22d)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpen3DA\u002FLL3DA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpen3DA\u002FLL3DA)\n\n\n+ **LiDAR-LLM：探索大型语言模型在3D 
LiDAR理解中的潜力**（2023年12月21日）\\\n杨森乔*、刘嘉铭*、雷·张等。\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14074)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=5edf706467dc76cd09319592d18db0ad4e1fb64d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5edf706467dc76cd09319592d18db0ad4e1fb64d)\n\n+ **3D-LLM：将3D世界注入大型语言模型**（2023年7月24日）\u003Cdetails>\u003Csummary>[NeurIPS 2023 Spotlight] 洪怡宁、甄浩宇、陈培豪等\u003C\u002Fsummary>洪怡宁、甄浩宇、陈培豪、郑淑红、杜一伦、陈振芳、甘创\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.12981)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-48-blue.svg?paper=7637ed79d30d0139901175ae4abedd822c217ab4)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F7637ed79d30d0139901175ae4abedd822c217ab4)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUMass-Foundation-Model\u002F3D-LLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FUMass-Foundation-Model\u002F3D-LLM)\n\n+ **PointLLM：赋能大型语言模型理解点云数据**（2023年8月31日）\u003Cdetails>\u003Csummary>[NeurIPS 2023 Spotlight] 徐润森、王小龙、王泰等\u003C\u002Fsummary>徐润森、王小龙、王泰、陈一伦、庞江淼、林大华\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16911.pdf)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-31-blue.svg?paper=6bcc6ab9c28805d4067e99b2cdc7524550fe80e1)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6bcc6ab9c28805d4067e99b2cdc7524550fe80e1)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenRobotLab\u002FPointLLM.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FPointLLM)\n\n+ **PointCLIP：通过CLIP实现点云理解**（2021年12月）\u003Cdetails>\u003Csummary>[CVPR 2022] 张仁瑞、郭子宇、张伟等\u003C\u002Fsummary>张仁瑞、郭子宇、张伟、李坤昌、缪旭鹏、崔斌、乔宇、高鹏、李洪生\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02413.pdf)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-215-blue.svg?paper=f3ce9ba3fcec362b70263a7ed63d9404975496a0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff3ce9ba3fcec362b70263a7ed63d9404975496a0)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZrrSkywalker\u002FPointCLIP.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FZrrSkywalker\u002FPointCLIP)\n\n\n\n## 音频理解\n+ **Unified-IO 2：扩展具有视觉、语言、音频和动作能力的自回归多模态模型**（2023年12月28日）\u003Cdetails>\u003Csummary>陆嘉森、克里斯托弗·克拉克、李相浩等\u003C\u002Fsummary>陆嘉森、克里斯托弗·克拉克、李相浩、张子辰、萨维亚·科斯拉、瑞安·马滕、德里克·霍伊姆、阿尼鲁达·肯布哈维\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17172)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=6c64ddd2190909de2c680dd18abc9b92e80c39f9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6c64ddd2190909de2c680dd18abc9b92e80c39f9)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Funified-io-2.allenai.org\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fallenai\u002Funified-io-2.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fallenai\u002Funified-io-2)\n\n+ 
**M2UGen：利用大型语言模型的力量进行多模态音乐理解与生成**（2023年11月19日）\u003Cdetails>\u003Csummary>阿廷·萨基尔·侯赛因、刘善松、孙晨硕等\u003C\u002Fsummary>阿廷·萨基尔·侯赛因、刘善松、孙晨硕、Ying Shan\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.11255)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=1e84d7c45f70038574fcdb7bc1b20da9b348a092)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F1e84d7c45f70038574fcdb7bc1b20da9b348a092)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcrypto-code.github.io\u002FM2UGen-Demo\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshansongliu\u002FM2UGen.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshansongliu\u002FM2UGen)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FM2UGen\u002FM2UGen-Demo)\n\n+ **Qwen-Audio：通过统一的大规模音频-语言模型推进通用音频理解**（2023年11月14日）\u003Cdetails>\u003Csummary>楚云飞、徐进、周晓欢等\u003C\u002Fsummary>楚云飞、徐进、周晓欢、杨倩、张士亮、闫志杰、周畅、周靖人\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.07919)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-14-blue.svg?paper=f90595f99a0c66d2bb6d0f230f17c7cd8c58f44d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff90595f99a0c66d2bb6d0f230f17c7cd8c58f44d)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen-Audio.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-Audio)\n\n+ **SALMONN：迈向大型语言模型的通用听觉能力**（2023年10月20日）\u003Cdetails>\u003Csummary>汤昌立、于文义、孙广志等\u003C\u002Fsummary>汤昌立、于文义、孙广志、陈贤昭、谭天、李伟、陆璐、马泽军、张超\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.13289)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=f72be31de9f9a09d4410fd38bc717efe43444827)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff72be31de9f9a09d4410fd38bc717efe43444827)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ftsinghua-ee\u002FSALMONN-7B-gradio)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FSALMONN.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FSALMONN)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Ftsinghua-ee\u002FSALMONN)\n\n+ **MusicAgent：基于大型语言模型的音乐理解与生成AI代理**（2023年10月18日）\u003Cdetails>\u003Csummary>俞丁瑶、宋凯涛、陆佩玲等\u003C\u002Fsummary>俞丁瑶、宋凯涛、陆佩玲、何天宇、谭旭、叶伟、张世坤、卞江\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11954)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=beaf64df85f8204b8cd89a7f46827608e6d16922)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fbeaf64df85f8204b8cd89a7f46827608e6d16922)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002Fmuzic.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmuzic\u002Ftree\u002Fmain\u002Fmusicagent)\n\n+ 
**LLark：用于音乐的多模态基础模型**（2023年10月11日）\u003Cdetails>\u003Csummary>乔什·加德纳、西蒙·杜兰、丹尼尔·斯托勒等\u003C\u002Fsummary>乔什·加德纳、西蒙·杜兰、丹尼尔·斯托勒、瑞秋·M·比特纳\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07160)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=86e75cf15a838ed7d672fb114beff727d7210ca5)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F86e75cf15a838ed7d672fb114beff727d7210ca5)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fstorage.googleapis.com\u002Fmusic2text-public\u002Findex.html)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fspotify-research\u002Fllark.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fspotify-research\u002Fllark)\n\n+ **LauraGPT：使用GPT聆听、注意、理解并再生音频**（2023年10月7日）\u003Cdetails>\u003Csummary>王嘉明、杜志浩、陈谦等\u003C\u002Fsummary>王嘉明、杜志浩、陈谦、楚云飞、高志福、李泽睿、胡凯、周晓欢、徐进、马子洋、王文、郑思琪、周畅、闫志杰、张士亮\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.04673)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=ffa05cb5504ba08254f498223f613b3ebcf87692)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fffa05cb5504ba08254f498223f613b3ebcf87692)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Flauragpt.github.io\u002F)\n\n+ **利用细粒度音频特征、文本嵌入监督和LLM混合增强改进音频字幕生成模型**（2023年9月29日）\u003Cdetails>\u003Csummary>吴士伦、常轩凯、戈登·维彻恩等\u003C\u002Fsummary>吴士伦、常轩凯、戈登·维彻恩、郑智源、弗朗索瓦·热尔曼、乔纳森·勒鲁、渡边真司\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17352)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=8f0a24d1678e4d0e584b0932196cd257d5c53c7d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8f0a24d1678e4d0e584b0932196cd257d5c53c7d)\n\n+ **将语音编码器与大型语言模型连接以实现自动语音识别**（2023年9月25日）\u003Cdetails>\u003Csummary>于文义、汤昌立、孙广志等\u003C\u002Fsummary>于文义、汤昌立、孙广志、陈贤昭、谭天、李伟、陆璐、马泽军、张超\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.13963)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=5596bd3e26ec2207666ec1ff3db4415d212f14b9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5596bd3e26ec2207666ec1ff3db4415d212f14b9)\n\n+ **Whisper能否进行基于语音的上下文学习？**（2023年9月13日）\u003Cdetails>\u003Csummary>王思寅、杨朝汉、吴继、张超\u003C\u002Fsummary>王思寅、杨朝汉、吴继、张超\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.07081)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=3a944ddba8b6fbaaac36126fc955f181f8b8b06a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3a944ddba8b6fbaaac36126fc955f181f8b8b06a)\n\n+ 
**音乐理解版LLaMA：通过问答和字幕生成推进文本到音乐的创作**（2023年8月22日）\u003Cdetails>\u003Csummary>刘善松、阿廷·萨基尔·侯赛因、孙晨硕等\u003C\u002Fsummary>刘善松、阿廷·萨基尔·侯赛因、孙晨硕、Ying Shan\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.11276)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-11-blue.svg?paper=a33b437618be733fea7176bd98e18b6362af0838)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fa33b437618be733fea7176bd98e18b6362af0838)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fcrypto-code.github.io\u002FMU-LLaMA-Demo\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcrypto-code\u002FMU-LLaMA.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fcrypto-code\u002FMU-LLaMA)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmu-llama\u002FMusicQA)\n\n+ **关于仅解码器架构在语音转文本与大型语言模型集成中的应用**（2023年7月8日）\u003Cdetails>\u003Csummary>吴健、雅谢什·高尔、陈卓等\u003C\u002Fsummary>吴健、雅谢什·高尔、陈卓、周龙、朱一梦、王天锐、李金宇、刘淑洁、任波、刘林泉、吴宇\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.03917)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-24-blue.svg?paper=8e1868f84091272544cb4209c4ccaad7cc88af27)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8e1868f84091272544cb4209c4ccaad7cc88af27)\n\n+ **AudioPaLM：一款能说会听的大型语言模型**（2023年6月22日）\u003Cdetails>\u003Csummary>Paul K. Rubenstein、Chulayuth Asawaroengchai、Duc Dung Nguyen 等\u003C\u002Fsummary>Paul K. Rubenstein、Chulayuth Asawaroengchai、Duc Dung Nguyen、Ankur Bapna、Zalán Borsos、Félix de Chaumont Quitry、Peter Chen、Dalia El Badawy、Wei Han、Eugene Kharitonov、Hannah Muckenhirn、Dirk Padfield、James Qin、Danny Rozenberg、Tara Sainath、Johan Schalkwyk、Matt Sharifi、Michelle Tadmor Ramanovich、Marco Tagliasacchi、Alexandru Tudor、Mihajlo Velimirović、Damien Vincent、Jiahui Yu、Yongqiang Wang、Vicky Zayats、Neil Zeghidour、Yu Zhang、Zhishuai Zhang、Lukas Zilka、Christian Frank\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.12925)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-69-blue.svg?paper=3efb81de24eb88017d6dbcf22cb4215084223fd8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3efb81de24eb88017d6dbcf22cb4215084223fd8)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fgoogle-research.github.io\u002Fseanet\u002Faudiopalm\u002Fexamples\u002F)\n\n+ **HuggingGPT：利用 ChatGPT 及其在 Hugging Face 中的伙伴解决 AI 任务**（2023年3月30日）\u003Cdetails>\u003Csummary>Shen Yongliang、Song Kaitao、Tan Xu 等\u003C\u002Fsummary>Shen Yongliang、Song Kaitao、Tan Xu、Li Dongsheng、Lu Weiming、Zhuang 
Yueting\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.17580)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-413-blue.svg?paper=d1120d67b700e4dfe8b39eb1e48fbdea4e1a0c43)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd1120d67b700e4dfe8b39eb1e48fbdea4e1a0c43)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FJARVIS.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FJARVIS)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FHuggingGPT)\n\n+ **人工通用智能的火花：GPT-4 的早期实验**（2023年3月22日）\u003Cdetails>\u003Csummary>Sébastien Bubeck、Varun Chandrasekaran、Ronen Eldan 等\u003C\u002Fsummary>Sébastien Bubeck、Varun Chandrasekaran、Ronen Eldan、Johannes Gehrke、Eric Horvitz、Ece Kamar、Peter Lee、Yin Tat Lee、Yuanzhi Li、Scott Lundberg、Harsha Nori、Hamid Palangi、Marco Tulio Ribeiro、Yi Zhang\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.12712)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1407-blue.svg?paper=574beee702be3856d60aa482ec725168fe64fc99)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F574beee702be3856d60aa482ec725168fe64fc99)\n\n+ **听、思考与理解**（2023年5月18日）\u003Cdetails>\u003Csummary>Gong Yuan、Luo Hongyin、Liu Alexander H. 等\u003C\u002Fsummary>Gong Yuan、Luo Hongyin、Liu Alexander H.、Karlinsky Leonid、Glass James\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10790)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-33-blue.svg?paper=4bb0b12803791764d641a4cef1e0ce39cf049542)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4bb0b12803791764d641a4cef1e0ce39cf049542)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fyuangongfdu\u002FLTU)\n\n+ **SpeechGPT：赋予大型语言模型内在的跨模态对话能力**（2023年5月18日）\u003Cdetails>\u003Csummary>Zhang Dong、Li Shimin、Zhang Xin 等\u003C\u002Fsummary>Zhang Dong、Li Shimin、Zhang Xin、Zhan Jun、Wang Pengyu、Zhou Yaqian、Qiu Xipeng\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.11000)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-76-blue.svg?paper=5cac6430bd379c9d2fe13137dfd6ae7721a2679f)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5cac6430bd379c9d2fe13137dfd6ae7721a2679f)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002F0nutation.github.io\u002FSpeechGPT.github.io\u002F)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F0nutation\u002FSpeechGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002F0nutation\u002FSpeechGPT)\n\n+ **AudioGPT：理解并生成语音、音乐、声音及说话人头像**（2023年4月25日）\u003Cdetails>\u003Csummary>Huang Rongjie、Li Mingze、Yang Dongchao 等\u003C\u002Fsummary>Huang Rongjie、Li Mingze、Yang Dongchao、Shi Jiatong、Chang Xuankai、Ye Zhenhui、Wu Yuning、Hong Zhiqing、Huang Jiawei、Liu Jinglin、Ren Yi、Zhao Zhou、Watanabe 
Shinji\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.12995)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-83-blue.svg?paper=8bc617c9139648d7a92991d70c671230bac7b2e2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8bc617c9139648d7a92991d70c671230bac7b2e2)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIGC-Audio\u002FAudioGPT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FAIGC-Audio\u002FAudioGPT)\n[![演示](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-EEAD0E)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAIGC-Audio\u002FAudioGPT)\n\n\n\n\n\n# 📍 多模态大语言模型安全\n## 攻击\n\n+ **通过系统提示词的自我对抗攻击越狱 GPT-4v。**（2024年1月20日）\u003Cdetails>\u003Csummary>Wu Yuanwei、Li Xiang、Liu Yixin 等\u003C\u002Fsummary>Wu Yuanwei、Li Xiang、Liu Yixin、Zhou Pan、Sun Lichao\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.09127)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-8-blue.svg?paper=18a8b97d75a87e8fef07542d8875d4a62b553744)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F18a8b97d75a87e8fef07542d8875d4a62b553744)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FThuCCSLab\u002Flm-ssp.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FThuCCSLab\u002Flm-ssp)\n\n+ **通过自我提醒防御 ChatGPT 遭受越狱攻击。**（2023年12月1日）\u003Cdetails>\u003Csummary>Xie Yueqi、Yi Jingwei、Shao Jiawei 等\u003C\u002Fsummary>Xie Yueqi、Yi Jingwei、Shao Jiawei、Justin Curl、Lyu Lingjuan、Chen Qifeng、Xie Xing、Wu Fangzhao\u003C\u002Fdetails>\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-32-blue.svg?paper=e762f92273cd96f63b7788c0173b9b6450adedd7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe762f92273cd96f63b7788c0173b9b6450adedd7)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyjw1029\u002FSelf-Reminder.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fyjw1029\u002FSelf-Reminder)\n\n+ **利用视觉对抗样本在大型语言模型中滥用工具**（2023年10月4日）\u003Cdetails>\u003Csummary>Fu Xiaohan、Wang Zihan、Li Shuheng 等\u003C\u002Fsummary>Fu Xiaohan、Wang Zihan、Li Shuheng、Gupta Rajesh K.、Mireshghallah Niloofar、Berg-Kirkpatrick Taylor、Fernandes Earlence\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.03185)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-6-blue.svg?paper=ac5b4df0e398ca48388330ac5c795b6fe708793c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fac5b4df0e398ca48388330ac5c795b6fe708793c)\n\n+ **图像劫持：对抗性图像可在运行时控制生成模型。**（2023年9月18日）\u003Cdetails>\u003Csummary>卢克·贝利、尤安·翁、斯图尔特·拉塞尔等\u003C\u002Fsummary>卢克·贝利、尤安·翁、斯图尔特·拉塞尔、斯科特·埃蒙斯\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.00236)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-22-blue.svg?paper=5bdaadb84db0cbf72aaebda9f55f4288b63c6e9b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F5bdaadb84db0cbf72aaebda9f55f4288b63c6e9b)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Feuanong\u002Fimage-hijacks.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Feuanong\u002Fimage-hijacks)\n\n+ 
**对齐语言模型的通用且可迁移的对抗攻击**（2023年7月27日）\u003Cdetails>\u003Csummary>安迪·邹、王子凡、尼古拉斯·卡尔尼尼等\u003C\u002Fsummary>安迪·邹、王子凡、尼古拉斯·卡尔尼尼、米拉德·纳斯尔、J·齐科·科尔特、马特·弗雷德里克森\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.15043)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-309-blue.svg?paper=47030369e97cc44d4b2e3cf1be85da0fd134904a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F47030369e97cc44d4b2e3cf1be85da0fd134904a)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fllm-attacks\u002Fllm-attacks.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fllm-attacks\u002Fllm-attacks)\n\n+ **针对集成LLM的应用程序的提示注入攻击**（2023年6月8日）\u003Cdetails>\u003Csummary>刘毅、邓戈磊、李岳康等\u003C\u002Fsummary>刘毅、邓戈磊、李岳康、王凯龙、张天伟、刘业鹏、王浩宇、郑岩、刘洋\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05499)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-77-blue.svg?paper=db4cf9f6a653d5c15973e836c800ea47743251ae)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdb4cf9f6a653d5c15973e836c800ea47743251ae)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLMSecurity\u002FHouYi.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FLLMSecurity\u002FHouYi)\n\n+ **通过离散优化自动审计大型语言模型**（2023年3月8日）\u003Cdetails>\u003Csummary>埃里克·琼斯、安卡·德拉甘、阿迪蒂·拉古纳坦等\u003C\u002Fsummary>埃里克·琼斯、安卡·德拉甘、阿迪蒂·拉古纳坦、雅各布·施泰因哈特\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.04381)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-64-blue.svg?paper=2f94f03fdac62d05f0f416b7b3855d1f597afee9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2f94f03fdac62d05f0f416b7b3855d1f597afee9)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fejones313\u002Fauditing-llms.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fejones313\u002Fauditing-llms)\n\n+ **污染网络规模训练数据集是可行的**（2023年2月20日）\u003Cdetails>\u003Csummary>尼古拉斯·卡尔尼尼、马修·雅吉尔斯基、克里斯托弗·A·乔克特-丘等\u003C\u002Fsummary>尼古拉斯·卡尔尼尼、马修·雅吉尔斯基、克里斯托弗·A·乔克特-丘、丹尼尔·帕莱卡、威尔·皮尔斯、海勒姆·安德森、安德烈亚斯·特尔齐斯、库尔特·托马斯、弗洛里安·特拉默\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.10149)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-61-blue.svg?paper=2cf43a61d0937ad25f23eaef7c90253ab799b3c7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2cf43a61d0937ad25f23eaef7c90253ab799b3c7)\n\n+ **利用LLM的程序化行为：通过标准安全攻击实现双重用途。**（2023年2月11日）\u003Cdetails>\u003Csummary>丹尼尔·康、李雪晨、伊昂·斯托伊卡等\u003C\u002Fsummary>丹尼尔·康、李雪晨、伊昂·斯托伊卡、卡洛斯·格斯特林、马泰伊·扎哈里亚、Tatsunori Hashimoto\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.05733)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-94-blue.svg?paper=0cf694b8f85ab2e11d45595de211a15cfbadcd22)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0cf694b8f85ab2e11d45595de211a15cfbadcd22)\n\n+ **忽略先前提示：语言模型的攻击技术**（2022年11月17日）\\\n法比奥·佩雷斯、伊恩·里贝罗（NeurIPS 2022 研讨会）
\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.09527)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-151-blue.svg?paper=9716a2876d08fce9d8e5c5ba4d7b1a9af44806d6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9716a2876d08fce9d8e5c5ba4d7b1a9af44806d6)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fagencyenterprise\u002FPromptInject.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fagencyenterprise\u002FPromptInject)\n\n+ **用于攻击和分析NLP的通用对抗触发器**（2019年8月20日）\u003Cdetails>\u003Csummary>埃里克·华莱士、冯诗、尼基尔·坎德帕尔等（EMNLP 2019）\u003C\u002Fsummary>埃里克·华莱士、冯诗、尼基尔·坎德帕尔、马特·加德纳、萨米尔·辛格\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.07125)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-613-blue.svg?paper=18a1c21f35153c45d0ef30c564bffb7d70a13ccc)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F18a1c21f35153c45d0ef30c564bffb7d70a13ccc)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEric-Wallace\u002Funiversal-triggers.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FEric-Wallace\u002Funiversal-triggers)\n\n+ **用于评估阅读理解系统的对抗样本**（2017年7月23日）\\\n罗宾·贾、珀西·梁（EMNLP 2017）\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.07328)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1445-blue.svg?paper=ffb949d3493c3b2f3c9acf9c75cb03938933ddf0)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fffb949d3493c3b2f3c9acf9c75cb03938933ddf0)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frobinjia\u002Fadversarial-squad.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Frobinjia\u002Fadversarial-squad)\n\n## 防御与检测\n+ **利用大型多模态视觉语言模型检测并纠正多模态表情包中的仇恨言论。**（2023年11月12日）\\\n范明浩、吴新涛\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.06737)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-5-blue.svg?paper=60f4dc690ea42fb77b04fc685e9d9c3a1e209319)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F60f4dc690ea42fb77b04fc685e9d9c3a1e209319)\n\n+ **从大型语言模型中检测预训练数据**（2023年11月3日）\u003Cdetails>\u003Csummary>史伟嘉、阿尼鲁德·阿吉特、夏孟周等。\u003C\u002Fsummary>史伟嘉、阿尼鲁德·阿吉特、夏孟周、黄洋思博、刘道高、泰拉·布利文斯、陈丹琪、卢克·泽特勒莫耶\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.16789)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-20-blue.svg?paper=3422d5e0cdfdc935d6a84a1e3d3f96659265fe3a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3422d5e0cdfdc935d6a84a1e3d3f96659265fe3a)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fswj0419\u002Fdetect-pretrain-code.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fswj0419\u002Fdetect-pretrain-code)\n\n+ **仅用少量上下文示例即可越狱或守护对齐的语言模型**（2023年10月10日）\\\n魏泽明、王一飞、王艺森\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06387)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-43-blue.svg?paper=6b135e922a0c673aeb0b05c5aeecdb6c794791c6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6b135e922a0c673aeb0b05c5aeecdb6c794791c6)\n\n+ 
**SmoothLLM：防御大型语言模型免受越狱攻击。**（2023年10月5日）\u003Cdetails>\u003Csummary>亚历山大·罗比、埃里克·王、哈梅德·哈萨尼等。\u003C\u002Fsummary>亚历山大·罗比、埃里克·王、哈梅德·哈萨尼、乔治·J·帕帕斯\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.03684)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-46-blue.svg?paper=8cf9b49698fdb1b754df2556576412a7b44929f6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F8cf9b49698fdb1b754df2556576412a7b44929f6)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Farobey1\u002Fsmooth-llm.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Farobey1\u002Fsmooth-llm)\n\n+ **大型语言模型的水印技术**（2023年6月6日）\u003Cdetails>\u003Csummary>约翰·基尔兴鲍尔、乔纳斯·盖平、温宇欣等（ICML 2023）。\u003C\u002Fsummary>约翰·基尔兴鲍尔、乔纳斯·盖平、温宇欣、乔纳森·卡茨、伊恩·米尔斯、汤姆·戈德斯坦\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.10226)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-199-blue.svg?paper=cb5b71a622aff47014d4f28a958679629a8b6363)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fcb5b71a622aff47014d4f28a958679629a8b6363)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBrianPulfer\u002FLMWatermark.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FBrianPulfer\u002FLMWatermark)\n\n+ **不安全扩散：关于文本到图像模型生成不安全图像和仇恨表情包的研究**（2023年5月23日）\u003Cdetails>\u003Csummary>瞿怡婷、沈欣悦、何鑫磊等（ACM CCS 2023）。\u003C\u002Fsummary>瞿怡婷、沈欣悦、何鑫磊、迈克尔·巴克斯、萨瓦斯·赞内托、张阳\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13873)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-29-blue.svg?paper=c9e548d72f5ad72215025602be36f72042219baf)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc9e548d72f5ad72215025602be36f72042219baf)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYitingQu\u002Funsafe-diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FYitingQu\u002Funsafe-diffusion)\n\n+ **TRAK：大规模模型行为归因**（2023年4月3日）\u003Cdetails>\u003Csummary>朴成珉、克里斯蒂安·格奥尔基耶夫、安德鲁·伊利亚斯等。\u003C\u002Fsummary>朴成珉、克里斯蒂安·格奥尔基耶夫、安德鲁·伊利亚斯、纪尧姆·勒克莱尔、亚历山大·马德里\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.14186)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-32-blue.svg?paper=4f2ae5fa2dc74af9c36ee57b359a4b3241006a92)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F4f2ae5fa2dc74af9c36ee57b359a4b3241006a92)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMadryLab\u002Ftrak.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FMadryLab\u002Ftrak)\n\n+ **污染网络规模训练数据集是可行的**（2023年2月20日）\u003Cdetails>\u003Csummary>尼古拉斯·卡尔尼尼、马修·雅吉尔斯基、克里斯托弗·A·乔克特-丘等。\u003C\u002Fsummary>尼古拉斯·卡尔尼尼、马修·雅吉尔斯基、克里斯托弗·A·乔克特-丘、丹尼尔·帕莱卡、威尔·皮尔斯、海勒姆·安德森、安德烈亚斯·特尔齐斯、库尔特·托马斯、弗洛里安·特拉默\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.10149)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-61-blue.svg?paper=2cf43a61d0937ad25f23eaef7c90253ab799b3c7)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2cf43a61d0937ad25f23eaef7c90253ab799b3c7)\n\n+ **缓解扩散模型中的不当退化现象**（2022年11月9日）\u003Cdetails>\u003Csummary>帕特里克·施拉莫夫斯基、曼努埃尔·布拉克、比约恩·戴泽罗斯等（CVPR 2023）。
\u003C\u002Fsummary>帕特里克·施拉莫夫斯基、曼努埃尔·布拉克、比约恩·戴泽罗斯、克里斯蒂安·克尔斯廷\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.05105)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-76-blue.svg?paper=0231f2aed9a96cb516242fb57f2cb63f5651c4d8)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0231f2aed9a96cb516242fb57f2cb63f5651c4d8)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fml-research\u002Fsafe-latent-diffusion.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fml-research\u002Fsafe-latent-diffusion)\n\n+ **从大型语言模型中提取训练数据**（2021年6月15日）\u003Cdetails>\u003Csummary>尼古拉斯·卡尔尼尼、弗洛里安·特拉默、埃里克·华莱士等。\u003C\u002Fsummary>尼古拉斯·卡尔尼尼、弗洛里安·特拉默、埃里克·华莱士、马修·雅吉尔斯基、艾瑞尔·赫伯特-沃斯、凯瑟琳·李、亚当·罗伯茨、汤姆·布朗、宋晓冬、乌尔法尔·埃尔林松、阿丽娜·奥普雷亚、科林·拉菲尔\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.07805)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1005-blue.svg?paper=df7d26339adf4eb0c07160947b9d2973c24911ba)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdf7d26339adf4eb0c07160947b9d2973c24911ba)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshreyansh26\u002FExtracting-Training-Data-from-Large-Langauge-Models.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fshreyansh26\u002FExtracting-Training-Data-from-Large-Langauge-Models)\n\n## 对齐\n+ **直接偏好优化：你的语言模型其实是一个奖励模型**（2023年12月13日）\u003Cdetails>\u003Csummary>拉斐尔·拉法伊洛夫、阿奇特·夏尔马、埃里克·米切尔等\u003C\u002Fsummary>拉斐尔·拉法伊洛夫、阿奇特·夏尔马、埃里克·米切尔、斯特凡诺·埃尔蒙、克里斯托弗·D·曼宁、切尔西·芬恩\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18290)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-450-blue.svg?paper=0d1c76d45afa012ded7ab741194baf142117c495)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F0d1c76d45afa012ded7ab741194baf142117c495)\n\n+ **RAFT：用于生成式基础模型对齐的奖励排序微调**（2023年12月1日）\u003Cdetails>\u003Csummary>董汉泽、熊伟、迪潘舒·戈亚尔等（机器学习研究汇刊，TMLR）\u003C\u002Fsummary>董汉泽、熊伟、迪潘舒·戈亚尔、张一涵、温妮·周、潘睿、刁世哲、张继鹏、沈嘉勋、张彤\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.06767)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-116-blue.svg?paper=3ab661db57d924f4ff1706e05ac807873ca00e0a)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F3ab661db57d924f4ff1706e05ac807873ca00e0a)\n\n+ **用人类偏好更好地对齐文本到图像模型**（2023年8月22日）\u003Cdetails>\u003Csummary>吴晓石、孙克强、朱峰等（ICCV 2023）\u003C\u002Fsummary>吴晓石、孙克强、朱峰、赵锐、李洪生\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.14420)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-44-blue.svg?paper=14c3cf58192774b9b6fc6188df99efd6ab5fc739)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F14c3cf58192774b9b6fc6188df99efd6ab5fc739)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftgxs002\u002Falign_sd.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ftgxs002\u002Falign_sd)\n\n+ 
**通过奖励建模实现可扩展的智能体对齐：一个研究方向**（2018年11月19日）\u003Cdetails>\u003Csummary>扬·莱克、大卫·克鲁格、汤姆·埃弗里特等\u003C\u002Fsummary>扬·莱克、大卫·克鲁格、汤姆·埃弗里特、米利扬·马蒂奇、维沙尔·迈尼、谢恩·莱格\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.07871)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-244-blue.svg?paper=c6f913e4baa7f2c85363c0625c87003ad3b3a14c)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc6f913e4baa7f2c85363c0625c87003ad3b3a14c)\n\n+ **近端策略优化算法**（2017年7月20日）\u003Cdetails>\u003Csummary>约翰·舒尔曼、菲利普·沃尔斯基、普拉富拉·达里瓦尔等\u003C\u002Fsummary>约翰·舒尔曼、菲利普·沃尔斯基、普拉富拉·达里瓦尔、亚历克·拉德福德、奥列格·克利莫夫\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.06347)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-12282-blue.svg?paper=dce6f9d4017b1785979e7520fd0834ef8cf02f4b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdce6f9d4017b1785979e7520fd0834ef8cf02f4b)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmorikatron\u002FPPO.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fmorikatron\u002FPPO)\n\n## 数据集\n+ **GOAT-Bench：通过基于模因的社会滥用行为洞察大型多模态模型的安全性。**（2024年1月7日）\u003Cdetails>\u003Csummary>林宏展、罗子阳、王博等\u003C\u002Fsummary>林宏展、罗子阳、王博、杨瑞超、马静\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.01523)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=d98aa44f79fe798ad5ff0cac6e7bf32ee30bd156)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fd98aa44f79fe798ad5ff0cac6e7bf32ee30bd156)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FisXinLiu\u002FMLLM-Safety-Collection.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FisXinLiu\u002FMLLM-Safety-Collection)\n\n+ **ToViLaG：你的视觉-语言生成模型也可能成为作恶者。**（2023年12月13日）\u003Cdetails>\u003Csummary>王新鹏、易晓远、江涵等（EMNLP 2023 口头报告）\u003C\u002Fsummary>王新鹏、易晓远、江涵、周善林、魏志华、谢星\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.11523)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-1-blue.svg?paper=10280c290825fc0b0c884e988f4f1dedb80e4e80)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F10280c290825fc0b0c884e988f4f1dedb80e4e80)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvictorup\u002FToViLaG.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fvictorup\u002FToViLaG)\n\n+ **FigStep：通过排版视觉提示破解大型视觉-语言模型。**（2023年12月13日）\u003Cdetails>\u003Csummary>龚一辰、冉德龙、刘金元等\u003C\u002Fsummary>龚一辰、冉德龙、刘金元、王聪磊、丛天硕、王安宇、段思思、王小云\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.05608)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-7-blue.svg?paper=b78b5ce5f21f46d8149824463f8eebd6103d49aa)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fb78b5ce5f21f46d8149824463f8eebd6103d49aa)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FThuCCSLab\u002FFigStep.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FThuCCSLab\u002FFigStep)\n\n+ 
**查询相关图像可破解大型多模态模型。**（2023年11月29日）\u003Cdetails>\u003Csummary>刘欣、朱一辰、兰云石等\u003C\u002Fsummary>刘欣、朱一辰、兰云石、杨超、乔宇\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17600)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=74423a9ee66085e74cd2b2e42303f28359c74eb6)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F74423a9ee66085e74cd2b2e42303f28359c74eb6)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FisXinLiu\u002FMM-SafetyBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FisXinLiu\u002FMM-SafetyBench)\n\n+ **DRESS：通过自然语言反馈指导大型视觉-语言模型与人类对齐并互动。**（2023年11月16日）\u003Cdetails>\u003Csummary>陈洋溢、卡兰·西卡、迈克尔·科格斯韦尔等\u003C\u002Fsummary>陈洋溢、卡兰·西卡、迈克尔·科格斯韦尔、季恒、阿贾伊·迪瓦卡兰\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10081)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-9-blue.svg?paper=391eaeb1092c2b145ff0e5a2fa61637a42921fce)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F391eaeb1092c2b145ff0e5a2fa61637a42921fce)\n\n+ **BeaverTails：借助人类偏好数据集实现 LLM 更安全的对齐**（2023年11月7日）\u003Cdetails>\u003Csummary>季嘉明、刘米克尔、戴俊涛等（NeurIPS 2023）\u003C\u002Fsummary>季嘉明、刘米克尔、戴俊涛、潘学海、张驰、卞策、张驰、孙睿阳、王义舟、杨耀东\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.04657)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-56-blue.svg?paper=92930ed3560ea6c86d53cf52158bc793b089054d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F92930ed3560ea6c86d53cf52158bc793b089054d)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGaryYufei\u002FAlignLLMHumanSurvey.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FGaryYufei\u002FAlignLLMHumanSurvey)\n\n+ **预训练的视觉和语言模型能否回答视觉信息检索问题？**（2023年10月17日）\u003Cdetails>\u003Csummary>陈阳、胡赫翔、栾毅等（EMNLP 2023）\u003C\u002Fsummary>陈阳、胡赫翔、栾毅、孙海天、Soravit Changpinyo、艾伦·里特、张明伟\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.11713)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-19-blue.svg?paper=f890b4dfe915174b23db909b07c515d465eaeff2)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Ff890b4dfe915174b23db909b07c515d465eaeff2)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fedchengg\u002Finfoseek_eval.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fedchengg\u002Finfoseek_eval)\n\n+ **能否指导语言模型保护个人信息？**（2023年10月3日）\u003Cdetails>\u003Csummary>陈阳、伊森·门德斯、萨维克·达斯等\u003C\u002Fsummary>陈阳、伊森·门德斯、萨维克·达斯、许伟、艾伦·里特\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02224)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-14-blue.svg?paper=2403c8e72a90d9c778970fc0812ecdcc58800c5d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F2403c8e72a90d9c778970fc0812ecdcc58800c5d)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fethanm88\u002Fllm-access-control.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fethanm88\u002Fllm-access-control)\n\n+ 
**Safetybench：用选择题评估大型语言模型的安全性**（2023年9月13日）\u003Cdetails>\u003Csummary>张哲鑫、雷琪、吴林东等\u003C\u002Fsummary>张哲鑫、雷琪、吴林东、孙锐、黄永康、龙冲、刘晓、雷轩宇、唐杰、黄敏列\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.07045.pdf)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-13-blue.svg?paper=9b9a4fa3ed510fc6eb1bf831979235f3d9f8b556)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F9b9a4fa3ed510fc6eb1bf831979235f3d9f8b556)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-coai\u002FSafetyBench.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-coai\u002FSafetyBench)\n\n+ **中文大型语言模型的安全性评估**（2023年4月20日）\u003Cdetails>\u003Csummary>孙浩、张哲鑫、邓佳文等\u003C\u002Fsummary>孙浩、张哲鑫、邓佳文、程家乐、黄敏列\u003C\u002Fdetails>\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.10436.pdf)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-33-blue.svg?paper=59fc49dfd81b92661437eaf7e339c0792ccd8755)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F59fc49dfd81b92661437eaf7e339c0792ccd8755)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthu-coai\u002FSafety-Prompts.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fthu-coai\u002FSafety-Prompts)\n\n## 3D、视频与音频安全\n\n+ **不是我的声音！语音生成器的伦理与安全危害分类**（2024年1月25日）\\\n维布克·胡蒂里、奥雷西蒂·帕帕基里亚科普洛斯、爱丽丝·香\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01708)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-N\u002FA-blue.svg?paper=Not-My-Voice!-A-Taxonomy-of-Ethical-an)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FNot-My-Voice!-A-Taxonomy-of-Ethical-an)\n\n+ **Adv3D：使用NeRF在驾驶场景中生成3D对抗样本**（2023年9月4日）\\\n李乐恒、连青、陈英聪\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.01351)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-2-blue.svg?paper=daa6a6b2c495d002d72075c6203c98061d1e35f9)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fdaa6a6b2c495d002d72075c6203c98061d1e35f9)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEnVision-Research\u002FAdv3D.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FEnVision-Research\u002FAdv3D)\n\n+ **基于生成式卷积视觉Transformer的深度伪造视频检测**（2023年7月13日）\\\n德雷萨·沃达乔、所罗门·阿特纳夫、扎希德·阿赫塔尔\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.07036)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=86301139cc02eb53247e63fca91b916348591505)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F86301139cc02eb53247e63fca91b916348591505)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ferprogs\u002FGenConViT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ferprogs\u002FGenConViT)\n\n+ 
**M2TR：用于深度伪造检测的多模态多尺度Transformer**（2022年4月19日）\\\n王俊科、吴祖轩、欧阳文浩、韩欣彤、陈静静、林世南、蒋宇刚\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.09770)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-132-blue.svg?paper=21e0858665cddf51689fc680f72ec4e00b68ae04)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F21e0858665cddf51689fc680f72ec4e00b68ae04)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwangjk666\u002FM2TR-Multi-modal-Multi-scale-Transformers-for-Deepfake-Detection.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Fwangjk666\u002FM2TR-Multi-modal-Multi-scale-Transformers-for-Deepfake-Detection)\n\n+ **基于卷积视觉Transformer的深度伪造视频检测**（2021年3月11日）\\\n德雷萨·沃达乔、所罗门·阿特纳夫\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.11126)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-0-blue.svg?paper=86301139cc02eb53247e63fca91b916348591505)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F86301139cc02eb53247e63fca91b916348591505)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ferprogs\u002FGenConViT.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002Ferprogs\u002FGenConViT)\n\n+ **“深度伪造生成与检测：现状、开放挑战、应对措施及未来方向”**（2021年2月25日）\\\n莫米娜·马苏德、玛丽亚姆·纳瓦兹、哈立德·马赫穆德·马利克、阿里·贾韦德、奥恩·伊尔塔扎\\\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.00484)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-123-blue.svg?paper=e8f1c51c4e881345c0588bec8aa8bc6d9164a535)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fe8f1c51c4e881345c0588bec8aa8bc6d9164a535)\n\n\n\n# 📍 相关综述\n## LLM\n\n\n+ **MM-LLMs：多模态大型语言模型的最新进展**（2024年1月24日）\u003Cdetails>\u003Csummary>张笃真、于雅涵、李晨星\u003C\u002Fsummary>张笃真、于雅涵、李晨星、董家华、苏丹、褚晨辉、于东\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.13601)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-4-blue.svg?paper=a050c9b0c321839e4427ab9defa3463be7825ac4)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fa050c9b0c321839e4427ab9defa3463be7825ac4)\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-00CED1)](https:\u002F\u002Fmm-llms.github.io\u002F)\n\n\n+ **多模态大型语言模型综述**（2023年6月23日）\u003Cdetails>\u003Csummary>尹书康、傅超友、赵思睿等\u003C\u002Fsummary>尹书康、傅超友、赵思睿、李可、孙兴、徐通、陈恩洪\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.13549)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-114-blue.svg?paper=ebedc4d7a2356090904baba4104ef0832bc236df)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Febedc4d7a2356090904baba4104ef0832bc236df)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models)\n\n\n+ **多模态大型语言模型：综述**（2023年11月22日）\u003Cdetails>\u003Csummary>[IEEE BigData 2023] 
吴嘉阳、甘文胜、陈泽峰等\u003C\u002Fsummary>吴嘉阳、甘文胜、陈泽峰、万士成、菲利普·S·余\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.13165)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-14-blue.svg?paper=52941cadbd340344f3e0a6f50719fe55b3de5088)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F52941cadbd340344f3e0a6f50719fe55b3de5088)\n\n\n+ **大型语言模型综述**（2023年3月31日）\u003Cdetails>\u003Csummary>韦恩·辛·赵、周坤、李俊义等\u003C\u002Fsummary>韦恩·辛·赵、周坤、李俊义、唐天义、王小磊、侯玉鹏、闵英倩、张贝辰、张俊杰、董子灿、杜一凡、杨晨、陈宇硕、陈志鹏、江金浩、任瑞阳、李一凡、唐新宇、刘子康、刘培宇、聂建云、温继荣\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.18223)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-875-blue.svg?paper=c61d54644e9aedcfc756e5d6fe4cc8b78c87755d)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fc61d54644e9aedcfc756e5d6fe4cc8b78c87755d)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FLLMSurvey.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FLLMSurvey?tab=readme-ov-file#timeline-of-llms)\n\n## 视觉\n\n\n\n+ **视觉领域的自回归模型：综述**（2024年11月8日）\u003Cdetails>\u003Csummary>熊静、刘功业、黄伦等\u003C\u002Fsummary>熊静、刘功业、黄伦、吴成悦、吴泰强、穆瑶、姚远、沈辉、万中伟、黄金发、陶超凡、严申、姚华秀、孔令鹏、杨红霞、张密、吉列尔莫·萨皮罗、罗杰波、罗平、王义\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.05902)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChaofanTao\u002FAutoregressive-Models-in-Vision-Survey.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FChaofanTao\u002FAutoregressive-Models-in-Vision-Survey)\n\n+ **用于视觉计算的扩散模型最新进展**（2023年10月11日）\u003Cdetails>\u003Csummary>Ryan Po、王一帆、弗拉季斯拉夫·戈利亚尼克等\u003C\u002Fsummary>Ryan Po、王一帆、弗拉季斯拉夫·戈利亚尼克、克菲尔·阿伯曼、乔纳森·T·巴伦、阿米特·H·贝尔马诺、埃里克·瑞安·陈、塔莉·德凯尔、亚历山大·霍林斯基、安朱·卡纳扎瓦、C·卡伦·刘、刘凌杰、本·米尔登霍尔、马蒂亚斯·尼瑟纳、比约恩·奥默、克里斯蒂安·西奥巴尔特、彼得·翁卡、戈登·韦茨施泰因\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07204)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-27-blue.svg?paper=6487ec82f6d8082a5b402a5416ea03009acb1679)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002F6487ec82f6d8082a5b402a5416ea03009acb1679)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCroitoruAlin\u002FDiffusion-Models-in-Vision-A-Survey.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FCroitoruAlin\u002FDiffusion-Models-in-Vision-A-Survey)\n\n+ **视觉领域的扩散模型：综述**（2022年9月10日）\u003Cdetails>\u003Csummary>[TPAMI 2023] 弗洛里内尔-阿林·克罗伊托鲁、弗拉德·洪德鲁、拉杜·图多尔·伊奥内斯库等\u003C\u002Fsummary>弗洛里内尔-阿林·克罗伊托鲁、弗拉德·洪德鲁、拉杜·图多尔·伊奥内斯库、穆巴拉克·沙赫\u003C\u002Fdetails>[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.04747)\n[![引用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcitation-371-blue.svg?paper=efa1647594b236361610a20d507127f0586a379b)](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002Fefa1647594b236361610a20d507127f0586a379b)\n[![代码](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCroitoruAlin\u002FDiffusion-Models-in-Vision-A-Survey.svg?style=social&label=Star)](https:\u002F\u002Fgithub.com\u002FCroitoruAlin\u002FDiffusion-Models-in-Vision-A-Survey)\n\n\n\n\n# 👨‍💻 团队\n\n以下是本仓库各模态贡献者的名单。\n\n| 模态\u002F任务                      |  贡献者                                                 
|\n| ----------------------------- | -------------------------------------------------------------------- |\n| 图像生成 | 陈景业、迟晓伟、何英青                                       |\n| 视频生成 | 何英青、迟晓伟、陈景业           |\n| 图像与视频编辑           | 邢雅周 |\n| 3D生成与编辑            | 刘宏宇 |\n| 音频生成与编辑         | 田泽越、袁睿彬                            |\n| LLM智能体           | 刘兆阳                         |\n| 安全           | 刘润涛                         |\n| 负责人               | 何英青、刘兆阳                                                   |\n\n# 😉 引用\n如果您在研究中使用了本工作，请按以下格式引用论文：\n```bib\n@article{he2024llms,\n    title={LLMs Meet Multimodal Generation and Editing: A Survey},\n    author={He, Yingqing and Liu, Zhaoyang and Chen, Jingye and Tian, Zeyue and Liu, Hongyu and Chi, Xiaowei and Liu, Runtao and Yuan, Ruibin and Xing, Yazhou and Wang, Wenhai and Dai, Jifeng and Zhang, Yong and Xue, Wei and Liu, Qifeng and Guo, Yike and Chen, Qifeng},\n    journal={arXiv preprint arXiv:2405.19334},\n    year={2024},\n}\n```\n\n# ⭐️ 星标历史\n\n[![星标历史图表](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FYingqingHe_Awesome-LLMs-meet-Multimodal-Generation_readme_35f851bb94d9.png)](https:\u002F\u002Fstar-history.com\u002F#YingqingHe\u002FAwesome-LLMs-meet-Multimodal-Generation&Date)","# Awesome-LLMs-meet-Multimodal-Generation 快速上手指南\n\n本仓库并非单一的可安装软件包，而是一个**精选的学术论文与开源项目列表**，涵盖了大语言模型（LLM）与多模态生成（图像、视频、3D、音频）及编辑领域的前沿研究。本指南将帮助您快速浏览资源、检索目标论文并运行相关代码。\n\n## 环境准备\n\n由于本仓库包含多个独立的科研项目，每个项目的环境要求各不相同。在开始之前，请确保您的开发环境满足以下通用基础要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04\u002F22.04) 或 macOS。\n*   **Python**: 建议版本 3.8 或更高（具体取决于所选子项目）。\n*   **GPU**: 大多数生成式模型需要 NVIDIA GPU 支持，建议显存 16GB 以上以运行较新的视频或高分辨率图像模型。\n*   **基础依赖**:\n    *   Git (用于克隆仓库)\n    *   CUDA Toolkit (需与 PyTorch 版本匹配)\n    *   PyTorch \u002F TensorFlow (根据具体项目选择)\n\n## 安装步骤\n\n本仓库本身无需通过 `pip` 安装，只需克隆到本地即可作为资源索引使用。若您想运行列表中某个具体的模型（例如 `Show-o` 或 `ThinkDiff`），需进入该项目对应的子链接进行独立安装。\n\n### 1. 克隆本资源仓库\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FYingqingHe\u002FAwesome-LLMs-meet-Multimodal-Generation.git\ncd Awesome-LLMs-meet-Multimodal-Generation\n```\n\n### 2. 安装具体项目示例\n假设您选择运行列表中的 **Show-o** 项目（统一多模态理解与生成），请参考其官方代码库进行安装：\n\n```bash\n# 进入具体项目目录（示例）\ngit clone https:\u002F\u002Fgithub.com\u002Fshowlab\u002FShow-o.git\ncd Show-o\n\n# 创建虚拟环境\nconda create -n showo python=3.10\nconda activate showo\n\n# 安装依赖 (具体命令请以该项目 README 为准)\npip install -r requirements.txt\n```\n\n> **💡 提示**：列表中每个项目都有独立的 `[Code]` 链接，请点击对应项目的 GitHub 页面获取最准确的安装指令。\n\n## 基本使用\n\n本仓库的主要用途是**文献检索**与**项目发现**。以下是四种高效使用本指南的方法：\n\n### 1. 通过目录浏览研究领域\n直接点击仓库 README 中的目录链接，快速定位到您感兴趣的任务类别：\n*   **多模态生成 (Multimodal Generation)**: 包含图像、视频、3D、音频生成。\n    *   细分领域：`LLM-based` (基于大模型) 或 `Non-LLM-based` (基于 CLIP\u002FT5 等)。\n*   **多模态编辑 (Multimodal Editing)**: 针对现有内容的修改。\n*   **多模态智能体 (Multimodal Agents)**: 代理任务。\n*   **安全性 (Safety)**: 攻击、防御与对齐。\n\n### 2. 通过作者姓名检索论文\n如果您想查找特定学者（例如 \"Yann LeCun\" 或 \"Song Han\"）的相关工作：\n1.  在浏览器或 PDF 阅读器中按下 `Ctrl + F` (Windows\u002FLinux) 或 `Cmd + F` (macOS)。\n2.  输入作者姓名。\n3.  列表会自动高亮并展开包含该作者的所有论文条目。\n\n### 3. 通过标签筛选特定技术\n利用仓库支持的标签功能快速过滤技术点。在搜索框中输入以下标签：\n*   `tokenizer`: 查找关于神经分词器（如 Cosmos Tokenizer, ElasticTok）的研究。\n*   `customization`: 查找个性化生成相关论文。\n*   `interactive`: 查找交互式生成内容。\n*   `human motion generation`: 查找人体动作生成相关研究。\n\n### 4. 运行示例代码\n一旦找到感兴趣的项目（例如 **MetaMorph** 或 **VILA-U**）：\n1.  点击条目下的 **[Code]** 徽章跳转至 GitHub。\n2.  按照该项目的 `README.md` 下载预训练权重。\n3.  
运行推理脚本，通常格式如下（以伪代码为例）：\n    ```bash\n    python inference.py --prompt \"A cat playing guitar\" --model_path .\u002Fcheckpoints\u002Fmodel.pth\n    ```\n\n---\n*本指南仅作为资源导航，具体模型的参数调整、训练细节及许可证限制请务必查阅各子项目的原始文档。*","某游戏工作室的技术美术团队正致力于开发一款基于玩家语音指令实时生成 3D 角色与背景音效的原型系统，急需整合最新的多模态生成技术。\n\n### 没有 Awesome-LLMs-meet-Multimodal-Generation 时\n- **文献检索如大海捞针**：团队需要在 arXiv、GitHub 和各类会议论文集中手动筛选，难以区分哪些是真正基于 LLM 驱动的 3D 或音频生成方案，哪些仍是传统的 CLIP\u002FT5 架构。\n- **技术选型盲目试错**：由于缺乏对\"LLM-based\"与\"Non-LLM-based\"方法的清晰分类，开发人员容易误选不适合语音交互场景的模型，导致原型开发周期延长数周。\n- **多模态协同困难**：在寻找能同时处理“语音输入 + 3D 输出 + 音频反馈”的联合生成论文时，往往只能找到单一模态的研究，缺乏系统性的跨模态代理（Multimodal Agents）参考。\n- **前沿动态滞后**：无法快速定位如\"I Think, Therefore I Diffuse\"这类最新发表的关于扩散模型上下文推理的关键论文，错失提升生成逻辑性的机会。\n\n### 使用 Awesome-LLMs-meet-Multimodal-Generation 后\n- **精准定位技术路线**：利用仓库清晰的目录结构，团队直接锁定\"3D Generation\"和\"Audio Generation\"下的\"LLM-based\"板块，瞬间过滤掉非目标架构的旧方案。\n- **高效匹配具体需求**：通过 `Ctrl + F` 搜索作者或利用 `customization`、`interactive` 等标签，迅速找到了支持语音交互且具备编辑能力的 3D 生成论文，大幅缩短调研时间。\n- **构建完整生成链路**：参考\"Generation with Multiple Modalities\"章节，团队成功组合了最新的 LLM 驱动方案，实现了从语音指令到 3D 资产及配套音效的一体化生成流程。\n- **紧跟学术最前沿**：仓库每日更新的列表让团队第一时间掌握了 2025 年最新的扩散模型推理机制，将其应用于优化角色生成的逻辑一致性上。\n\nAwesome-LLMs-meet-Multimodal-Generation 将原本需要数周的碎片化文献调研压缩至数小时，为多模态创意应用提供了最权威的技术导航图。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FYingqingHe_Awesome-LLMs-meet-Multimodal-Generation_73831b14.jpg","YingqingHe","Joy","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FYingqingHe_c4de4112.jpg","Ph.D. student @ HKUST; \r\nContact: yhebm@connect.ust.hk","HKUST",null,"https:\u002F\u002Fgithub.com\u002FYingqingHe",[81],{"name":82,"color":83,"percentage":84},"HTML","#e34c26",100,546,30,"2026-04-16T07:04:55",1,"","未说明",{"notes":92,"python":90,"dependencies":93},"该仓库是一个综述（Survey）列表，汇集了关于“大语言模型与多模态生成\u002F编辑”相关的论文、代码库和项目页面链接，本身不是一个可直接运行的单一软件工具。因此，README 中未包含具体的操作系统、GPU、内存、Python 版本或依赖库的安装要求。用户若需运行列表中提到的具体项目（如 ThinkDiff, Show-o, VILA-U 等），需分别前往对应的子项目仓库查看其独立的环境配置说明。",[],[15,14,95,96,35,97],"视频","音频","其他",[99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115],"aigc","large-language-models","large-vision-language-models","multimodal-generation","multimodal-large-language-models","multimodal-models","multimodality","text-to-3d","text-to-audio","text-to-image","text-to-music","text-to-sound","text-to-speech","text-to-video","llm","lvlm","mllm","2026-03-27T02:49:30.150509","2026-04-17T10:19:24.944227",[119,124,129,134,139,144],{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},37356,"如何向该仓库推荐或添加新的学术论文？","您可以直接在 GitHub 上提交 Issue，提供论文的标题、会议\u002F期刊来源（如 CVPR, ICLR, ICML 等）、Arxiv 链接、代码仓库链接以及项目主页。维护者审核后会将其添加到仓库和综述论文中。此外，也欢迎直接提交 Pull Request 来补充相关工作。","https:\u002F\u002Fgithub.com\u002FYingqingHe\u002FAwesome-LLMs-meet-Multimodal-Generation\u002Fissues\u002F5",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},37357,"该列表是否收录实际的应用程序或非学术类项目？","目前该列表主要专注于收录学术论文。对于纯粹的实际应用程序（如旅行故事生成应用等），由于没有合适的分类，暂时难以收录。建议此类项目关注其他专门针对应用演示的列表。","https:\u002F\u002Fgithub.com\u002FYingqingHe\u002FAwesome-LLMs-meet-Multimodal-Generation\u002Fissues\u002F16",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},37358,"如果发现综述论文或图表中存在错误（如时间线标注错误），该如何反馈？","欢迎通过提交 Issue 详细指出错误所在（例如具体的图号、错误内容及正确信息）。维护者会感谢您的反馈，并在下一版本的 Arxiv 论文或仓库更新中修复该错误。","https:\u002F\u002Fgithub.com\u002FYingqingHe\u002FAwesome-LLMs-meet-Multimodal-Generation\u002Fissues\u002F15",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},37359,"提交的工作需要满足什么条件才会被收录？","提交的工作应是与“基于大语言模型的多模态生成与编辑”相关的优秀研究。通常包括顶级会议（如 CVPR, ICLR, ICML 等）的论文或高质量的 Arxiv 预印本。提交时请确保提供完整的 Arxiv 
链接和代码仓库链接以便核实。","https:\u002F\u002Fgithub.com\u002FYingqingHe\u002FAwesome-LLMs-meet-Multimodal-Generation\u002Fissues\u002F14",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},37360,"除了提交 Issue，还有其他方式贡献内容吗？","是的，除了提交 Issue 外，维护者非常鼓励用户直接提交 Pull Request (PR)。如果您发现有任何遗漏的相关工作，可以直接通过 PR 将内容添加到仓库中，这样效率更高。","https:\u002F\u002Fgithub.com\u002FYingqingHe\u002FAwesome-LLMs-meet-Multimodal-Generation\u002Fissues\u002F10",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},37361,"关于长视频生成或多模态编辑的最新论文会被收录吗？","会的。只要论文主题符合“多模态大模型与生成\u002F编辑”这一核心范畴（例如长视频生成、语言驱动的视频修复、复杂指令图像编辑等），维护者都会在收到推荐后尽快将其收录到仓库和综述中。","https:\u002F\u002Fgithub.com\u002FYingqingHe\u002FAwesome-LLMs-meet-Multimodal-Generation\u002Fissues\u002F12",[]]