[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-mini-sora--minisora":3,"tool-mini-sora--minisora":62},[4,18,26,35,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,2,"2026-04-18T11:18:24",[14,15,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":32,"last_commit_at":41,"category_tags":42,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[43,13,15,14],"插件",{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 
is particularly well suited to AI developers, researchers and computer-science students who want to go deep into the underlying principles. For practitioners who are not content with merely calling an API and want to understand how a model is actually built, it is an excellent learning resource. Its distinctive strength is the step-by-step teaching design: a complex piece of systems engineering is broken down into clear stages, accompanied by detailed diagrams and examples, so that building a small but fully functional large language model becomes genuinely attainable. Whether you want to solidify your theoretical foundations or prepare to develop larger models in the future",90106,"2026-04-06T11:19:32",[52,15,13,14],"Language Model",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam is an open-source tool focused on real-time face swapping and video generation. With just a single still photo, a user can swap the face in a live webcam feed or produce a deepfake video in essentially one click. It removes the usual pain points of face swapping, namely convoluted pipelines, steep hardware requirements and the lack of real-time preview, putting high-quality digital content creation within easy reach.\n\nThe tool suits developers and researchers exploring the limits of the algorithms, and thanks to its minimal workflow (three steps: pick a face, pick a camera, start) it is equally usable by everyday users, content creators, designers and live streamers. Whether for customizing animated characters, swapping the model in a clothing showcase, or making playful short videos and interactive live streams, Deep-Live-Cam keeps the experience smooth.\n\nIts core technical strengths are strong real-time processing, a Mouth Mask option that preserves the user's original mouth movements for natural, accurate expressions, and a face-mapping feature that can apply different faces to multiple subjects in the same frame. The project also ships with strict content-safety filtering that automatically blocks nudity, violence and other inappropriate material, and it asks users to obtain consent and clearly label generated media, balancing technical progress with ethical responsibility.",88924,"2026-04-06T03:28:53",[14,15,13,61],"Video",{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":68,"readme_en":69,"readme_zh":70,"quickstart_zh":71,"use_case_zh":72,"hero_image_url":73,"owner_login":74,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":76,"owner_url":77,"languages":78,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":98,"env_os":99,"env_gpu":100,"env_ram":99,"env_deps":101,"category_tags":108,"github_topics":109,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":113,"updated_at":114,"faqs":115,"releases":151},9247,"mini-sora\u002Fminisora","minisora","MiniSora: A community that aims to explore the implementation path and future development direction of Sora.","MiniSora is a community-organized open-source project that explores how OpenAI's Sora video generation model could be implemented and where the technology is heading. With Sora's technical details still not fully public, MiniSora aims to lower the research barrier in video generation by reproducing key papers (such as DiT) and mapping the technical lineage from DDPM to Sora.\n\nThe project addresses the difficulty ordinary developers and researchers face in accessing or reproducing state-of-the-art video generation models. It sets explicit reproduction goals: stay GPU-friendly, so that training and inference are possible on consumer cards (such as the RTX4090) or a small number of professional cards, and stay efficient, aiming to generate 3-10 second, 480p videos within a modest training budget. The community also holds regular round-table sessions that walk through frontier papers such as Stable Diffusion 3 and Movie Gen.\n\nMiniSora is a good fit for developers and researchers interested in AI video generation, as well as students who want a deeper understanding of diffusion models. Its distinctive strengths are an efficient reproduction of the DiT architecture built on the XTuner framework, backed by technical support and compute resources from core developers. By joining MiniSora you gain a systematic technical survey and work with like-minded peers to advance open-source video generation.","MiniSora is a community-organized open-source project that explores how OpenAI's Sora video generation model could be implemented and where the technology is heading. With Sora's technical details still not fully public, MiniSora aims to lower the research barrier in video generation by reproducing key papers (such as DiT) and mapping the technical lineage from DDPM to Sora.\n\nThe project addresses the difficulty ordinary developers and researchers face in accessing or reproducing state-of-the-art video generation models. It sets explicit reproduction goals: stay GPU-friendly, so that training and inference are possible on consumer cards (such as the RTX4090) or a small number of professional cards, and stay efficient, aiming to generate 3-10 second, 480p videos within a modest training budget. The community also holds regular round-table sessions that walk through frontier papers such as Stable Diffusion 3 and Movie Gen.\n\nMiniSora is a good fit for developers and researchers interested in AI video generation, as well as students who want a deeper understanding of diffusion models. Its distinctive strengths are an efficient reproduction of the DiT architecture built on the XTuner framework, backed by technical support and compute resources from core developers. By joining MiniSora you gain a systematic technical survey and work with like-minded peers to advance open-source video generation.","# MiniSora Community\n\n\u003C!-- PROJECT SHIELDS -->\n\n[![Contributors][contributors-shield]][contributors-url]\n[![Forks][forks-shield]][forks-url]\n[![Issues][issues-shield]][issues-url]\n[![MIT License][license-shield]][license-url]\n[![Stargazers][stars-shield]][stars-url]\n\u003Cbr \u002F>\n\n\u003Cdiv align=\"center\">\n\u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F8252\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_1145cd82417e.png\" alt=\"mini-sora%2Fminisora | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003C!-- PROJECT LOGO -->\n\u003Cdiv align=\"center\">\n\u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_4c0382e96db7.jpg\" width=\"600\"\u002F>\n\n  \u003Cdiv>&nbsp;\u003C\u002Fdiv>\n  \u003Cdiv align=\"center\">\n  \u003C\u002Fdiv>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\nEnglish | [简体中文](README_zh-CN.md)\n\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n    👋 Join us on \u003Ca href=\"https:\u002F\u002Fcdn.vansin.top\u002Fminisora.jpg\" target=\"_blank\">WeChat\u003C\u002Fa>\n\u003C\u002Fp>\n\nThe MiniSora open-source community is a community-driven initiative organized spontaneously by its members. The MiniSora community aims to explore the implementation path and future development direction of Sora.\n\n- Regular round-table discussions will be held with the Sora team and the community to explore possibilities.\n- We will delve into existing technological pathways for video generation.\n- We will lead the replication of papers or research results related to Sora, such as DiT ([MiniSora-DiT](https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora-DiT)), etc.\n- We will conduct a comprehensive review of Sora-related technologies and their implementations, i.e., \"**From DDPM to Sora: A Review of Video Generation Models Based on Diffusion Models**\".\n\n## Hot News\n\n- [OpenAI Sora](https:\u002F\u002Fopenai.com\u002Findex\u002Fsora-system-card\u002F) is coming out!\n- [**Movie Gen**: A Cast of Media Foundation Models](https:\u002F\u002Fai.meta.com\u002Fstatic-resource\u002Fmovie-gen-research-paper)\n- [**Stable Diffusion 3**: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https:\u002F\u002Fstability.ai\u002Fnews\u002Fstable-diffusion-3-research-paper)\n- [**MiniSora-DiT**](..\u002Fminisora-DiT\u002FREADME.md): Reproducing the DiT Paper with XTuner\n- [**Introduction of MiniSora and Latest Progress in Replicating Sora**](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_7bbc71f17e0e.png)\n\n![Introduction of MiniSora and Latest Progress in Replicating Sora](.\u002Fdocs\u002FMinisora_LPRS\u002F0001.jpg)\n\n## [Reproduction Group of MiniSora Community](.\u002Fcodes\u002FREADME.md)\n\n### Sora Reproduction Goals of MiniSora\n\n1. **GPU-Friendly**: Ideally, it should have low requirements for GPU memory size and the number of GPUs, such as being able to train and run inference with compute power like 8 A100 80G cards, 8 A6000 48G cards, or an RTX4090 24G.\n2. **Training-Efficiency**: It should achieve good results without requiring extensive training time.\n3. **Inference-Efficiency**: When generating videos during inference, there is no need for long duration or high resolution; acceptable parameters are 3-10 seconds in length and 480p resolution.\n\n### [MiniSora-DiT](https:\u002F\u002Fgithub.com\u002Fmini-sora\u002FMiniSora-DiT): Reproducing the DiT Paper with XTuner\n\n[https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora-DiT](https:\u002F\u002Fgithub.com\u002Fmini-sora\u002FMiniSora-DiT)\n\n#### Requirements\n\nWe are recruiting MiniSora Community contributors to reproduce `DiT` using [XTuner](https:\u002F\u002Fgithub.com\u002FinternLM\u002Fxtuner).\n\nWe hope community members have the following characteristics:\n\n1. Familiarity with the `OpenMMLab MMEngine` mechanism.\n2. Familiarity with `DiT`.\n\n#### Background\n\n1. The author of `DiT` is the same as the author of `Sora`.\n2. 
[XTuner](https:\u002F\u002Fgithub.com\u002FinternLM\u002Fxtuner) has the core technology to efficiently train sequences of length `1000K`.\n\n#### Support\n\n1. Computational resources: 2*A100.\n2. Strong support from [XTuner](https:\u002F\u002Fgithub.com\u002FinternLM\u002Fxtuner) core developer [P佬@pppppM](https:\u002F\u002Fgithub.com\u002FpppppM).\n\n## Recent Round-table Discussions\n\n### Interpretation of the Stable Diffusion 3 Paper: MM-DiT\n\n**Speaker**: MMagic Core Contributors\n\n**Live Streaming Time**: 03\u002F12 20:00\n\n**Highlights**: MMagic core contributors will lead us in interpreting the Stable Diffusion 3 paper, discussing the architecture details and design principles of Stable Diffusion 3.\n\n**PPT**: [FeiShu Link](https:\u002F\u002Faicarrier.feishu.cn\u002Ffile\u002FNXnTbo5eqo8xNYxeHnecjLdJnQq)\n\n\u003C!-- Please scan the QR code with WeChat to book a live video session.\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_bca6e068793f.png\" width=\"100\"\u002F>\n\n  \u003Cdiv>&nbsp;\u003C\u002Fdiv>\n  \u003Cdiv align=\"center\">\n  \u003C\u002Fdiv>\n\u003C\u002Fdiv> -->\n\n### Highlights from Previous Discussions\n\n#### [**Night Talk with Sora: Video Diffusion Overview**](https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fblob\u002Fmain\u002Fnotes\u002FREADME.md)\n\n**ZhiHu Notes**: [A Survey on Generative Diffusion Model: An Overview of Generative Diffusion Models](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F684795460)\n\n## [Paper Reading Program](.\u002Fnotes\u002FREADME.md)\n\n- [**Sora**: Creating video from text](https:\u002F\u002Fopenai.com\u002Fsora)\n- **Technical Report**: [Video generation models as world simulators](https:\u002F\u002Fopenai.com\u002Fresearch\u002Fvideo-generation-models-as-world-simulators)\n- **Latte**: [Latte: Latent Diffusion Transformer for Video Generation](https:\u002F\u002Fmaxin-cn.github.io\u002Flatte_project\u002F)\n  - [Latte Paper Interpretation (zh-CN)](.\u002Fnotes\u002FLatte.md), [ZhiHu(zh-CN)](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F686407292)\n- **DiT**: [Scalable Diffusion Models with Transformers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09748)\n- **Stable Cascade (ICLR 24 Paper)**: [Würstchen: An efficient architecture for large-scale text-to-image diffusion models](https:\u002F\u002Fopenreview.net\u002Fforum?id=gU58d5QeGv)\n- [**Stable Diffusion 3**: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https:\u002F\u002Fstability.ai\u002Fnews\u002Fstable-diffusion-3-research-paper)\n  - [SD3 Paper Interpretation (zh-CN)](.\u002Fnotes\u002FSD3_zh-CN.md), [ZhiHu(zh-CN)](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F686273242)\n\n- Updating...\n\n### Recruitment of Presenters\n\n- [**DiT** (ICCV 23 Paper)](https:\u002F\u002Fgithub.com\u002Forgs\u002Fmini-sora\u002Fdiscussions\u002F39)\n- [**Stable Cascade** (ICLR 24 Paper)](https:\u002F\u002Fgithub.com\u002Forgs\u002Fmini-sora\u002Fdiscussions\u002F145)\n\n## Related Work\n\n- 01 [Diffusion Model](#diffusion-models)\n- 02 [Diffusion Transformer](#diffusion-transformer)\n- 03 [Baseline Video Generation Models](#baseline-video-generation-models)\n- 04 [Diffusion UNet](#diffusion-unet)\n- 05 [Video Generation](#video-generation)\n- 06 [Dataset](#dataset)\n  - 6.1 [Public Datasets](#dataset_paper)\n  - 6.2 [Video Augmentation Methods](#video_aug)\n    - 6.2.1 [Basic Transformations](#video_aug_basic)\n    - 6.2.2 [Feature 
Space](#video_aug_feature)\n    - 6.2.3 [GAN-based Augmentation](#video_aug_gan)\n    - 6.2.4 [Encoder\u002FDecoder Based](#video_aug_ed)\n    - 6.2.5 [Simulation](#video_aug_simulation)\n- 07 [Patchifying Methods](#patchifying-methods)\n- 08 [Long-context](#long-context)\n- 09 [Audio Related Resource](#audio-related-resource)\n- 10 [Consistency](#consistency)\n- 11 [Prompt Engineering](#prompt-engineering)\n- 12 [Security](#security)\n- 13 [World Model](#world-model)\n- 14 [Video Compression](#video-compression)\n- 15 [Mamba](#Mamba)\n    - 15.1 [Theoretical Foundations and Model Architecture](#theoretical-foundations-and-model-architecture)\n    - 15.2 [Image Generation and Visual Applications](#image-generation-and-visual-applications)\n    - 15.3 [Video Processing and Understanding](#video-processing-and-understanding)\n    - 15.4 [Medical Image Processing](#medical-image-processing)\n- 16 [Existing high-quality resources](#existing-high-quality-resources)\n- 17 [Efficient Training](#train)\n  - 17.1 [Parallelism based Approach](#train_paral)\n    - 17.1.1 [Data Parallelism (DP)](#train_paral_dp)\n    - 17.1.2 [Model Parallelism (MP)](#train_paral_mp)\n    - 17.1.3 [Pipeline Parallelism (PP)](#train_paral_pp)\n    - 17.1.4 [Generalized Parallelism (GP)](#train_paral_gp)\n    - 17.1.5 [ZeRO Parallelism (ZP)](#train_paral_zp)\n  - 17.2 [Non-parallelism based Approach](#train_non)\n    - 17.2.1 [Reducing Activation Memory](#train_non_reduce)\n    - 17.2.2 [CPU-Offloading](#train_non_cpu)\n    - 17.2.3 [Memory Efficient Optimizer](#train_non_mem)\n  - 17.3 [Novel Structure](#train_struct)\n- 18 [Efficient Inference](#infer)\n  - 18.1 [Reduce Sampling Steps](#infer_reduce)\n    - 18.1.1 [Continuous Steps](#infer_reduce_continuous)\n    - 18.1.2 [Fast Sampling](#infer_reduce_fast)\n    - 18.1.3 [Step distillation](#infer_reduce_dist)\n  - 18.2 [Optimizing Inference](#infer_opt)\n    - 18.2.1 [Low-bit Quantization](#infer_opt_low)\n    - 18.2.2 [Parallel\u002FSparse inference](#infer_opt_ps)\n\n| \u003Ch3 id=\"diffusion-models\">01 Diffusion Models\u003C\u002Fh3> |  |\n| :------------- | :------------- |\n| **Paper** | **Link** |\n| 1) **Guided-Diffusion**: Diffusion Models Beat GANs on Image Synthesis | [**NeurIPS 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.05233), [GitHub](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fguided-diffusion)|\n| 2) **Latent Diffusion**: High-Resolution Image Synthesis with Latent Diffusion Models | [**CVPR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.10752), [GitHub](https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion) |\n| 3) **EDM**: Elucidating the Design Space of Diffusion-Based Generative Models | [**NeurIPS 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.00364), [GitHub](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fedm) |\n| 4) **DDPM**: Denoising Diffusion Probabilistic Models | [**NeurIPS 20 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.11239), [GitHub](https:\u002F\u002Fgithub.com\u002Fhojonathanho\u002Fdiffusion) |\n| 5) **DDIM**: Denoising Diffusion Implicit Models | [**ICLR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.02502), [GitHub](https:\u002F\u002Fgithub.com\u002Fermongroup\u002Fddim) |\n| 6) **Score-Based Diffusion**: Score-Based Generative Modeling through Stochastic Differential Equations | [**ICLR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.13456), [GitHub](https:\u002F\u002Fgithub.com\u002Fyang-song\u002Fscore_sde), 
[Blog](https:\u002F\u002Fyang-song.net\u002Fblog\u002F2021\u002Fscore) |\n| 7) **Stable Cascade**: Würstchen: An efficient architecture for large-scale text-to-image diffusion models | [**ICLR 24 Paper**](https:\u002F\u002Fopenreview.net\u002Fforum?id=gU58d5QeGv), [GitHub](https:\u002F\u002Fgithub.com\u002FStability-AI\u002FStableCascade), [Blog](https:\u002F\u002Fstability.ai\u002Fnews\u002Fintroducing-stable-cascade) |\n| 8) Diffusion Models in Vision: A Survey| [**TPAMI 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.13456), [GitHub](https:\u002F\u002Fgithub.com\u002FCroitoruAlin\u002FDiffusion-Models-in-Vision-A-Survey)|\n| 9) **Improved DDPM**: Improved Denoising Diffusion Probabilistic Models | [**ICML 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.09672), [Github](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fimproved-diffusion) |\n| 10) Classifier-free diffusion guidance | [**NIPS 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.12598) |\n| 11) **Glide**: Towards photorealistic image generation and editing with text-guided diffusion models | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.10741), [Github](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fglide-text2im) |\n| 12) **VQ-DDM**: Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation | [**CVPR 22 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2022\u002Fpapers\u002FHu_Global_Context_With_Discrete_Diffusion_in_Vector_Quantised_Modelling_for_CVPR_2022_paper.pdf), [Github](https:\u002F\u002Fgithub.com\u002Fanonymrelease\u002FVQ-DDM) |\n| 13) Diffusion Models for Medical Anomaly Detection | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.04306), [Github](https:\u002F\u002Fgithub.com\u002FJuliaWolleb\u002Fdiffusion-anomaly) |\n| 14) Generation of Anonymous Chest Radiographs Using Latent Diffusion Models for Training Thoracic Abnormality Classification Systems | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.01323) |\n| 15) **DiffusionDet**: Diffusion Model for Object Detection | [**ICCV 23 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fpapers\u002FChen_DiffusionDet_Diffusion_Model_for_Object_Detection_ICCV_2023_paper.pdf), [Github](https:\u002F\u002Fgithub.com\u002FShoufaChen\u002FDiffusionDet) |\n| 16) Label-efficient semantic segmentation with diffusion models | [**ICLR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.03126), [Github](https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Fddpm-segmentation), [Project](https:\u002F\u002Fyandex-research.github.io\u002Fddpm-segmentation\u002F) |\n| \u003Ch3 id=\"diffusion-transformer\">02 Diffusion Transformer\u003C\u002Fh3> | |\n| **Paper** | **Link** |\n| 1) **UViT**: All are Worth Words: A ViT Backbone for Diffusion Models | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.12152), [GitHub](https:\u002F\u002Fgithub.com\u002Fbaofff\u002FU-ViT), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels?name=UVit&page=1) |\n| 2) **DiT**: Scalable Diffusion Models with Transformers | [**ICCV 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09748), [GitHub](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FDiT), [Project](https:\u002F\u002Fwww.wpeebles.com\u002FDiT), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels?name=Dit&page=1)|\n| 3) **SiT**: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | [**ArXiv 
23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.08740), [GitHub](https:\u002F\u002Fgithub.com\u002Fwillisma\u002FSiT), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FAI-ModelScope\u002FSiT-XL-2-256\u002Fsummary) |\n| 4) **FiT**: Flexible Vision Transformer for Diffusion Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12376), [GitHub](https:\u002F\u002Fgithub.com\u002Fwhlzy\u002FFiT) |\n| 5) **k-diffusion**: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.11605v1.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fcrowsonkb\u002Fk-diffusion) |\n| 6) **Large-DiT**: Large Diffusion Transformer | [GitHub](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLLaMA2-Accessory\u002Ftree\u002Fmain\u002FLarge-DiT) |\n| 7) **VisionLLaMA**: A Unified LLaMA Interface for Vision Tasks | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.00522), [GitHub](https:\u002F\u002Fgithub.com\u002FMeituan-AutoML\u002FVisionLLaMA) |\n| 8) **Stable Diffusion 3**: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis | [**Paper**](https:\u002F\u002Fstabilityai-public-packages.s3.us-west-2.amazonaws.com\u002FStable+Diffusion+3+Paper.pdf), [Blog](https:\u002F\u002Fstability.ai\u002Fnews\u002Fstable-diffusion-3-research-paper) |\n| 9) **PIXART-Σ**: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.04692.pdf), [Project](https:\u002F\u002Fpixart-alpha.github.io\u002FPixArt-sigma-project\u002F) |\n| 10) **PIXART-α**: Fast Training of Diffusion Transformer for Photorealistic Text-To-Image Synthesis | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00426.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FPixArt-alpha\u002FPixArt-alpha) [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Faojie1997\u002Fcv_PixArt-alpha_text-to-image\u002Fsummary)|\n| 11) **PIXART-δ**: Fast and Controllable Image Generation With Latent Consistency Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.05252.pdf), |\n| 12) **Lumina-T2X**: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.05945), [GitHub](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X) |\n| 13) **DDM**: Deconstructing Denoising Diffusion Models for Self-Supervised Learning | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.14404v1)|\n| 14) Autoregressive Image Generation without Vector Quantization | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.11838), [GitHub](https:\u002F\u002Fgithub.com\u002FLTH14\u002Fmar) |\n| 15) **Transfusion**: Predict the Next Token and Diffuse Images with One Multi-Modal Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11039)|\n| 16) Scaling Diffusion Language Models via Adaptation from Autoregressive Models | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.17891)|\n| 17) Large Language Diffusion Models | [**ArXiv 25**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.09992)|\n| \u003Ch3 id=\"baseline-video-generation-models\">03 Baseline Video Generation Models\u003C\u002Fh3> | |\n| **Paper** | **Link** |\n| 1) **ViViT**: A Video Vision Transformer | [**ICCV 21 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.15691v2.pdf), 
[GitHub](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fscenic) |\n| 2) **VideoLDM**: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08818) |\n| 3) **DiT**: Scalable Diffusion Models with Transformers | [**ICCV 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09748), [Github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FDiT), [Project](https:\u002F\u002Fwww.wpeebles.com\u002FDiT), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels?name=Dit&page=1) |\n| 4) **Text2Video-Zero**: Text-to-Image Diffusion Models are Zero-Shot Video Generators | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.13439), [GitHub](https:\u002F\u002Fgithub.com\u002FPicsart-AI-Research\u002FText2Video-Zero) |\n| 5) **Latte**: Latent Diffusion Transformer for Video Generation | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.03048v1.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FVchitect\u002FLatte), [Project](https:\u002F\u002Fmaxin-cn.github.io\u002Flatte_project\u002F), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FAI-ModelScope\u002FLatte\u002Fsummary)|\n| \u003Ch3 id=\"diffusion-unet\">04 Diffusion UNet\u003C\u002Fh3> |\n| **Paper** | **Link** |\n| 1) Taming Transformers for High-Resolution Image Synthesis | [**CVPR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2012.09841.pdf),[GitHub](https:\u002F\u002Fgithub.com\u002FCompVis\u002Ftaming-transformers) ,[Project](https:\u002F\u002Fcompvis.github.io\u002Ftaming-transformers\u002F)|\n| 2) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05135) [Github](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA) |\n| \u003Ch3 id=\"video-generation\">05 Video Generation\u003C\u002Fh3> | |\n| **Paper** | **Link** |\n| 1) **Animatediff**: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.04725), [GitHub](https:\u002F\u002Fgithub.com\u002Fguoyww\u002Fanimatediff\u002F), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels?name=Animatediff&page=1) |\n| 2) **I2VGen-XL**: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.04145), [GitHub](https:\u002F\u002Fgithub.com\u002Fali-vilab\u002Fi2vgen-xl),  [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Fi2vgen-xl\u002Fsummary) |\n| 3) **Imagen Video**: High Definition Video Generation with Diffusion Models | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.02303) |\n| 4) **MoCoGAN**: Decomposing Motion and Content for Video Generation | [**CVPR 18 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.04993) |\n| 5) Adversarial Video Generation on Complex Datasets | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1907.06571) |\n| 6) **W.A.L.T**: Photorealistic Video Generation with Diffusion Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06662), [Project](https:\u002F\u002Fwalt-video-diffusion.github.io\u002F) |\n| 7) **VideoGPT**: Video Generation using VQ-VAE and Transformers | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.10157), [GitHub](https:\u002F\u002Fgithub.com\u002Fwilson1yan\u002FVideoGPT) |\n| 8) Video Diffusion Models | [**ArXiv 
22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.03458), [GitHub](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fvideo-diffusion-pytorch), [Project](https:\u002F\u002Fvideo-diffusion.github.io\u002F) |\n| 9) **MCVD**: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | [**NeurIPS 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.09853), [GitHub](https:\u002F\u002Fgithub.com\u002Fvoletiv\u002Fmcvd-pytorch), [Project](https:\u002F\u002Fmask-cond-video-diffusion.github.io\u002F), [Blog](https:\u002F\u002Fajolicoeur.ca\u002F2022\u002F05\u002F22\u002Fmasked-conditional-video-diffusion\u002F) |\n| 10) **VideoPoet**: A Large Language Model for Zero-Shot Video Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14125), [Project](http:\u002F\u002Fsites.research.google\u002Fvideopoet\u002F), [Blog](https:\u002F\u002Fblog.research.google\u002F2023\u002F12\u002Fvideopoet-large-language-model-for-zero.html) |\n| 11) **MAGVIT**: Masked Generative Video Transformer | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.05199), [GitHub](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fmagvit), [Project](https:\u002F\u002Fmagvit.cs.cmu.edu\u002F), [Colab](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fmagvit\u002Fblob\u002Fmain) |\n| 12) **EMO**: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17485), [GitHub](https:\u002F\u002Fgithub.com\u002FHumanAIGC\u002FEMO), [Project](https:\u002F\u002Fhumanaigc.github.io\u002Femote-portrait-alive\u002F) |\n| 13) **SimDA**: Simple Diffusion Adapter for Efficient Video Generation | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.09710.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FChenHsing\u002FSimDA), [Project](https:\u002F\u002Fchenhsing.github.io\u002FSimDA\u002F) |\n| 14) **StableVideo**: Text-driven Consistency-aware Diffusion Video Editing | [**ICCV 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.09592), [GitHub](https:\u002F\u002Fgithub.com\u002Frese1f\u002FStableVideo), [Project](https:\u002F\u002Frese1f.github.io\u002FStableVideo\u002F) |\n| 15) **SVD**: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets| [**Paper**](https:\u002F\u002Fstatic1.squarespace.com\u002Fstatic\u002F6213c340453c3f502425776e\u002Ft\u002F655ce779b9d47d342a93c890\u002F1700587395994\u002Fstable_video_diffusion.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FStability-AI\u002Fgenerative-models)|\n| 16) **ADD**: Adversarial Diffusion Distillation| [**Paper**](https:\u002F\u002Fstatic1.squarespace.com\u002Fstatic\u002F6213c340453c3f502425776e\u002Ft\u002F65663480a92fba51d0e1023f\u002F1701197769659\u002Fadversarial_diffusion_distillation.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FStability-AI\u002Fgenerative-models) |\n| 17) **GenTron:** Diffusion Transformers for Image and Video Generation | [**CVPR 24 Paper**](http:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04557), [Project](https:\u002F\u002Fwww.shoufachen.com\u002Fgentron_website\u002F)|\n| 18) **LFDM**: Conditional Image-to-Video Generation with Latent Flow Diffusion Models | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.13744), [GitHub](https:\u002F\u002Fgithub.com\u002Fnihaomiao\u002FCVPR23_LFDM) |\n| 19) **MotionDirector**: Motion Customization of Text-to-Video Diffusion Models | [**ArXiv 
23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08465), [GitHub](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FMotionDirector) |\n| 20) **TGAN-ODE**: Latent Neural Differential Equations for Video Generation | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2011.03864v3.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FZasder3\u002FLatent-Neural-Differential-Equations-for-Video-Generation) |\n| 21) **VideoCrafter1**: Open Diffusion Models for High-Quality Video Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.19512), [GitHub](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FVideoCrafter) |\n| 22) **VideoCrafter2**: Overcoming Data Limitations for High-Quality Video Diffusion Models | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09047), [GitHub](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FVideoCrafter) |\n| 23) **LVDM**: Latent Video Diffusion Models for High-Fidelity Long Video Generation | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.13221), [GitHub](https:\u002F\u002Fgithub.com\u002FYingqingHe\u002FLVDM) |\n| 24) **LaVie**: High-Quality Video Generation with Cascaded Latent Diffusion Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15103), [GitHub](https:\u002F\u002Fgithub.com\u002FVchitect\u002FLaVie), [Project](https:\u002F\u002Fvchitect.github.io\u002FLaVie-project\u002F) |\n| 25) **PYoCo**: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | [**ICCV 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10474), [Project](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fdir\u002Fpyoco\u002F)|\n| 26) **VideoFusion**: Decomposed Diffusion Models for High-Quality Video Generation | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.08320)|\n| 27) **Movie Gen**: A Cast of Media Foundation Models | [**Paper**](https:\u002F\u002Fai.meta.com\u002Fstatic-resource\u002Fmovie-gen-research-paper), [Project](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fmovie-gen\u002F)|\n| 28) Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model| [**ArXiv 25**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.10248), [Project](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStep-Video-T2V)|\n| \u003Ch3 id=\"dataset\">06 Dataset\u003C\u002Fh3> | |\n| \u003Ch4 id=\"dataset_paper\">6.1 Public Datasets\u003C\u002Fh4> | |\n| **Dataset Name - Paper** | **Link** |\n| 1) **Panda-70M** - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers\u003Cbr>\u003Csmall>`70M Clips, 720P, Downloadable`\u003C\u002Fsmall>|[**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.19479), [Github](https:\u002F\u002Fgithub.com\u002Fsnap-research\u002FPanda-70M), [Project](https:\u002F\u002Fsnap-research.github.io\u002FPanda-70M\u002F), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002FAI-ModelScope\u002Fpanda-70m\u002Fsummary)|\n| 2) **InternVid-10M** - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation\u003Cbr>\u003Csmall>`10M Clips, 720P, Downloadable`\u003C\u002Fsmall>|[**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06942), [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVideo\u002Ftree\u002Fmain\u002FData\u002FInternVid)|\n| 3) **CelebV-Text** - CelebV-Text: A Large-Scale Facial Text-Video Dataset\u003Cbr>\u003Csmall>`70K Clips, 720P, Downloadable`\u003C\u002Fsmall>|[**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.14717), 
[Github](https:\u002F\u002Fgithub.com\u002Fcelebv-text\u002FCelebV-Text), [Project](https:\u002F\u002Fcelebv-text.github.io\u002F)|\n| 4) **HD-VG-130M** - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation\u003Cbr>\u003Csmall> `130M Clips, 720P, Downloadable`\u003C\u002Fsmall>|[**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10874), [Github](https:\u002F\u002Fgithub.com\u002Fdaooshee\u002FHD-VG-130M), [Tool](https:\u002F\u002Fgithub.com\u002FBreakthrough\u002FPySceneDetect)|\n| 5) **HD-VILA-100M** - Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions\u003Cbr>\u003Csmall> `100M Clips, 720P, Downloadable`\u003C\u002Fsmall>|[**CVPR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.10337), [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FXPretrain\u002Fblob\u002Fmain\u002Fhd-vila-100m\u002FREADME.md)|\n| 6) **VideoCC** - Learning Audio-Video Modalities from Image Captions\u003Cbr>\u003Csmall>`10.3M Clips, 720P, Downloadable`\u003C\u002Fsmall>|[**ECCV 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.00679), [Github](https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002FvideoCC-data)|\n| 7) **YT-Temporal-180M** - MERLOT: Multimodal Neural Script Knowledge Models\u003Cbr>\u003Csmall>`180M Clips, 480P, Downloadable`\u003C\u002Fsmall>| [**NeurIPS 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.02636), [Github](https:\u002F\u002Fgithub.com\u002Frowanz\u002Fmerlot), [Project](https:\u002F\u002Frowanzellers.com\u002Fmerlot\u002F#data)|\n| 8) **HowTo100M** - HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips\u003Cbr>\u003Csmall>`136M Clips, 240P, Downloadable`\u003C\u002Fsmall>| [**ICCV 19 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.03327), [Github](https:\u002F\u002Fgithub.com\u002Fantoine77340\u002Fhowto100m), [Project](https:\u002F\u002Fwww.di.ens.fr\u002Fwillow\u002Fresearch\u002Fhowto100m\u002F)|\n| 9) **UCF101** - UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild\u003Cbr>\u003Csmall>`13K Clips, 240P, Downloadable`\u003C\u002Fsmall>| [**CVPR 12 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1212.0402), [Project](https:\u002F\u002Fwww.crcv.ucf.edu\u002Fdata\u002FUCF101.php)|\n| 10) **MSVD** - Collecting Highly Parallel Data for Paraphrase Evaluation\u003Cbr>\u003Csmall>`122K Clips, 240P, Downloadable`\u003C\u002Fsmall> | [**ACL 11 Paper**](https:\u002F\u002Faclanthology.org\u002FP11-1020.pdf), [Project](https:\u002F\u002Fwww.cs.utexas.edu\u002Fusers\u002Fml\u002Fclamp\u002FvideoDescription\u002F)|\n| 11) **Fashion-Text2Video** - A human video dataset with rich label and text annotations\u003Cbr>\u003Csmall>`600 Videos, 480P, Downloadable`\u003C\u002Fsmall> | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.08483.pdf), [Project](https:\u002F\u002Fyumingj.github.io\u002Fprojects\u002FText2Performer.html) |\n| 12) **LAION-5B** - A dataset of 5,85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M\u003Cbr>\u003Csmall>`5B Clips, Downloadable`\u003C\u002Fsmall> | [**NeurIPS 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.08402), [Project](https:\u002F\u002Flaion.ai\u002Fblog\u002Flaion-5b\u002F)|\n| 13) **ActivityNet Captions** -  ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time\u003Cbr>\u003Csmall>`20k videos, Downloadable`\u003C\u002Fsmall> | [**Arxiv 17 
Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.00754), [Project](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Franjaykrishna\u002Fdensevid\u002F)|\n| 14) **MSR-VTT** -  A large-scale video benchmark for video understanding\u003Cbr>\u003Csmall>`10k Clips, Downloadable`\u003C\u002Fsmall> | [**CVPR 16 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7780940), [Project](https:\u002F\u002Fcove.thecvf.com\u002Fdatasets\u002F839)|\n| 15) **The Cityscapes Dataset** -  Benchmark suite and evaluation server for pixel-level, instance-level, and panoptic semantic labeling\u003Cbr>\u003Csmall>`Downloadable`\u003C\u002Fsmall> | [**Arxiv 16 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1608.02192v1.pdf), [Project](https:\u002F\u002Fwww.cityscapes-dataset.com\u002F)|\n| 16) **Youku-mPLUG** -  First open-source large-scale Chinese video text dataset\u003Cbr>\u003Csmall>`Downloadable`\u003C\u002Fsmall> | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.04362), [Project](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FYouku-mPLUG), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002Fmodelscope\u002FYouku-AliceMind\u002Fsummary) |\n| 17) **VidProM** - VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models\u003Cbr>\u003Csmall>`6.69M, Downloadable`\u003C\u002Fsmall>| [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.06098), [Github](https:\u002F\u002Fgithub.com\u002FWangWenhao0716\u002FVidProM) |\n| 18) **Pixabay100** - A video dataset collected from Pixabay\u003Cbr>\u003Csmall>`Downloadable`\u003C\u002Fsmall>| [Github](https:\u002F\u002Fgithub.com\u002FECNU-CILAB\u002FPixabay100\u002F) |\n| 19) **WebVid** -  Large-scale text-video dataset, containing 10 million video-text pairs scraped from the stock footage sites\u003Cbr>\u003Csmall>`Long Durations and Structured Captions`\u003C\u002Fsmall> | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.00650), [Project](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fresearch\u002Ffrozen-in-time\u002F) , [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002FAI-ModelScope\u002Fwebvid-10M\u002Fsummary)|\n| 20) **MiraData(Mini-Sora Data)**: A Large-Scale Video Dataset with Long Durations and Structured Captions\u003Cbr>\u003Csmall>`10M video-text pairs`\u003C\u002Fsmall> | [Github](https:\u002F\u002Fgithub.com\u002Fmira-space\u002FMiraData), [Project](https:\u002F\u002Fmira-space.github.io\u002F) |\n| 21) **IDForge**: A video dataset featuring scenes of people speaking.\u003Cbr>\u003Csmall>`300k Clips, Downloadable`\u003C\u002Fsmall> | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.11764), [Github](https:\u002F\u002Fgithub.com\u002Fxyyandxyy\u002FIDForge)  |\n| \u003Ch4 id=\"video_aug\">6.2 Video Augmentation Methods\u003C\u002Fh4> |  |\n| \u003Ch5 id=\"video_aug_basic\">6.2.1 Basic Transformations\u003C\u002Fh5> | |\n| Three-stream CNNs for action recognition | [**PRL 17 Paper**](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS0167865517301071) |\n| Dynamic Hand Gesture Recognition Using Multi-direction 3D Convolutional Neural Networks | [**EL 19 Paper**](http:\u002F\u002Fwww.engineeringletters.com\u002Fissues_v27\u002Fissue_3\u002FEL_27_3_12.pdf)|\n| Intra-clip Aggregation for Video Person Re-identification | [**ICIP 20 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1905.01722)|\n| VideoMix: Rethinking Data Augmentation for Video Classification | [**CVPR 20 
Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.03457) |\n| mixup: Beyond Empirical Risk Minimization | [**ICLR 17 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.09412) |\n| CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features | [**ICCV 19 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_ICCV_2019\u002Fhtml\u002FYun_CutMix_Regularization_Strategy_to_Train_Strong_Classifiers_With_Localizable_Features_ICCV_2019_paper.html) |\n| Video Salient Object Detection via Fully Convolutional Networks | [**ICIP 18 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8047320) |\n| Illumination-Based Data Augmentation for Robust Background Subtraction | [**SKIMA 19 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8982527) |\n| Image editing-based data augmentation for illumination-insensitive background subtraction | [**EIM 20 Paper**](https:\u002F\u002Fwww.emerald.com\u002Finsight\u002Fcontent\u002Fdoi\u002F10.1108\u002FJEIM-02-2020-0042\u002Ffull\u002Fhtml) |\n| \u003Ch5 id=\"video_aug_feature\">6.2.2 Feature Space\u003C\u002Fh5> | |\n| Feature Re-Learning with Data Augmentation for Content-based Video Recommendation | [**ACM 18 Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3240508.3266441) |\n| GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | [**Trans 21 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9147027) |\n| \u003Ch5 id=\"video_aug_gan\">6.2.3 GAN-based Augmentation\u003C\u002Fh5> | |\n| Deep Video-Based Performance Cloning | [**CVPR 18 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1808.06847) |\n| Adversarial Action Data Augmentation for Similar Gesture Action Recognition | [**IJCNN 19 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8851993) |\n| Self-Paced Video Data Augmentation by Generative Adversarial Networks with Insufficient Samples | [**MM 20 Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3394171.3414003) |\n| GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | [**Trans 20 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9147027) |\n| Dynamic Facial Expression Generation on Hilbert Hypersphere With Conditional Wasserstein Generative Adversarial Nets | [**TPAMI 20 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9117185) |\n| CrowdGAN: Identity-Free Interactive Crowd Video Generation and Beyond | [**TPAMI 22 Paper**](https:\u002F\u002Fwww.computer.org\u002Fcsdl\u002Fjournal\u002Ftp\u002F5555\u002F01\u002F09286483\u002F1por0TYwZvG) |\n| \u003Ch5 id=\"video_aug_ed\">6.2.4 Encoder\u002FDecoder Based\u003C\u002Fh5> | |\n| Rotationally-Temporally Consistent Novel View Synthesis of Human Performance Video | [**ECCV 20 Paper**](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-030-58548-8_23) |\n| Autoencoder-based Data Augmentation for Deepfake Detection | [**ACM 23 Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3592572.3592840) |\n| \u003Ch5 id=\"video_aug_simulation\">6.2.5 Simulation\u003C\u002Fh5> | |\n| A data augmentation methodology for training machine\u002Fdeep learning gait recognition algorithms | [**CVPR 16 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.07570) |\n| ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare 
Applications | [**IEEE 21 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9324837) |\n| Mid-Air: A Multi-Modal Dataset for Extremely Low Altitude Drone Flights | [**CVPR 19 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_CVPRW_2019\u002Fhtml\u002FUAVision\u002FFonder_Mid-Air_A_Multi-Modal_Dataset_for_Extremely_Low_Altitude_Drone_Flights_CVPRW_2019_paper.html) |\n| Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models | [**IJCV 19 Paper**](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs11263-019-01222-z) |\n| Using synthetic data for person tracking under adverse weather conditions | [**IVC 21 Paper**](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS0262885621000925) |\n| Unlimited Road-scene Synthetic Annotation (URSA) Dataset | [**ITSC 18 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8569519) |\n| SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data | [**CVPR 21 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2021\u002Fhtml\u002FHu_SAIL-VOS_3D_A_Synthetic_Dataset_and_Baselines_for_Object_Detection_CVPR_2021_paper.html) |\n| Universal Semantic Segmentation for Fisheye Urban Driving Images | [**SMC 20 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9283099) |\n| \u003Ch3 id=\"patchifying-methods\">07 Patchifying Methods\u003C\u002Fh3> | |\n| **Paper** | **Link** |\n| 1) **ViT**: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | [**CVPR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929), [Github](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fvision_transformer) |\n| 2) **MAE**: Masked Autoencoders Are Scalable Vision Learners| [**CVPR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.06377), [Github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmae) |\n| 3) **ViViT**: A Video Vision Transformer (-)| [**ICCV 21 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.15691v2.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fscenic) |\n| 4) **DiT**: Scalable Diffusion Models with Transformers (-) | [**ICCV 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09748), [GitHub](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FDiT), [Project](https:\u002F\u002Fwww.wpeebles.com\u002FDiT), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels?name=Dit&page=1)|\n| 5) **U-ViT**: All are Worth Words: A ViT Backbone for Diffusion Models (-) | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.12152), [GitHub](https:\u002F\u002Fgithub.com\u002Fbaofff\u002FU-ViT), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels?name=UVit&page=1) |\n| 6) **FlexiViT**: One Model for All Patch Sizes | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08013.pdf), [Github](https:\u002F\u002Fgithub.com\u002Fbwconrad\u002Fflexivit.git) |\n| 7) **Patch n’ Pack**: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06304), [Github](https:\u002F\u002Fgithub.com\u002Fkyegomez\u002FNaViT) |\n| 8) **VQ-VAE**: Neural Discrete Representation Learning | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1711.00937), [Github](https:\u002F\u002Fgithub.com\u002FMishaLaskin\u002Fvqvae) |\n| 9) **VQ-GAN**: Neural Discrete Representation Learning 
| [**CVPR 21 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2021\u002Fhtml\u002FEsser_Taming_Transformers_for_High-Resolution_Image_Synthesis_CVPR_2021_paper.html), [Github](https:\u002F\u002Fgithub.com\u002FCompVis\u002Ftaming-transformers) |\n| 10) **LVT**: Latent Video Transformer | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.10704), [Github](https:\u002F\u002Fgithub.com\u002Frakhimovv\u002Flvt) |\n| 11) **VideoGPT**: Video Generation using VQ-VAE and Transformers (-) | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.10157), [GitHub](https:\u002F\u002Fgithub.com\u002Fwilson1yan\u002FVideoGPT) |\n| 12) Predicting Video with VQVAE | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.01950) |\n| 13) **CogVideo**: Large-scale Pretraining for Text-to-Video Generation via Transformers | [**ICLR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15868.pdf), [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVideo.git) |\n| 14) **TATS**: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | [**ECCV 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.03638), [Github](https:\u002F\u002Fbnucsy.github.io\u002FTATS\u002F) |\n| 15) **MAGVIT**: Masked Generative Video Transformer (-) | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.05199), [GitHub](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fmagvit), [Project](https:\u002F\u002Fmagvit.cs.cmu.edu\u002F), [Colab](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fmagvit\u002Fblob\u002Fmain) |\n| 16) **MagViT2**: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.05737.pdf), [Github](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fmagvit2-pytorch) |\n| 17) **VideoPoet**: A Large Language Model for Zero-Shot Video Generation (-) | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14125), [Project](http:\u002F\u002Fsites.research.google\u002Fvideopoet\u002F), [Blog](https:\u002F\u002Fblog.research.google\u002F2023\u002F12\u002Fvideopoet-large-language-model-for-zero.html) |\n| 18) **CLIP**: Learning Transferable Visual Models From Natural Language Supervision | [**CVPR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929), [Github](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP) |\n| 19) **BLIP**: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.12086), [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FBLIP) |\n| 20) **BLIP-2**: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.12597), [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fblip2) |\n| \u003Ch3 id=\"long-context\">08 Long-context\u003C\u002Fh3> | |\n| **Paper** | **Link** |\n| 1) World Model on Million-Length Video And Language With RingAttention | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.08268), [GitHub](https:\u002F\u002Fgithub.com\u002FLargeWorldModel\u002FLWM) |\n| 2) Ring Attention with Blockwise Transformers for Near-Infinite Context | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01889), [GitHub](https:\u002F\u002Fgithub.com\u002Flhao499\u002FRingAttention) |\n| 3) Extending LLMs' Context Window with 100 
Samples | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.07004), [GitHub](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FEntropy-ABF) |\n| 4) Efficient Streaming Language Models with Attention Sinks | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17453), [GitHub](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fstreaming-llm) |\n| 5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.07872) |\n| 6) **MovieChat**: From Dense Token to Sparse Memory for Long Video Understanding | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16449), [GitHub](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat), [Project](https:\u002F\u002Frese1f.github.io\u002FMovieChat\u002F) |\n| 7) **MemoryBank**: Enhancing Large Language Models with Long-Term Memory | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.10250.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fzhongwanjun\u002FMemoryBank-SiliconFriend) |\n| \u003Ch3 id=\"audio-related-resource\">09 Audio Related Resource\u003C\u002Fh3> | |\n| **Paper**  | **Link** |\n| 1) **Stable Audio**: Fast Timing-Conditioned Latent Audio Diffusion | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.04825), [Github](https:\u002F\u002Fgithub.com\u002FStability-AI\u002Fstable-audio-tools), [Blog](https:\u002F\u002Fstability.ai\u002Fresearch\u002Fstable-audio-efficient-timing-latent-diffusion) |\n| 2) **MM-Diffusion**: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | [**CVPR 23 Paper**](http:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FRuan_MM-Diffusion_Learning_Multi-Modal_Diffusion_Models_for_Joint_Audio_and_Video_CVPR_2023_paper.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fresearchmm\u002FMM-Diffusion) |\n| 3) **Pengi**: An Audio Language Model for Audio Tasks | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002F3a2e5889b4bbef997ddb13b55d5acf77-Paper-Conference.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPengi) |\n| 4) **Vast:** A vision-audio-subtitle-text omni-modality foundation model and dataset | [**NeurlPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002Fe6b2b48b5ed90d07c305932729927781-Paper-Conference.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FTXH-mercury\u002FVAST) |\n| 5) **Macaw-LLM**: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.09093), [GitHub](https:\u002F\u002Fgithub.com\u002Flyuchenyang\u002FMacaw-LLM) |\n| 6) **NaturalSpeech**: End-to-End Text to Speech Synthesis with Human-Level Quality | [**TPAMI 24 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04421v2.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fheatz123\u002Fnaturalspeech) |\n| 7) **NaturalSpeech 2**: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.09116), [GitHub](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fnaturalspeech2-pytorch) |\n| 8) **UniAudio**: An Audio Foundation Model Toward Universal Audio Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00704), [GitHub](https:\u002F\u002Fgithub.com\u002Funiaudio666\u002FUniAudio) |\n| 9) **Diffsound**: 
Discrete Diffusion Model for Text-to-sound Generation | [**TASLP 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.09983) |\n| 10) **AudioGen**: Textually Guided Audio Generation| [**ICLR 23 Paper**](https:\u002F\u002Ficlr.cc\u002Fvirtual\u002F2023\u002Fposter\u002F11521), [Project](https:\u002F\u002Ffelixkreuk.github.io\u002Faudiogen\u002F) |\n| 11) **AudioLDM**: Text-to-audio generation with latent diffusion models | [**ICML 23 Paper**](https:\u002F\u002Fproceedings.mlr.press\u002Fv202\u002Fliu23f\u002Fliu23f.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fhaoheliu\u002FAudioLDM), [Project](https:\u002F\u002Faudioldm.github.io\u002F), [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fhaoheliu\u002Faudioldm-text-to-audio-generation) |\n| 12) **AudioLDM2**: Learning Holistic Audio Generation with Self-supervised Pretraining | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.05734), [GitHub](https:\u002F\u002Fgithub.com\u002Fhaoheliu\u002Faudioldm2), [Project](https:\u002F\u002Faudioldm.github.io\u002Faudioldm2\u002F), [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fhaoheliu\u002Faudioldm2-text2audio-text2music) |\n| 13) **Make-An-Audio**: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | [**ICML 23 Paper**](https:\u002F\u002Fproceedings.mlr.press\u002Fv202\u002Fhuang23i\u002Fhuang23i.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FText-to-Audio\u002FMake-An-Audio) |\n| 14) **Make-An-Audio 2**: Temporal-Enhanced Text-to-Audio Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18474) |\n| 15) **TANGO**: Text-to-audio generation using instruction-tuned LLM and latent diffusion model | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.13731), [GitHub](https:\u002F\u002Fgithub.com\u002Fdeclare-lab\u002Ftango), [Project](https:\u002F\u002Freplicate.com\u002Fdeclare-lab\u002Ftango), [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fdeclare-lab\u002Ftango) |\n| 16) **AudioLM**: a Language Modeling Approach to Audio Generation | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.03143) |\n| 17) **AudioGPT**: Understanding and Generating Speech, Music, Sound, and Talking Head | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.12995), [GitHub](https:\u002F\u002Fgithub.com\u002FAIGC-Audio\u002FAudioGPT) |\n| 18) **MusicGen**: Simple and Controllable Music Generation | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002F94b472a1842cd7c56dcb125fb2765fbd-Paper-Conference.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Faudiocraft) |\n| 19) **LauraGPT**: Listen, Attend, Understand, and Regenerate Audio with GPT | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.04673v3) |\n| 20) **Seeing and Hearing**: Open-domain Visual-Audio Generation with Diffusion Latent Aligners | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17723) |\n| 21) **Video-LLaMA**: An Instruction-tuned Audio-Visual Language Model for Video Understanding | [**EMNLP 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.02858) |\n| 22) Audio-Visual LLM for Video Understanding | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06720) |\n| 23) **VideoPoet**: A Large Language Model for Zero-Shot Video Generation (-) | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14125), [Project](http:\u002F\u002Fsites.research.google\u002Fvideopoet\u002F), 
[Blog](https:\u002F\u002Fblog.research.google\u002F2023\u002F12\u002Fvideopoet-large-language-model-for-zero.html) |\n| 24) **Movie Gen**: A Cast of Media Foundation Models | [**Paper**](https:\u002F\u002Fai.meta.com\u002Fstatic-resource\u002Fmovie-gen-research-paper), [Project](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fmovie-gen\u002F)|\n| \u003Ch3 id=\"consistency\">10 Consistency\u003C\u002Fh3> | |\n| **Paper**  | **Link** |\n| 1) Consistency Models | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.01469.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fconsistency_models) |\n| 2) Improved Techniques for Training Consistency Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.14189) |\n| 3) **Score-Based Diffusion**: Score-Based Generative Modeling through Stochastic Differential Equations (-) | [**ICLR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.13456), [GitHub](https:\u002F\u002Fgithub.com\u002Fyang-song\u002Fscore_sde), [Blog](https:\u002F\u002Fyang-song.net\u002Fblog\u002F2021\u002Fscore) |\n| 4) Improved Techniques for Training Score-Based Generative Models | [**NIPS 20 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2020\u002Fhash\u002F92c3b916311a5517d9290576e3ea37ad-Abstract.html), [GitHub](https:\u002F\u002Fgithub.com\u002Fermongroup\u002Fncsnv2) |\n| 4) Generative Modeling by Estimating Gradients of the Data Distribution | [**NIPS 19 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2019\u002Fhash\u002F3001ef257407d5a371a96dcd947c7d93-Abstract.html), [GitHub](https:\u002F\u002Fgithub.com\u002Fermongroup\u002Fncsn) |\n| 5) Maximum Likelihood Training of Score-Based Diffusion Models | [**NIPS 21 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2021\u002Fhash\u002F0a9fdbb17feb6ccb7ec405cfb85222c4-Abstract.html), [GitHub](https:\u002F\u002Fgithub.com\u002Fyang-song\u002Fscore_flow) |\n| 6) Layered Neural Atlases for Consistent Video Editing | [**TOG 21 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.11418.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fykasten\u002Flayered-neural-atlases), [Project](https:\u002F\u002Flayered-neural-atlases.github.io\u002F) |\n| 7) **StableVideo**: Text-driven Consistency-aware Diffusion Video Editing | [**ICCV 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.09592), [GitHub](https:\u002F\u002Fgithub.com\u002Frese1f\u002FStableVideo), [Project](https:\u002F\u002Frese1f.github.io\u002FStableVideo\u002F) |\n| 8) **CoDeF**: Content Deformation Fields for Temporally Consistent Video Processing | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.07926.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fqiuyu96\u002FCoDeF), [Project](https:\u002F\u002Fqiuyu96.github.io\u002FCoDeF\u002F) |\n| 9) Sora Generates Videos with Stunning Geometrical Consistency | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.17403.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fmeteorshowers\u002FSora-Generates-Videos-with-Stunning-Geometrical-Consistency), [Project](https:\u002F\u002Fsora-geometrical-consistency.github.io\u002F) |\n| 10) Efficient One-stage Video Object Detection by Exploiting Temporal Consistency | [**ECCV 22 Paper**](https:\u002F\u002Fwww.ecva.net\u002Fpapers\u002Feccv_2022\u002Fpapers_ECCV\u002Fpapers\u002F136950001.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fguanxiongsun\u002FEOVOD) |\n| 11) Bootstrap Motion Forecasting With Self-Consistent Constraints | [**ICCV 23 
Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10377383) |\n| 12) Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting | [**Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fbook\u002F10.5555\u002FAAI28845594) |\n| 13) Enhancing Multi-Camera People Tracking with Anchor-Guided Clustering and Spatio-Temporal Consistency ID Re-Assignment | [**CVPRW 23 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10208943), [GitHub](https:\u002F\u002Fgithub.com\u002Fipl-uw\u002FAIC23_Track1_UWIPL_ETRI\u002Ftree\u002Fmain) |\n| 14) Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2109.02281) |\n| 15) Semi-Supervised Crowd Counting With Spatial Temporal Consistency and Pseudo-Label Filter | [**TCSVT 23 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10032602) |\n| 16) Spatio-temporal Consistency and Hierarchical Matching for Multi-Target Multi-Camera Vehicle Tracking | [**CVPRW 19 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_CVPRW_2019\u002Fhtml\u002FAI_City\u002FLi_Spatio-temporal_Consistency_and_Hierarchical_Matching_for_Multi-Target_Multi-Camera_Vehicle_Tracking_CVPRW_2019_paper.html) |\n| 17) **VideoDirectorGPT**: Consistent Multi-scene Video Generation via LLM-Guided Planning (-) | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15091) |\n| 18) **VideoDrafter**: Content-Consistent Multi-Scene Video Generation with LLM (-) | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.01256) |\n| 19) **MaskDiffusion**: Boosting Text-to-Image Consistency with Conditional Mask| [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04399) |\n| \u003Ch3 id=\"prompt-engineering\">11 Prompt Engineering\u003C\u002Fh3> | |\n| **Paper**  | **Link** |\n| 1) **RealCompo**: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12908), [GitHub](https:\u002F\u002Fgithub.com\u002FYangLing0818\u002FRealCompo), [Project](https:\u002F\u002Fcominclip.github.io\u002FRealCompo_Page\u002F) |\n| 2) **Mastering Text-to-Image Diffusion**: Recaptioning, Planning, and Generating with Multimodal LLMs | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.11708), [GitHub](https:\u002F\u002Fgithub.com\u002FYangLing0818\u002FRPG-DiffusionMaster) |\n| 3) **LLM-grounded Diffusion**: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models | [**TMLR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13655), [GitHub](https:\u002F\u002Fgithub.com\u002FTonyLianLong\u002FLLM-groundedDiffusion) |\n| 4) **LLM BLUEPRINT**: ENABLING TEXT-TO-IMAGE GEN-ERATION WITH COMPLEX AND DETAILED PROMPTS | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.10640), [GitHub](https:\u002F\u002Fgithub.com\u002Fhananshafi\u002Fllmblueprint) |\n| 5) Progressive Text-to-Image Diffusion with Soft Latent Direction | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.09466) |\n| 6) Self-correcting LLM-controlled Diffusion Models | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16090), [GitHub](https:\u002F\u002Fgithub.com\u002Ftsunghan-wu\u002FSLD) |\n| 7) **LayoutLLM-T2I**: Eliciting Layout Guidance from LLM for Text-to-Image Generation | [**MM 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.05095) |\n| 8) **LayoutGPT**: Compositional Visual Planning and 
Generation with Large Language Models | [**NeurIPS 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.15393), [GitHub](https:\u002F\u002Fgithub.com\u002Fweixi-feng\u002FLayoutGPT) |\n| 9) **Gen4Gen**: Generative Data Pipeline for Generative Multi-Concept Composition | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.15504), [GitHub](https:\u002F\u002Fgithub.com\u002FlouisYen\u002FGen4Gen) |\n| 10) **InstructEdit**: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18047), [GitHub](https:\u002F\u002Fgithub.com\u002FQianWangX\u002FInstructEdit) |\n| 11) Controllable Text-to-Image Generation with GPT-4 | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18583) |\n| 12) LLM-grounded Video Diffusion Models | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17444) |\n| 13) **VideoDirectorGPT**: Consistent Multi-scene Video Generation via LLM-Guided Planning | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15091) |\n| 14) **FlowZero**: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15813), [Github](https:\u002F\u002Fgithub.com\u002Faniki-ly\u002FFlowZero), [Project](https:\u002F\u002Fflowzero-video.github.io\u002F) |\n| 15) **VideoDrafter**: Content-Consistent Multi-Scene Video Generation with LLM | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.01256) |\n| 16) **Free-Bloom**: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator | [**NeurIPS 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.14494) |\n| 17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.13812) |\n| 18) **MotionZero**: Exploiting Motion Priors for Zero-shot Text-to-Video Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16635) |\n| 19) **GPT4Motion**: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.12631) |\n| 20) Multimodal Procedural Planning via Dual Text-Image Prompting | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.01795), [Github](https:\u002F\u002Fgithub.com\u002FYujieLu10\u002FTIP) |\n| 21) **InstructCV**: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00390), [Github](https:\u002F\u002Fgithub.com\u002FAlaaLab\u002FInstructCV) |\n| 22) **DreamSync**: Aligning Text-to-Image Generation with Image Understanding Feedback | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17946) |\n| 23) **TaleCrafter**: Interactive Story Visualization with Multiple Characters | [**SIGGRAPH Asia 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00390) |\n| 24) **Reason out Your Layout**: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17126), [Github](https:\u002F\u002Fgithub.com\u002FXiaohui9607\u002FLLM_layout_generator) |\n| 25) **COLE**: A Hierarchical Generation Framework for Graphic Design | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16974) |\n| 26) Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision | [**ArXiv 
23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08056) |\n| 27) **Vlogger**: Make Your Dream A Vlog | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09414), [Github](https:\u002F\u002Fgithub.com\u002FVchitect\u002FVlogger) |\n| 28) **GALA3D**: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting | [**Paper**](https:\u002F\u002Fgithub.com\u002FVDIGPKU\u002FGALA3D) |\n| 29) **MuLan**: Multimodal-LLM Agent for Progressive Multi-Object Diffusion | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12741) |\n| \u003Ch4 id=\"theoretical-foundations-and-model-architecture\">Recaption\u003C\u002Fh4> |  |\n| **Paper** | **Link** |\n| 1) **LAVIE**: High-Quality Video Generation with Cascaded Latent Diffusion Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15103), [GitHub](https:\u002F\u002Fgithub.com\u002FVchitect\u002FLaVie) |\n| 2) **Reuse and Diffuse**: Iterative Denoising for Text-to-Video Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.03549), [GitHub](https:\u002F\u002Fgithub.com\u002Fanonymous0x233\u002FReuseAndDiffuse) |\n| 3) **CoCa**: Contrastive Captioners are Image-Text Foundation Models | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.01917), [Github](https:\u002F\u002Fgithub.com\u002Flucidrains\u002FCoCa-pytorch) |\n| 4) **CogView3**: Finer and Faster Text-to-Image Generation via Relay Diffusion | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05121) |\n| 5) **VideoChat**: Chat-Centric Video Understanding | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06355), [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything) |\n| 6) De-Diffusion Makes Text a Strong Cross-Modal Interface | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00618) |\n| 7) **HowToCaption**: Prompting LLMs to Transform Video Annotations at Scale | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.04900) |\n| 8) **SELMA**: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.06952) |\n| 9) **LLMGA**: Multimodal Large Language Model based Generation Assistant | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16500), [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLMGA) |\n| 10) **ELLA**: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05135), [Github](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA) |\n| 11) **MyVLM**: Personalizing VLMs for User-Specific Queries | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.14599.pdf) |\n| 12) A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.16656), [Github](https:\u002F\u002Fgithub.com\u002Fgirliemac\u002Fa-picture-is-worth-a-1000-words) |\n| 13) **Mastering Text-to-Image Diffusion**: Recaptioning, Planning, and Generating with Multimodal LLMs(-) | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2401.11708v2), [Github](https:\u002F\u002Fgithub.com\u002FYangLing0818\u002FRPG-DiffusionMaster) |\n| 14) **FlexCap**: Generating Rich, Localized, and Flexible Captions in Images | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.12026) |\n| 15) **Video ReCap**: Recursive Captioning of Hour-Long Videos | [**ArXiv 
24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.13250.pdf), [Github](https:\u002F\u002Fgithub.com\u002Fmd-mohaiminul\u002FVideoRecap) |\n| 16) **BLIP**: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | [**ICML 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.12086), [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FBLIP) |\n| 17) **PromptCap**: Prompt-Guided Task-Aware Image Captioning | [**ICCV 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.09699), [Github](https:\u002F\u002Fgithub.com\u002FYushi-Hu\u002FPromptCap) |\n| 18) **CIC**: A framework for Culturally-aware Image Captioning | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.05374) |\n| 19) Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.11593) |\n| 20) **FuseCap**: Leveraging Large Language Models for Enriched Fused Image Captions | [**WACV 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17718), [Github](https:\u002F\u002Fgithub.com\u002FRotsteinNoam\u002FFuseCap) |\n| \u003Ch3 id=\"security\">12 Security\u003C\u002Fh3> |  |\n| **Paper**  | **Link** |\n| 1) **BeaverTails:** Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002F4dbb61cb68671edc4ca3712d70083b9f-Paper-Datasets_and_Benchmarks.pdf), [Github](https:\u002F\u002Fgithub.com\u002FPKU-Alignment\u002Fbeavertails) |\n| 2) **LIMA:** Less Is More for Alignment | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002Fac662d74829e4407ce1d126477f4a03a-Paper-Conference.pdf) |\n| 3) **Jailbroken:** How Does LLM Safety Training Fail? | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002Ffd6613131889a4b656206c50a8bd7790-Paper-Conference.pdf) |\n| 4) **Safe Latent Diffusion:** Mitigating Inappropriate Degeneration in Diffusion Models | [**CVPR 23 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FSchramowski_Safe_Latent_Diffusion_Mitigating_Inappropriate_Degeneration_in_Diffusion_Models_CVPR_2023_paper.pdf) |\n| 5) **Stable Bias:** Evaluating Societal Representations in Diffusion Models | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002Fb01153e7112b347d8ed54f317840d8af-Paper-Datasets_and_Benchmarks.pdf) |\n| 6) Ablating concepts in text-to-image  diffusion models | **[ICCV 23 Paper](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fpapers\u002FKumari_Ablating_Concepts_in_Text-to-Image_Diffusion_Models_ICCV_2023_paper.pdf)** |\n| 7) Diffusion art or digital forgery?  
investigating data replication in diffusion models | [**ICCV 23 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FSomepalli_Diffusion_Art_or_Digital_Forgery_Investigating_Data_Replication_in_Diffusion_CVPR_2023_paper.pdf), [Project](https:\u002F\u002Fsomepago.github.io\u002Fdiffrep.html) |\n| 8) Eternal Sunshine of the Spotless Net:  Selective Forgetting in Deep Networks | **[ICCV 20 Paper](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_CVPR_2020\u002Fpapers\u002FGolatkar_Eternal_Sunshine_of_the_Spotless_Net_Selective_Forgetting_in_Deep_CVPR_2020_paper.pdf)** |\n| 9) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks | [**ICML 20 Paper**](http:\u002F\u002Fproceedings.mlr.press\u002Fv119\u002Fcroce20b\u002Fcroce20b.pdf) |\n| 10) A pilot study of query-free adversarial  attack against stable diffusion | **[ICCV 23 Paper](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023W\u002FAML\u002Fpapers\u002FZhuang_A_Pilot_Study_of_Query-Free_Adversarial_Attack_Against_Stable_Diffusion_CVPRW_2023_paper.pdf)** |\n| 11) Interpretable-Through-Prototypes Deepfake Detection for Diffusion Models | **[ICCV 23 Paper](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023W\u002FDFAD\u002Fpapers\u002FAghasanli_Interpretable-Through-Prototypes_Deepfake_Detection_for_Diffusion_Models_ICCVW_2023_paper.pdf)** |\n| 12) Erasing Concepts from Diffusion Models                   | **[ICCV 23 Paper](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fpapers\u002FGandikota_Erasing_Concepts_from_Diffusion_Models_ICCV_2023_paper.pdf)**, [Project](http:\u002F\u002Ferasing.baulab.info\u002F) |\n| 13) Ablating Concepts in Text-to-Image Diffusion Models      | **[ICCV 23 Paper](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fpapers\u002FKumari_Ablating_Concepts_in_Text-to-Image_Diffusion_Models_ICCV_2023_paper.pdf)**, [Project](https:\u002F\u002Fwww.cs.cmu.edu\u002F) |\n| 14) **BEAVERTAILS:** Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | **[NeurIPS 23 Paper](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002F4dbb61cb68671edc4ca3712d70083b9f-Paper-Datasets_and_Benchmarks.pdf)**, [Project](https:\u002F\u002Fsites.google.com\u002Fview\u002Fpku-beavertails) |\n| 15) **Stable Bias:** Evaluating Societal Representations in Diffusion Models | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002Fb01153e7112b347d8ed54f317840d8af-Paper-Datasets_and_Benchmarks.pdf) |\n| 16) Threat Model-Agnostic Adversarial Defense using Diffusion Models | **[Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08089)**                |\n| 17) How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions? 
| [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15230), [Github](https:\u002F\u002Fgithub.com\u002FHritikbansal\u002Fentigen_emnlp) |\n| 18) Differentially Private Diffusion Models Generate Useful Synthetic Images | **[Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.13861)**                |\n| 19) Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models | **[SIGSAC 23 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13873)**, [Github](https:\u002F\u002Fgithub.com\u002FYitingQu\u002Funsafe-diffusion) |\n| 20) Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models | **[Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.17591)**, [Github](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FForget-Me-Not) |\n| 21) Unified Concept Editing in Diffusion Models              | [**WACV 24 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FWACV2024\u002Fpapers\u002FGandikota_Unified_Concept_Editing_in_Diffusion_Models_WACV_2024_paper.pdf), [Project](https:\u002F\u002Funified.baulab.info\u002F) |\n| 22) Diffusion Model Alignment Using Direct Preference Optimization | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.12908) |\n| 23) **RAFT:** Reward rAnked FineTuning for Generative Foundation Model Alignment | [**TMLR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.06767) , [Github](https:\u002F\u002Fgithub.com\u002FOptimalScale\u002FLMFlow) |\n| 24) Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.05699), [Github](https:\u002F\u002Fgithub.com\u002FShuoTang123\u002FMATRIX), [Project](https:\u002F\u002Fshuotang123.github.io\u002FMATRIX\u002F) |\n| \u003Ch3 id=\"world-model\">13 World Model\u003C\u002Fh3> | |\n| **Paper**  | **Link** |\n| 1) **NExT-GPT**: Any-to-Any Multimodal LLM | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05519), [GitHub](https:\u002F\u002Fgithub.com\u002FNExT-GPT\u002FNExT-GPT) |\n| \u003Ch3 id=\"video-compression\">14 Video Compression\u003C\u002Fh3> ||\n| **Paper**  | **Link** |\n| 1) **H.261**: Video codec for audiovisual services at p x 64 kbit\u002Fs | [**Paper**](https:\u002F\u002Fwww.itu.int\u002Frec\u002FT-REC-H.261-199303-I\u002Fen) |\n| 2) **H.262**: Information technology - Generic coding of moving pictures and associated audio information: Video | [**Paper**](https:\u002F\u002Fwww.itu.int\u002Frec\u002FT-REC-H.262-201202-I\u002Fen) |\n| 3) **H.263**: Video coding for low bit rate communication | [**Paper**](https:\u002F\u002Fwww.itu.int\u002Frec\u002FT-REC-H.263-200501-I\u002Fen) |\n| 4) **H.264**: Overview of the H.264\u002FAVC video coding standard | [**Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F1218189) |\n| 5) **H.265**: Overview of the High Efficiency Video Coding (HEVC) Standard | [**Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F6316136) |\n| 6) **H.266**: Overview of the Versatile Video Coding (VVC) Standard and its Applications | [**Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F9503377) |\n| 7) **DVC**: An End-to-end Deep Video Compression Framework | [**CVPR 19 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1812.00101), [GitHub](https:\u002F\u002Fgithub.com\u002FGuoLusjtu\u002FDVC\u002Ftree\u002Fmaster) |\n| 8) **OpenDVC**: An Open Source Implementation of the DVC Video Compression Method | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.15862), 
[GitHub](https:\u002F\u002Fgithub.com\u002FRenYang-home\u002FOpenDVC) |\n| 9) **HLVC**: Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement | [**CVPR 20 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2003.01966), [Github](https:\u002F\u002Fgithub.com\u002FRenYang-home\u002FHLVC) |\n| 10) **RLVC**: Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model | [**J-STSP 21 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9288876), [Github](https:\u002F\u002Fgithub.com\u002FRenYang-home\u002FRLVC) |\n| 11) **PLVC**: Perceptual Learned Video Compression with Recurrent Conditional GAN | [**IJCAI 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2109.03082), [Github](https:\u002F\u002Fgithub.com\u002FRenYang-home\u002FPLVC) |\n| 12) **ALVC**: Advancing Learned Video Compression with In-loop Frame Prediction | [**T-CSVT 22 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9950550), [Github](https:\u002F\u002Fgithub.com\u002FRenYang-home\u002FALVC) |\n| 13) **DCVC**: Deep Contextual Video Compression | [**NeurIPS 21 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2021\u002Ffile\u002F96b250a90d3cf0868c83f8c965142d2a-Paper.pdf), [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDCVC\u002Ftree\u002Fmain\u002FDCVC) |\n| 14) **DCVC-TCM**: Temporal Context Mining for Learned Video Compression | [**TM 22 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F9941493), [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDCVC\u002Ftree\u002Fmain\u002FDCVC-TCM) |\n| 15) **DCVC-HEM**: Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression | [**MM 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.05894), [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDCVC\u002Ftree\u002Fmain\u002FDCVC-HEM) |\n| 16) **DCVC-DC**: Neural Video Compression with Diverse Contexts | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.14402), [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDCVC\u002Ftree\u002Fmain\u002FDCVC-DC) |\n| 17) **DCVC-FM**: Neural Video Compression with Feature Modulation | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17414), [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDCVC\u002Ftree\u002Fmain\u002FDCVC-FM) |\n| 18) **SSF**: Scale-Space Flow for End-to-End Optimized Video Compression | [**CVPR 20 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_CVPR_2020\u002Fhtml\u002FAgustsson_Scale-Space_Flow_for_End-to-End_Optimized_Video_Compression_CVPR_2020_paper.html), [Github](https:\u002F\u002Fgithub.com\u002FInterDigitalInc\u002FCompressAI) |\n| \u003Ch3 id=\"Mamba\">15 Mamba\u003C\u002Fh3> ||\n| \u003Ch4 id=\"theoretical-foundations-and-model-architecture\">15.1 Theoretical Foundations and Model Architecture\u003C\u002Fh4> | |\n| **Paper** | **Link** |\n| 1) **Mamba**: Linear-Time Sequence Modeling with Selective State Spaces | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.00752), [Github](https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fmamba) |\n| 2) Efficiently Modeling Long Sequences with Structured State Spaces | [**ICLR 22 Paper**](https:\u002F\u002Ficlr.cc\u002Fvirtual\u002F2022\u002Fposter\u002F6959), [Github](https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fs4) |\n| 3) Modeling Sequences with Structured State Spaces | [**Paper**](https:\u002F\u002Fpurl.stanford.edu\u002Fmb976vf9362) |\n| 4) Long Range 
Language Modeling via Gated State Spaces | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.13947), [GitHub](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fgated-state-spaces-pytorch) |\n| \u003Ch4 id=\"image-generation-and-visual-applications\">15.2 Image Generation and Visual Applications\u003C\u002Fh4> | |\n| **Paper** | **Link** |\n| 1) Diffusion Models Without Attention | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.18257) |\n| 2) **Pan-Mamba**: Effective Pan-Sharpening with State Space Model  | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12192), [Github](https:\u002F\u002Fgithub.com\u002Falexhe101\u002FPan-Mamba) |\n| 3) Pretraining Without Attention | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.10544), [Github](https:\u002F\u002Fgithub.com\u002Fjxiw\u002FBiGS) |\n| 4) Block-State Transformers | [**NIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F16ccd203e9e3696a7ab0dcf568316379-Abstract-Conference.html) |\n| 5) **Vision Mamba**: Efficient Visual Representation Learning with Bidirectional State Space Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09417), [Github](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FVim) |\n| 6) VMamba: Visual State Space Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10166), [Github](https:\u002F\u002Fgithub.com\u002FMzeroMiko\u002FVMamba) |\n| 7) ZigMa: Zigzag Mamba Diffusion Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.13802), [Github](https:\u002F\u002Ftaohu.me\u002Fzigma\u002F) |\n| 8) **MambaVision**: A Hybrid Mamba-Transformer Vision Backbone | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.08083), [GitHub](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FMambaVision) |\n| \u003Ch4 id=\"video-processing-and-understanding\">15.3 Video Processing and Understanding\u003C\u002Fh4> | |\n| **Paper** | **Link** |\n| 1) Long Movie Clip Classification with State-Space Video Models | [**ECCV 22 Paper**](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-031-19833-5_6), [Github](https:\u002F\u002Fgithub.com\u002Fmd-mohaiminul\u002FViS4mer) |\n| 2) Selective Structured State-Spaces for Long-Form Video Understanding | [**CVPR 23 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fhtml\u002FWang_Selective_Structured_State-Spaces_for_Long-Form_Video_Understanding_CVPR_2023_paper.html) |\n| 3) Efficient Movie Scene Detection Using State-Space Transformers | [**CVPR 23 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fhtml\u002FIslam_Efficient_Movie_Scene_Detection_Using_State-Space_Transformers_CVPR_2023_paper.html), [Github](https:\u002F\u002Fgithub.com\u002Fmd-mohaiminul\u002FTranS4mer) |\n| 4) VideoMamba: State Space Model for Efficient Video Understanding | [**Paper**](http:\u002F\u002Farxiv.org\u002Fabs\u002F2403.06977), [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FVideoMamba) |\n| \u003Ch4 id=\"medical-image-processing\">15.4 Medical Image Processing\u003C\u002Fh4> | |\n| **Paper** | **Link** |\n| 1) **Swin-UMamba**: Mamba-based UNet with ImageNet-based pretraining | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.03302), [Github](https:\u002F\u002Fgithub.com\u002FJiarunLiu\u002FSwin-UMamba) |\n| 2) **MambaIR**: A Simple Baseline for Image Restoration with State-Space Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.15648), 
[Github](https:\u002F\u002Fgithub.com\u002Fcsguoh\u002FMambaIR) |\n| 3) VM-UNet: Vision Mamba UNet for Medical Image Segmentation | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.02491), [Github](https:\u002F\u002Fgithub.com\u002FJCruan519\u002FVM-UNet) |\n|  | |\n| \u003Ch3 id=\"existing-high-quality-resources\">16 Existing high-quality resources\u003C\u002Fh3> | |\n| **Resources**  | **Link** |\n| 1) Datawhale - AI视频生成学习 | [Feishu doc](https:\u002F\u002Fdatawhaler.feishu.cn\u002Fdocx\u002FG4LkdaffWopVbwxT1oHceiv9n0c) |\n| 2) A Survey on Generative Diffusion Model | [**TKDE 24 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02646.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fchq1155\u002FA-Survey-on-Generative-Diffusion-Model) |\n| 3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.10647), [GitHub](https:\u002F\u002Fgithub.com\u002FChenHsing\u002FAwesome-Video-Diffusion-Models) |\n| 4) Awesome-Text-To-Video：A Survey on Text-to-Video Generation\u002FSynthesis  | [GitHub](https:\u002F\u002Fgithub.com\u002Fjianzhnie\u002Fawesome-text-to-video)|\n| 5) video-generation-survey: A reading list of video generation| [GitHub](https:\u002F\u002Fgithub.com\u002Fyzhang2016\u002Fvideo-generation-survey)|\n| 6) Awesome-Video-Diffusion |  [GitHub](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FAwesome-Video-Diffusion) |\n| 7) Video Generation Task in Papers With Code |  [Task](https:\u002F\u002Fpaperswithcode.com\u002Ftask\u002Fvideo-generation) |\n| 8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17177), [GitHub](https:\u002F\u002Fgithub.com\u002Flichao-sun\u002FSoraReview) |\n| 9) Open-Sora-Plan (PKU-YuanGroup) |  [GitHub](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOpen-Sora-Plan) |\n| 10) State of the Art on Diffusion Models for Visual Computing | [**Paper**](http:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07204) |\n| 11) Diffusion Models: A Comprehensive Survey of Methods and Applications | [**CSUR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.00796), [GitHub](https:\u002F\u002Fgithub.com\u002FYangLing0818\u002FDiffusion-Models-Papers-Survey-Taxonomy) |\n| 12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable | [**Paper**](https:\u002F\u002Fwww.techrxiv.org\u002Fusers\u002F684880\u002Farticles\u002F718900-generate-impressive-videos-with-text-instructions-a-review-of-openai-sora-stable-diffusion-lumiere-and-comparable) |\n| 13) On the Design Fundamentals of Diffusion Models: A Survey | [**Paper**](http:\u002F\u002Farxiv.org\u002Fabs\u002F2306.04542) |\n| 14) Efficient Diffusion Models for Vision: A Survey | [**Paper**](http:\u002F\u002Farxiv.org\u002Fabs\u002F2210.09292) |\n| 15) Text-to-Image Diffusion Models in Generative AI: A Survey | [**Paper**](http:\u002F\u002Farxiv.org\u002Fabs\u002F2303.07909) |\n| 16) Awesome-Diffusion-Transformers | [GitHub](https:\u002F\u002Fgithub.com\u002FShoufaChen\u002FAwesome-Diffusion-Transformers), [Project](https:\u002F\u002Fwww.shoufachen.com\u002FAwesome-Diffusion-Transformers\u002F) |\n| 17) Open-Sora (HPC-AI Tech) |  [GitHub](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora), [Blog](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fopen-sora) |\n| 18) **LAVIS** - A Library for Language-Vision Intelligence | [**ACL 23 
Paper**](https:\u002F\u002Faclanthology.org\u002F2023.acl-demo.3.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002Flavis), [Project](https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Findex.html) |\n| 19) **OpenDiT**: An Easy, Fast and Memory-Efficient System for DiT Training and Inference | [GitHub](https:\u002F\u002Fgithub.com\u002FNUS-HPC-AI-Lab\u002FOpenDiT) |\n| 20) Awesome-Long-Context |[GitHub1](https:\u002F\u002Fgithub.com\u002Fzetian1025\u002Fawesome-long-context), [GitHub2](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FAwesome-Long-Context) |\n| 21) Lite-Sora |[GitHub](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Flite-sora\u002F) |\n| 22) **Mira**: A Mini-step Towards Sora-like Long Video Generation |[GitHub](https:\u002F\u002Fgithub.com\u002Fmira-space\u002FMira), [Project](https:\u002F\u002Fmira-space.github.io\u002F) |\n| \u003Ch3 id=\"train\">17 Efficient Training\u003C\u002Fh3> | |\n| \u003Ch4 id=\"train_paral\">17.1 Parallelism based Approach\u003C\u002Fh4> | |\n| \u003Ch5 id=\"train_paral_dp\">17.1.1 Data Parallelism (DP)\u003C\u002Fh5> | |\n| 1) A bridging model for parallel computation | [**Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F79173.79181)|\n| 2) PyTorch Distributed: Experiences on Accelerating Data Parallel Training | [**VLDB 20 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.15704) |\n| \u003Ch5 id=\"train_paral_mp\">17.1.2 Model Parallelism (MP)\u003C\u002Fh5> | |\n| 1) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | [**ArXiv 19 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1909.08053) |\n| 2) TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models | [**PMLR 21 Paper**](https:\u002F\u002Fproceedings.mlr.press\u002Fv139\u002Fli21y.html) |\n| \u003Ch5 id=\"train_paral_pp\">17.1.3 Pipeline Parallelism (PP)\u003C\u002Fh5> | |\n| 1) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | [**NeurIPS 19 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2019\u002Fhash\u002F093f65e080a295f8076b1c5722a46aa2-Abstract.html) |\n| 2) PipeDream: generalized pipeline parallelism for DNN training | [**SOSP 19 Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3341301.3359646) |\n| \u003Ch5 id=\"train_paral_gp\">17.1.4 Generalized Parallelism (GP)\u003C\u002Fh5> | |\n| 1) Mesh-TensorFlow: Deep Learning for Supercomputers | [**ArXiv 18 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.02084) |\n| 2) Beyond Data and Model Parallelism for Deep Neural Networks | [**MLSys 19 Paper**](https:\u002F\u002Fproceedings.mlsys.org\u002Fpaper_files\u002Fpaper\u002F2019\u002Fhash\u002Fb422680f3db0986ddd7f8f126baaf0fa-Abstract.html) |\n| \u003Ch5 id=\"train_paral_zp\">17.1.5 ZeRO Parallelism (ZP)\u003C\u002Fh5> | |\n| 1) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | [**ArXiv 20**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.02054) |\n| 2) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters | [**ACM 20 Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3394486.3406703) |\n| 3) ZeRO-Offload: Democratizing Billion-Scale Model Training | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2101.06840) |\n| 4) PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.11277) |\n| 
\u003Ch4 id=\"train_non\">17.2 Non-parallelism based Approach\u003C\u002Fh4> | |\n| \u003Ch5 id=\"train_non_reduce\">17.2.1 Reducing Activation Memory\u003C\u002Fh5> | |\n| 1) Gist: Efficient Data Encoding for Deep Neural Network Training | [**IEEE 18 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8416872) |\n| 2) Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization | [**MLSys 20 Paper**](https:\u002F\u002Fproceedings.mlsys.org\u002Fpaper_files\u002Fpaper\u002F2020\u002Fhash\u002F0b816ae8f06f8dd3543dc3d9ef196cab-Abstract.html) |\n| 3) Training Deep Nets with Sublinear Memory Cost | [**ArXiv 16 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.06174) |\n| 4) Superneurons: dynamic GPU memory management for training deep neural networks | [**ACM 18 Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3178487.3178491) |\n| \u003Ch5 id=\"train_non_cpu\">17.2.2 CPU-Offloading\u003C\u002Fh5> | |\n| 1) Training Large Neural Networks with Constant Memory using a New Execution Algorithm | [**ArXiv 20 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.05645) |\n| 2) vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design | [**IEEE 16 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F7783721) |\n| \u003Ch5 id=\"train_non_mem\">17.2.3 Memory Efficient Optimizer\u003C\u002Fh5> | |\n| 1) Adafactor: Adaptive Learning Rates with Sublinear Memory Cost | [**PMLR 18 Paper**](https:\u002F\u002Fproceedings.mlr.press\u002Fv80\u002Fshazeer18a.html?ref=https:\u002F\u002Fgithubhelp.com) |\n| 2) Memory-Efficient Adaptive Optimization for Large-Scale Learning | [**Paper**](http:\u002F\u002Fdml.mathdoc.fr\u002Fitem\u002F1901.11150\u002F) |\n| \u003Ch4 id=\"train_struct\">17.3 Novel Structure\u003C\u002Fh4> | |\n| 1) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05135) [Github](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA) |\n| \u003Ch3 id=\"infer\">18 Efficient Inference\u003C\u002Fh3> | |\n| \u003Ch4 id=\"infer_reduce\">18.1 Reduce Sampling Steps\u003C\u002Fh4> | |\n| \u003Ch5 id=\"infer_reduce_continuous\">18.1.1 Continuous Steps\u003C\u002Fh4> | |\n| 1) Generative Modeling by Estimating Gradients of the Data Distribution | [**NeurIPS 19 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1907.05600) |\n| 2) WaveGrad: Estimating Gradients for Waveform Generation | [**ArXiv 20**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.00713) |\n| 3) Noise Level Limited Sub-Modeling for Diffusion Probabilistic Vocoders | [**ICASSP 21 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9415087) |\n| 4) Noise Estimation for Generative Diffusion Models | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.02600) |\n| \u003Ch5 id=\"infer_reduce_fast\">18.1.2 Fast Sampling\u003C\u002Fh5> | |\n| 1) Denoising Diffusion Implicit Models | [**ICLR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.02502) |\n| 2) DiffWave: A Versatile Diffusion Model for Audio Synthesis | [**ICLR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.09761) |\n| 3) On Fast Sampling of Diffusion Probabilistic Models | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.00132) |\n| 4) DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps | [**NeurIPS 22 
Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.00927) |\n| 5) DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.01095) |\n| 6) Fast Sampling of Diffusion Models with Exponential Integrator | [**ICLR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.13902) |\n| \u003Ch5 id=\"infer_reduce_dist\">18.1.3 Step distillation\u003C\u002Fh5> | |\n| 1) On Distillation of Guided Diffusion Models | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.03142) |\n| 2) Progressive Distillation for Fast Sampling of Diffusion Models | [**ICLR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.00512) |\n| 3) SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F41bcc9d3bddd9c90e1f44b29e26d97ff-Abstract-Conference.html) |\n| 4) Tackling the Generative Learning Trilemma with Denoising Diffusion GANs | [**ICLR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.07804) |\n| \u003Ch4 id=\"infer_opt\">18.2 Optimizing Inference\u003C\u002Fh4> | |\n| \u003Ch5 id=\"infer_opt_low\">18.2.1 Low-bit Quantization\u003C\u002Fh5> | |\n| 1) Q-Diffusion: Quantizing Diffusion Models | [**CVPR 23 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fhtml\u002FLi_Q-Diffusion_Quantizing_Diffusion_Models_ICCV_2023_paper.html) |\n| 2) Q-DM: An Efficient Low-bit Quantized Diffusion Model | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002Ff1ee1cca0721de55bb35cf28ab95e1b4-Abstract-Conference.html) |\n| 3) Temporal Dynamic Quantization for Diffusion Models | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F983591c3e9a0dc94a99134b3238bbe52-Abstract-Conference.html) |\n| \u003Ch5 id=\"infer_opt_ps\">18.2.2 Parallel\u002FSparse inference\u003C\u002Fh5> | |\n| 1) DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.19481) |\n| 2) Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models | [**NeurIPS 22 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2022\u002Fhash\u002Fb9603de9e49d0838e53b6c9cf9d06556-Abstract-Conference.html) |\n| 3) PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14430) |\n\n## Citation\n\nIf this project is helpful to your work, please cite it using the following format:\n\n```bibtex\n@misc{minisora,\n    title={MiniSora},\n    author={MiniSora Community},\n    url={https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora},\n    year={2024}\n}\n```\n\n```bibtex\n@misc{minisora,\n    title={Diffusion Model-based Video Generation Models From DDPM to Sora: A Survey},\n    author={Survey Paper Group of MiniSora Community},\n    url={https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora},\n    year={2024}\n}\n```\n\n## Minisora Community WeChat Group\n\n\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_d3907a8723c3.png\" width=\"200\"\u002F>\n  \u003Cdiv>&nbsp;\u003C\u002Fdiv>\n  \u003Cdiv align=\"center\">\n  
\u003C\u002Fdiv>\n\u003C\u002Fdiv>\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_18a48be19e7f.png)](https:\u002F\u002Fstar-history.com\u002F#mini-sora\u002Fminisora&Date)\n\n## How to Contribute to the Mini Sora Community\n\nWe greatly appreciate your contributions to the Mini Sora open-source community and helping us make it even better than it is now!\n\nFor more details, please refer to the [Contribution Guidelines](.\u002F.github\u002FCONTRIBUTING.md)\n\n## Community contributors\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_a338462049c0.png\" \u002F>\n\u003C\u002Fa>\n\n[your-project-path]: mini-sora\u002Fminisora\n[contributors-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002Fmini-sora\u002Fminisora.svg?style=flat-square\n[contributors-url]: https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fgraphs\u002Fcontributors\n[forks-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fmini-sora\u002Fminisora.svg?style=flat-square\n[forks-url]: https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fnetwork\u002Fmembers\n[stars-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmini-sora\u002Fminisora.svg?style=flat-square\n[stars-url]: https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fstargazers\n[issues-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Fmini-sora\u002Fminisora.svg?style=flat-square\n[issues-url]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Fmini-sora\u002Fminisora.svg\n[license-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fmini-sora\u002Fminisora.svg?style=flat-square\n[license-url]: https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fblob\u002Fmain\u002FLICENSE\n","# MiniSora 社区\n\n\u003C!-- 项目盾牌 -->\n\n[![贡献者][contributors-shield]][contributors-url]\n[![分支][forks-shield]][forks-url]\n[![问题][issues-shield]][issues-url]\n[![MIT 许可证][license-shield]][license-url]\n[![星标][stars-shield]][stars-url]\n\u003Cbr \u002F>\n\n\u003Cdiv align=\"center\">\n\u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F8252\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_1145cd82417e.png\" alt=\"mini-sora%2Fminisora | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003C!-- 项目Logo -->\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_4c0382e96db7.jpg\" width=\"600\"\u002F>\n\n  \u003Cdiv>&nbsp;\u003C\u002Fdiv>\n  \u003Cdiv align=\"center\">\n  \u003C\u002Fdiv>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\n英语 | [简体中文](README_zh-CN.md)\n\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n    👋 加入我们的 \u003Ca href=\"https:\u002F\u002Fcdn.vansin.top\u002Fminisora.jpg\" target=\"_blank\">微信\u003C\u002Fa>\n\u003C\u002Fp>\n\nMiniSora 开源社区定位为由社区成员自发组织的社区驱动型倡议。MiniSora 社区旨在探索 Sora 的实现路径及未来发展方向。\n\n- 将定期与 Sora 团队及社区举行圆桌讨论，探讨各种可能性。\n- 深入研究现有的视频生成技术路径。\n- 引领复现与 Sora 相关的论文或研究成果，例如 DiT（[MiniSora-DiT](https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora-DiT)）等。\n- 对 Sora 相关技术和其实现进行全面回顾，即“从 DDPM 到 Sora：基于扩散模型的视频生成模型综述”。\n\n## 热点新闻\n\n- [OpenAI 
Sora](https:\u002F\u002Fopenai.com\u002Findex\u002Fsora-system-card\u002F) 即将发布！\n- [**Movie Gen**：一系列媒体基础模型](https:\u002F\u002Fai.meta.com\u002Fstatic-resource\u002Fmovie-gen-research-paper)\n- [**Stable Diffusion 3**：MM-DiT：用于高分辨率图像合成的缩放修正流变换器](https:\u002F\u002Fstability.ai\u002Fnews\u002Fstable-diffusion-3-research-paper)\n- [**MiniSora-DiT**](..\u002Fminisora-DiT\u002FREADME.md)：使用 XTuner 复现 DiT 论文\n- [**MiniSora 简介及复现 Sora 的最新进展**](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_7bbc71f17e0e.png)\n\n![[空](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_7bbc71f17e0e.png)](.\u002Fdocs\u002FMinisora_LPRS\u002F0001.jpg)\n\n## [MiniSora 社区复现小组](.\u002Fcodes\u002FREADME.md)\n\n### MiniSora 复现 Sora 的目标\n\n1. **GPU 友好**：理想情况下，对 GPU 显存大小和 GPU 数量的要求较低，例如能够使用类似 8 张 A100 80G 显卡、8 张 A6000 48G 显卡或 RTX4090 24G 显卡的算力进行训练和推理。\n2. **训练效率**：无需长时间训练即可取得良好效果。\n3. **推理效率**：在推理生成视频时，无需高长度或高分辨率；可接受的参数包括 3–10 秒长度和 480p 分辨率。\n\n### [MiniSora-DiT](https:\u002F\u002Fgithub.com\u002Fmini-sora\u002FMiniSora-DiT)：使用 XTuner 复现 DiT 论文\n\n[https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora-DiT](https:\u002F\u002Fgithub.com\u002Fmini-sora\u002FMiniSora-DiT)\n\n#### 要求\n\n我们正在招募 MiniSora 社区贡献者，使用 [XTuner](https:\u002F\u002Fgithub.com\u002FinternLM\u002Fxtuner) 复现 `DiT`。\n\n我们希望社区成员具备以下特点：\n\n1. 熟悉 `OpenMMLab MMEngine` 机制。\n2. 熟悉 `DiT`。\n\n#### 背景\n\n1. `DiT` 的作者与 `Sora` 的作者相同。\n2. [XTuner](https:\u002F\u002Fgithub.com\u002FinternLM\u002Fxtuner) 具有高效训练长度为 `1000K` 序列的核心技术。\n\n#### 支持\n\n1. 计算资源：2 张 A100。\n2. 来自 [XTuner](https:\u002F\u002Fgithub.com\u002FinternLM\u002Fxtuner) 核心开发者 [P佬@pppppM](https:\u002F\u002Fgithub.com\u002FpppppM) 的大力支持。\n\n## 最近的圆桌讨论\n\n### Stable Diffusion 3 论文解读：MM-DiT\n\n**演讲者**：MMagic 核心贡献者\n\n**直播时间**：12 月 3 日 20:00\n\n**亮点**：MMagic 核心贡献者将带领大家解读 Stable Diffusion 3 论文，讨论 Stable Diffusion 3 的架构细节和设计原则。\n\n**PPT**：[飞书链接](https:\u002F\u002Faicarrier.feishu.cn\u002Ffile\u002FNXnTbo5eqo8xNYxeHnecjLdJnQq)\n\n\u003C!-- 请使用微信扫描二维码预约直播。\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_bca6e068793f.png\" width=\"100\"\u002F>\n\n  \u003Cdiv>&nbsp;\u003C\u002Fdiv>\n  \u003Cdiv align=\"center\">\n  \u003C\u002Fdiv>\n\u003C\u002Fdiv> -->\n\n### 往期讨论亮点\n\n#### [**与 Sora 的夜间谈话：视频扩散概述**](https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fblob\u002Fmain\u002Fnotes\u002FREADME.md)\n\n**知乎笔记**：[生成式扩散模型综述：生成式扩散模型概览](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F684795460)\n\n## [论文阅读计划](.\u002Fnotes\u002FREADME.md)\n\n- [**Sora**：从文本创建视频](https:\u002F\u002Fopenai.com\u002Fsora)\n- **技术报告**：[视频生成模型作为世界模拟器](https:\u002F\u002Fopenai.com\u002Fresearch\u002Fvideo-generation-models-as-world-simulators)\n- **Latte**：[Latte：用于视频生成的潜在扩散 Transformer](https:\u002F\u002Fmaxin-cn.github.io\u002Flatte_project\u002F)\n  - [Latte 论文解读（简体中文）](.\u002Fnotes\u002FLatte.md)，[知乎（简体中文）](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F686407292)\n- **DiT**：[具有 Transformer 的可扩展扩散模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09748)\n- **Stable Cascade（ICLR 24 论文）**：[Würstchen：用于大规模文本到图像扩散模型的高效架构](https:\u002F\u002Fopenreview.net\u002Fforum?id=gU58d5QeGv)\n- [**Stable Diffusion 3**：MM-DiT：用于高分辨率图像合成的缩放修正流变换器](https:\u002F\u002Fstability.ai\u002Fnews\u002Fstable-diffusion-3-research-paper)\n  - [SD3 论文解读（简体中文）](.\u002Fnotes\u002FSD3_zh-CN.md)，[知乎（简体中文）](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F686273242)\n\n- 更新中...\n\n### 演讲者招募\n\n- [**DiT**（ICCV 23 
论文）](https:\u002F\u002Fgithub.com\u002Forgs\u002Fmini-sora\u002Fdiscussions\u002F39)\n- [**Stable Cascade**（ICLR 24 论文）](https:\u002F\u002Fgithub.com\u002Forgs\u002Fmini-sora\u002Fdiscussions\u002F145)\n\n## 相关工作\n\n- 01 [扩散模型](#diffusion-models)\n- 02 [扩散Transformer](#diffusion-transformer)\n- 03 [基准视频生成模型](#baseline-video-generation-models)\n- 04 [扩散UNet](#diffusion-unet)\n- 05 [视频生成](#video-generation)\n- 06 [数据集](#dataset)\n  - 6.1 [公开数据集](#dataset_paper)\n  - 6.2 [视频增强方法](#video_aug)\n    - 6.2.1 [基础变换](#video_aug_basic)\n    - 6.2.2 [特征空间](#video_aug_feature)\n    - 6.2.3 [基于GAN的增强](#video_aug_gan)\n    - 6.2.4 [基于编码器\u002F解码器的方法](#video_aug_ed)\n    - 6.2.5 [仿真](#video_aug_simulation)\n- 07 [补丁化方法](#patchifying-methods)\n- 08 [长上下文](#long-context)\n- 09 [音频相关资源](#audio-related-resource)\n- 10 [一致性](#consistency)\n- 11 [提示工程](#prompt-engineering)\n- 12 [安全性](#security)\n- 13 [世界模型](#world-model)\n- 14 [视频压缩](#video-compression)\n- 15 [Mamba](#Mamba)\n  - 15.1 [理论基础与模型架构](#theoretical-foundations-and-model-architecture)\n  - 15.2 [图像生成与视觉应用](#image-generation-and-visual-applications)\n  - 15.3 [视频处理与理解](#video-processing-and-understanding)\n  - 15.4 [医学图像处理](#medical-image-processing)\n- 16 [现有高质量资源](#existing-high-quality-resources)\n- 17 [高效训练](#train)\n  - 17.1 [基于并行的方法](#train_paral)\n    - 17.1.1 [数据并行(DP)](#train_paral_dp)\n    - 17.1.2 [模型并行(MP)](#train_paral_mp)\n    - 17.1.3 [流水线并行(PP)](#train_paral_pp)\n    - 17.1.4 [广义并行(GP)](#train_paral_gp)\n    - 17.1.5 [ZeRO并行(ZP)](#train_paral_zp)\n  - 17.2 [非并行方法](#train_non)\n    - 17.2.1 [减少激活内存](#train_non_reduce)\n    - 17.2.2 [CPU卸载](#train_non_cpu)\n    - 17.2.3 [内存高效的优化器](#train_non_mem)\n  - 17.3 [新型结构](#train_struct)\n- 18 [高效推理](#infer)\n  - 18.1 [减少采样步数](#infer_reduce)\n    - 18.1.1 [连续步数](#infer_reduce_continuous)\n    - 18.1.2 [快速采样](#infer_reduce_fast)\n    - 18.1.3 [步数蒸馏](#infer_reduce_dist)\n  - 18.2 [优化推理](#infer_opt)\n    - 18.2.1 [低比特量化](#infer_opt_low)\n    - 18.2.2 [并行\u002F稀疏推理](#infer_opt_ps)\n\n| \u003Ch3 id=\"diffusion-models\">01 Diffusion Models\u003C\u002Fh3> |  |\n| :------------- | :------------- |\n| **Paper** | **Link** |\n| 1) **Guided-Diffusion**: Diffusion Models Beat GANs on Image Synthesis | [**NeurIPS 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.05233), [GitHub](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fguided-diffusion)|\n| 2) **Latent Diffusion**: High-Resolution Image Synthesis with Latent Diffusion Models | [**CVPR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.10752), [GitHub](https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion) |\n| 3) **EDM**: Elucidating the Design Space of Diffusion-Based Generative Models | [**NeurIPS 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.00364), [GitHub](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fedm) |\n| 4) **DDPM**: Denoising Diffusion Probabilistic Models | [**NeurIPS 20 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.11239), [GitHub](https:\u002F\u002Fgithub.com\u002Fhojonathanho\u002Fdiffusion) |\n| 5) **DDIM**: Denoising Diffusion Implicit Models | [**ICLR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.02502), [GitHub](https:\u002F\u002Fgithub.com\u002Fermongroup\u002Fddim) |\n| 6) **Score-Based Diffusion**: Score-Based Generative Modeling through Stochastic Differential Equations | [**ICLR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.13456), [GitHub](https:\u002F\u002Fgithub.com\u002Fyang-song\u002Fscore_sde), 
[Blog](https:\u002F\u002Fyang-song.net\u002Fblog\u002F2021\u002Fscore) |\n| 7) **Stable Cascade**: Würstchen: An efficient architecture for large-scale text-to-image diffusion models | [**ICLR 24 Paper**](https:\u002F\u002Fopenreview.net\u002Fforum?id=gU58d5QeGv), [GitHub](https:\u002F\u002Fgithub.com\u002FStability-AI\u002FStableCascade), [Blog](https:\u002F\u002Fstability.ai\u002Fnews\u002Fintroducing-stable-cascade) |\n| 8) Diffusion Models in Vision: A Survey| [**TPAMI 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.13456), [GitHub](https:\u002F\u002Fgithub.com\u002FCroitoruAlin\u002FDiffusion-Models-in-Vision-A-Survey)|\n| 9) **Improved DDPM**: Improved Denoising Diffusion Probabilistic Models | [**ICML 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.09672), [Github](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fimproved-diffusion) |\n| 10) Classifier-free diffusion guidance | [**NIPS 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.12598) |\n| 11) **Glide**: Towards photorealistic image generation and editing with text-guided diffusion models | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.10741), [Github](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fglide-text2im) |\n| 12) **VQ-DDM**: Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation | [**CVPR 22 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2022\u002Fpapers\u002FHu_Global_Context_With_Discrete_Diffusion_in_Vector_Quantised_Modelling_for_CVPR_2022_paper.pdf), [Github](https:\u002F\u002Fgithub.com\u002Fanonymrelease\u002FVQ-DDM) |\n| 13) Diffusion Models for Medical Anomaly Detection | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.04306), [Github](https:\u002F\u002Fgithub.com\u002FJuliaWolleb\u002Fdiffusion-anomaly) |\n| 14) Generation of Anonymous Chest Radiographs Using Latent Diffusion Models for Training Thoracic Abnormality Classification Systems | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.01323) |\n| 15) **DiffusionDet**: Diffusion Model for Object Detection | [**ICCV 23 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fpapers\u002FChen_DiffusionDet_Diffusion_Model_for_Object_Detection_ICCV_2023_paper.pdf), [Github](https:\u002F\u002Fgithub.com\u002FShoufaChen\u002FDiffusionDet) |\n| 16) Label-efficient semantic segmentation with diffusion models | [**ICLR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.03126), [Github](https:\u002F\u002Fgithub.com\u002Fyandex-research\u002Fddpm-segmentation), [Project](https:\u002F\u002Fyandex-research.github.io\u002Fddpm-segmentation\u002F) |\n| \u003Ch3 id=\"diffusion-transformer\">02 Diffusion Transformer\u003C\u002Fh3> | |\n| **Paper** | **Link** |\n| 1) **UViT**: All are Worth Words: A ViT Backbone for Diffusion Models | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.12152), [GitHub](https:\u002F\u002Fgithub.com\u002Fbaofff\u002FU-ViT), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels?name=UVit&page=1) |\n| 2) **DiT**: Scalable Diffusion Models with Transformers | [**ICCV 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09748), [GitHub](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FDiT), [Project](https:\u002F\u002Fwww.wpeebles.com\u002FDiT), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels?name=Dit&page=1)|\n| 3) **SiT**: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | [**ArXiv 
23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.08740), [GitHub](https:\u002F\u002Fgithub.com\u002Fwillisma\u002FSiT), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FAI-ModelScope\u002FSiT-XL-2-256\u002Fsummary) |\n| 4) **FiT**: Flexible Vision Transformer for Diffusion Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12376), [GitHub](https:\u002F\u002Fgithub.com\u002Fwhlzy\u002FFiT) |\n| 5) **k-diffusion**: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.11605v1.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fcrowsonkb\u002Fk-diffusion) |\n| 6) **Large-DiT**: Large Diffusion Transformer | [GitHub](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLLaMA2-Accessory\u002Ftree\u002Fmain\u002FLarge-DiT) |\n| 7) **VisionLLaMA**: A Unified LLaMA Interface for Vision Tasks | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.00522), [GitHub](https:\u002F\u002Fgithub.com\u002FMeituan-AutoML\u002FVisionLLaMA) |\n| 8) **Stable Diffusion 3**: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis | [**Paper**](https:\u002F\u002Fstabilityai-public-packages.s3.us-west-2.amazonaws.com\u002FStable+Diffusion+3+Paper.pdf), [Blog](https:\u002F\u002Fstability.ai\u002Fnews\u002Fstable-diffusion-3-research-paper) |\n| 9) **PIXART-Σ**: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.04692.pdf), [Project](https:\u002F\u002Fpixart-alpha.github.io\u002FPixArt-sigma-project\u002F) |\n| 10) **PIXART-α**: Fast Training of Diffusion Transformer for Photorealistic Text-To-Image Synthesis | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00426.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FPixArt-alpha\u002FPixArt-alpha) [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Faojie1997\u002Fcv_PixArt-alpha_text-to-image\u002Fsummary)|\n| 11) **PIXART-δ**: Fast and Controllable Image Generation With Latent Consistency Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.05252.pdf), |\n| 12) **Lumina-T2X**: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.05945), [GitHub](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X) |\n| 13) **DDM**: Deconstructing Denoising Diffusion Models for Self-Supervised Learning | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.14404v1)|\n| 14) Autoregressive Image Generation without Vector Quantization | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.11838), [GitHub](https:\u002F\u002Fgithub.com\u002FLTH14\u002Fmar) |\n| 15) **Transfusion**: Predict the Next Token and Diffuse Images with One Multi-Modal Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.11039)|\n| 16) Scaling Diffusion Language Models via Adaptation from Autoregressive Models | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.17891)|\n| 17) Large Language Diffusion Models | [**ArXiv 25**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.09992)|\n| \u003Ch3 id=\"baseline-video-generation-models\">03 Baseline Video Generation Models\u003C\u002Fh3> | |\n| **Paper** | **Link** |\n| 1) **ViViT**: A Video Vision Transformer | [**ICCV 21 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.15691v2.pdf), 
[GitHub](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fscenic) |\n| 2) **VideoLDM**: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.08818) |\n| 3) **DiT**: Scalable Diffusion Models with Transformers | [**ICCV 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09748), [Github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FDiT), [Project](https:\u002F\u002Fwww.wpeebles.com\u002FDiT), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels?name=Dit&page=1) |\n| 4) **Text2Video-Zero**: Text-to-Image Diffusion Models are Zero-Shot Video Generators | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.13439), [GitHub](https:\u002F\u002Fgithub.com\u002FPicsart-AI-Research\u002FText2Video-Zero) |\n| 5) **Latte**: Latent Diffusion Transformer for Video Generation | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.03048v1.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FVchitect\u002FLatte), [Project](https:\u002F\u002Fmaxin-cn.github.io\u002Flatte_project\u002F), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FAI-ModelScope\u002FLatte\u002Fsummary)|\n| \u003Ch3 id=\"diffusion-unet\">04 Diffusion UNet\u003C\u002Fh3> |\n| **Paper** | **Link** |\n| 1) Taming Transformers for High-Resolution Image Synthesis | [**CVPR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2012.09841.pdf),[GitHub](https:\u002F\u002Fgithub.com\u002FCompVis\u002Ftaming-transformers) ,[Project](https:\u002F\u002Fcompvis.github.io\u002Ftaming-transformers\u002F)|\n| 2) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05135) [Github](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA) |\n| \u003Ch3 id=\"video-generation\">05 Video Generation\u003C\u002Fh3> | |\n| **Paper** | **Link** |\n| 1) **Animatediff**: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.04725), [GitHub](https:\u002F\u002Fgithub.com\u002Fguoyww\u002Fanimatediff\u002F), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels?name=Animatediff&page=1) |\n| 2) **I2VGen-XL**: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.04145), [GitHub](https:\u002F\u002Fgithub.com\u002Fali-vilab\u002Fi2vgen-xl),  [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Fi2vgen-xl\u002Fsummary) |\n| 3) **Imagen Video**: High Definition Video Generation with Diffusion Models | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.02303) |\n| 4) **MoCoGAN**: Decomposing Motion and Content for Video Generation | [**CVPR 18 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.04993) |\n| 5) Adversarial Video Generation on Complex Datasets | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1907.06571) |\n| 6) **W.A.L.T**: Photorealistic Video Generation with Diffusion Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06662), [Project](https:\u002F\u002Fwalt-video-diffusion.github.io\u002F) |\n| 7) **VideoGPT**: Video Generation using VQ-VAE and Transformers | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.10157), [GitHub](https:\u002F\u002Fgithub.com\u002Fwilson1yan\u002FVideoGPT) |\n| 8) Video Diffusion Models | [**ArXiv 
22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.03458), [GitHub](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fvideo-diffusion-pytorch), [Project](https:\u002F\u002Fvideo-diffusion.github.io\u002F) |\n| 9) **MCVD**: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | [**NeurIPS 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.09853), [GitHub](https:\u002F\u002Fgithub.com\u002Fvoletiv\u002Fmcvd-pytorch), [Project](https:\u002F\u002Fmask-cond-video-diffusion.github.io\u002F), [Blog](https:\u002F\u002Fajolicoeur.ca\u002F2022\u002F05\u002F22\u002Fmasked-conditional-video-diffusion\u002F) |\n| 10) **VideoPoet**: A Large Language Model for Zero-Shot Video Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14125), [Project](http:\u002F\u002Fsites.research.google\u002Fvideopoet\u002F), [Blog](https:\u002F\u002Fblog.research.google\u002F2023\u002F12\u002Fvideopoet-large-language-model-for-zero.html) |\n| 11) **MAGVIT**: Masked Generative Video Transformer | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.05199), [GitHub](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fmagvit), [Project](https:\u002F\u002Fmagvit.cs.cmu.edu\u002F), [Colab](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fmagvit\u002Fblob\u002Fmain) |\n| 12) **EMO**: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17485), [GitHub](https:\u002F\u002Fgithub.com\u002FHumanAIGC\u002FEMO), [Project](https:\u002F\u002Fhumanaigc.github.io\u002Femote-portrait-alive\u002F) |\n| 13) **SimDA**: Simple Diffusion Adapter for Efficient Video Generation | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.09710.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FChenHsing\u002FSimDA), [Project](https:\u002F\u002Fchenhsing.github.io\u002FSimDA\u002F) |\n| 14) **StableVideo**: Text-driven Consistency-aware Diffusion Video Editing | [**ICCV 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.09592), [GitHub](https:\u002F\u002Fgithub.com\u002Frese1f\u002FStableVideo), [Project](https:\u002F\u002Frese1f.github.io\u002FStableVideo\u002F) |\n| 15) **SVD**: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets| [**Paper**](https:\u002F\u002Fstatic1.squarespace.com\u002Fstatic\u002F6213c340453c3f502425776e\u002Ft\u002F655ce779b9d47d342a93c890\u002F1700587395994\u002Fstable_video_diffusion.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FStability-AI\u002Fgenerative-models)|\n| 16) **ADD**: Adversarial Diffusion Distillation| [**Paper**](https:\u002F\u002Fstatic1.squarespace.com\u002Fstatic\u002F6213c340453c3f502425776e\u002Ft\u002F65663480a92fba51d0e1023f\u002F1701197769659\u002Fadversarial_diffusion_distillation.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FStability-AI\u002Fgenerative-models) |\n| 17) **GenTron:** Diffusion Transformers for Image and Video Generation | [**CVPR 24 Paper**](http:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04557), [Project](https:\u002F\u002Fwww.shoufachen.com\u002Fgentron_website\u002F)|\n| 18) **LFDM**: Conditional Image-to-Video Generation with Latent Flow Diffusion Models | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.13744), [GitHub](https:\u002F\u002Fgithub.com\u002Fnihaomiao\u002FCVPR23_LFDM) |\n| 19) **MotionDirector**: Motion Customization of Text-to-Video Diffusion Models | [**ArXiv 
23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08465), [GitHub](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FMotionDirector) |\n| 20) **TGAN-ODE**: Latent Neural Differential Equations for Video Generation | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2011.03864v3.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FZasder3\u002FLatent-Neural-Differential-Equations-for-Video-Generation) |\n| 21) **VideoCrafter1**: Open Diffusion Models for High-Quality Video Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.19512), [GitHub](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FVideoCrafter) |\n| 22) **VideoCrafter2**: Overcoming Data Limitations for High-Quality Video Diffusion Models | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09047), [GitHub](https:\u002F\u002Fgithub.com\u002FAILab-CVC\u002FVideoCrafter) |\n| 23) **LVDM**: Latent Video Diffusion Models for High-Fidelity Long Video Generation | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.13221), [GitHub](https:\u002F\u002Fgithub.com\u002FYingqingHe\u002FLVDM) |\n| 24) **LaVie**: High-Quality Video Generation with Cascaded Latent Diffusion Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15103), [GitHub](https:\u002F\u002Fgithub.com\u002FVchitect\u002FLaVie), [Project](https:\u002F\u002Fvchitect.github.io\u002FLaVie-project\u002F) |\n| 25) **PYoCo**: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | [**ICCV 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10474), [Project](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fdir\u002Fpyoco\u002F)|\n| 26) **VideoFusion**: Decomposed Diffusion Models for High-Quality Video Generation | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.08320)|\n| 27) **Movie Gen**: A Cast of Media Foundation Models | [**Paper**](https:\u002F\u002Fai.meta.com\u002Fstatic-resource\u002Fmovie-gen-research-paper), [Project](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fmovie-gen\u002F)|\n| 28) Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model| [**ArXiv 25**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.10248), [Project](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStep-Video-T2V)|\n| \u003Ch3 id=\"dataset\">06 Dataset\u003C\u002Fh3> | |\n| \u003Ch4 id=\"dataset_paper\">6.1 Public Datasets\u003C\u002Fh4> | |\n| **Dataset Name - Paper** | **Link** |\n| 1) **Panda-70M** - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers\u003Cbr>\u003Csmall>`70M Clips, 720P, Downloadable`\u003C\u002Fsmall>|[**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.19479), [Github](https:\u002F\u002Fgithub.com\u002Fsnap-research\u002FPanda-70M), [Project](https:\u002F\u002Fsnap-research.github.io\u002FPanda-70M\u002F), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002FAI-ModelScope\u002Fpanda-70m\u002Fsummary)|\n| 2) **InternVid-10M** - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation\u003Cbr>\u003Csmall>`10M Clips, 720P, Downloadable`\u003C\u002Fsmall>|[**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06942), [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVideo\u002Ftree\u002Fmain\u002FData\u002FInternVid)|\n| 3) **CelebV-Text** - CelebV-Text: A Large-Scale Facial Text-Video Dataset\u003Cbr>\u003Csmall>`70K Clips, 720P, Downloadable`\u003C\u002Fsmall>|[**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.14717), 
[Github](https:\u002F\u002Fgithub.com\u002Fcelebv-text\u002FCelebV-Text), [Project](https:\u002F\u002Fcelebv-text.github.io\u002F)|\n| 4) **HD-VG-130M** - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation\u003Cbr>\u003Csmall> `130M Clips, 720P, Downloadable`\u003C\u002Fsmall>|[**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10874), [Github](https:\u002F\u002Fgithub.com\u002Fdaooshee\u002FHD-VG-130M), [Tool](https:\u002F\u002Fgithub.com\u002FBreakthrough\u002FPySceneDetect)|\n| 5) **HD-VILA-100M** - Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions\u003Cbr>\u003Csmall> `100M Clips, 720P, Downloadable`\u003C\u002Fsmall>|[**CVPR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.10337), [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FXPretrain\u002Fblob\u002Fmain\u002Fhd-vila-100m\u002FREADME.md)|\n| 6) **VideoCC** - Learning Audio-Video Modalities from Image Captions\u003Cbr>\u003Csmall>`10.3M Clips, 720P, Downloadable`\u003C\u002Fsmall>|[**ECCV 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.00679), [Github](https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002FvideoCC-data)|\n| 7) **YT-Temporal-180M** - MERLOT: Multimodal Neural Script Knowledge Models\u003Cbr>\u003Csmall>`180M Clips, 480P, Downloadable`\u003C\u002Fsmall>| [**NeurIPS 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.02636), [Github](https:\u002F\u002Fgithub.com\u002Frowanz\u002Fmerlot), [Project](https:\u002F\u002Frowanzellers.com\u002Fmerlot\u002F#data)|\n| 8) **HowTo100M** - HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips\u003Cbr>\u003Csmall>`136M Clips, 240P, Downloadable`\u003C\u002Fsmall>| [**ICCV 19 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.03327), [Github](https:\u002F\u002Fgithub.com\u002Fantoine77340\u002Fhowto100m), [Project](https:\u002F\u002Fwww.di.ens.fr\u002Fwillow\u002Fresearch\u002Fhowto100m\u002F)|\n| 9) **UCF101** - UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild\u003Cbr>\u003Csmall>`13K Clips, 240P, Downloadable`\u003C\u002Fsmall>| [**CVPR 12 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1212.0402), [Project](https:\u002F\u002Fwww.crcv.ucf.edu\u002Fdata\u002FUCF101.php)|\n| 10) **MSVD** - Collecting Highly Parallel Data for Paraphrase Evaluation\u003Cbr>\u003Csmall>`122K Clips, 240P, Downloadable`\u003C\u002Fsmall> | [**ACL 11 Paper**](https:\u002F\u002Faclanthology.org\u002FP11-1020.pdf), [Project](https:\u002F\u002Fwww.cs.utexas.edu\u002Fusers\u002Fml\u002Fclamp\u002FvideoDescription\u002F)|\n| 11) **Fashion-Text2Video** - A human video dataset with rich label and text annotations\u003Cbr>\u003Csmall>`600 Videos, 480P, Downloadable`\u003C\u002Fsmall> | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.08483.pdf), [Project](https:\u002F\u002Fyumingj.github.io\u002Fprojects\u002FText2Performer.html) |\n| 12) **LAION-5B** - A dataset of 5,85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M\u003Cbr>\u003Csmall>`5B Clips, Downloadable`\u003C\u002Fsmall> | [**NeurIPS 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.08402), [Project](https:\u002F\u002Flaion.ai\u002Fblog\u002Flaion-5b\u002F)|\n| 13) **ActivityNet Captions** -  ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time\u003Cbr>\u003Csmall>`20k videos, Downloadable`\u003C\u002Fsmall> | [**Arxiv 17 
Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.00754), [Project](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Franjaykrishna\u002Fdensevid\u002F)|\n| 14) **MSR-VTT** -  A large-scale video benchmark for video understanding\u003Cbr>\u003Csmall>`10k Clips, Downloadable`\u003C\u002Fsmall> | [**CVPR 16 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7780940), [Project](https:\u002F\u002Fcove.thecvf.com\u002Fdatasets\u002F839)|\n| 15) **The Cityscapes Dataset** -  Benchmark suite and evaluation server for pixel-level, instance-level, and panoptic semantic labeling\u003Cbr>\u003Csmall>`Downloadable`\u003C\u002Fsmall> | [**Arxiv 16 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1608.02192v1.pdf), [Project](https:\u002F\u002Fwww.cityscapes-dataset.com\u002F)|\n| 16) **Youku-mPLUG** -  First open-source large-scale Chinese video text dataset\u003Cbr>\u003Csmall>`Downloadable`\u003C\u002Fsmall> | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.04362), [Project](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FYouku-mPLUG), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002Fmodelscope\u002FYouku-AliceMind\u002Fsummary) |\n| 17) **VidProM** - VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models\u003Cbr>\u003Csmall>`6.69M, Downloadable`\u003C\u002Fsmall>| [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.06098), [Github](https:\u002F\u002Fgithub.com\u002FWangWenhao0716\u002FVidProM) |\n| 18) **Pixabay100** - A video dataset collected from Pixabay\u003Cbr>\u003Csmall>`Downloadable`\u003C\u002Fsmall>| [Github](https:\u002F\u002Fgithub.com\u002FECNU-CILAB\u002FPixabay100\u002F) |\n| 19) **WebVid** -  Large-scale text-video dataset, containing 10 million video-text pairs scraped from the stock footage sites\u003Cbr>\u003Csmall>`Long Durations and Structured Captions`\u003C\u002Fsmall> | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.00650), [Project](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fresearch\u002Ffrozen-in-time\u002F) , [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002FAI-ModelScope\u002Fwebvid-10M\u002Fsummary)|\n| 20) **MiraData(Mini-Sora Data)**: A Large-Scale Video Dataset with Long Durations and Structured Captions\u003Cbr>\u003Csmall>`10M video-text pairs`\u003C\u002Fsmall> | [Github](https:\u002F\u002Fgithub.com\u002Fmira-space\u002FMiraData), [Project](https:\u002F\u002Fmira-space.github.io\u002F) |\n| 21) **IDForge**: A video dataset featuring scenes of people speaking.\u003Cbr>\u003Csmall>`300k Clips, Downloadable`\u003C\u002Fsmall> | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.11764), [Github](https:\u002F\u002Fgithub.com\u002Fxyyandxyy\u002FIDForge)  |\n| \u003Ch4 id=\"video_aug\">6.2 Video Augmentation Methods\u003C\u002Fh4> |  |\n| \u003Ch5 id=\"video_aug_basic\">6.2.1 Basic Transformations\u003C\u002Fh5> | |\n| Three-stream CNNs for action recognition | [**PRL 17 Paper**](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS0167865517301071) |\n| Dynamic Hand Gesture Recognition Using Multi-direction 3D Convolutional Neural Networks | [**EL 19 Paper**](http:\u002F\u002Fwww.engineeringletters.com\u002Fissues_v27\u002Fissue_3\u002FEL_27_3_12.pdf)|\n| Intra-clip Aggregation for Video Person Re-identification | [**ICIP 20 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1905.01722)|\n| VideoMix: Rethinking Data Augmentation for Video Classification | [**CVPR 20 
Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.03457) |\n| mixup: Beyond Empirical Risk Minimization | [**ICLR 17 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.09412) |\n| CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features | [**ICCV 19 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_ICCV_2019\u002Fhtml\u002FYun_CutMix_Regularization_Strategy_to_Train_Strong_Classifiers_With_Localizable_Features_ICCV_2019_paper.html) |\n| Video Salient Object Detection via Fully Convolutional Networks | [**ICIP 18 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8047320) |\n| Illumination-Based Data Augmentation for Robust Background Subtraction | [**SKIMA 19 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8982527) |\n| Image editing-based data augmentation for illumination-insensitive background subtraction | [**EIM 20 Paper**](https:\u002F\u002Fwww.emerald.com\u002Finsight\u002Fcontent\u002Fdoi\u002F10.1108\u002FJEIM-02-2020-0042\u002Ffull\u002Fhtml) |\n| \u003Ch5 id=\"video_aug_feature\">6.2.2 Feature Space\u003C\u002Fh5> | |\n| Feature Re-Learning with Data Augmentation for Content-based Video Recommendation | [**ACM 18 Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3240508.3266441) |\n| GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | [**Trans 21 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9147027) |\n| \u003Ch5 id=\"video_aug_gan\">6.2.3 GAN-based Augmentation\u003C\u002Fh5> | |\n| Deep Video-Based Performance Cloning | [**CVPR 18 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1808.06847) |\n| Adversarial Action Data Augmentation for Similar Gesture Action Recognition | [**IJCNN 19 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8851993) |\n| Self-Paced Video Data Augmentation by Generative Adversarial Networks with Insufficient Samples | [**MM 20 Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3394171.3414003) |\n| GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | [**Trans 20 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9147027) |\n| Dynamic Facial Expression Generation on Hilbert Hypersphere With Conditional Wasserstein Generative Adversarial Nets | [**TPAMI 20 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9117185) |\n| CrowdGAN: Identity-Free Interactive Crowd Video Generation and Beyond | [**TPAMI 22 Paper**](https:\u002F\u002Fwww.computer.org\u002Fcsdl\u002Fjournal\u002Ftp\u002F5555\u002F01\u002F09286483\u002F1por0TYwZvG) |\n| \u003Ch5 id=\"video_aug_ed\">6.2.4 Encoder\u002FDecoder Based\u003C\u002Fh5> | |\n| Rotationally-Temporally Consistent Novel View Synthesis of Human Performance Video | [**ECCV 20 Paper**](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-030-58548-8_23) |\n| Autoencoder-based Data Augmentation for Deepfake Detection | [**ACM 23 Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3592572.3592840) |\n| \u003Ch5 id=\"video_aug_simulation\">6.2.5 Simulation\u003C\u002Fh5> | |\n| A data augmentation methodology for training machine\u002Fdeep learning gait recognition algorithms | [**CVPR 16 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.07570) |\n| ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare 
Applications | [**IEEE 21 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9324837) |\n| Mid-Air: A Multi-Modal Dataset for Extremely Low Altitude Drone Flights | [**CVPRW 19 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_CVPRW_2019\u002Fhtml\u002FUAVision\u002FFonder_Mid-Air_A_Multi-Modal_Dataset_for_Extremely_Low_Altitude_Drone_Flights_CVPRW_2019_paper.html) |\n| Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models | [**IJCV 19 Paper**](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs11263-019-01222-z) |\n| Using synthetic data for person tracking under adverse weather conditions | [**IVC 21 Paper**](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS0262885621000925) |\n| Unlimited Road-scene Synthetic Annotation (URSA) Dataset | [**ITSC 18 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8569519) |\n| SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data | [**CVPR 21 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2021\u002Fhtml\u002FHu_SAIL-VOS_3D_A_Synthetic_Dataset_and_Baselines_for_Object_Detection_CVPR_2021_paper.html) |\n| Universal Semantic Segmentation for Fisheye Urban Driving Images | [**SMC 20 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9283099) |\n| \u003Ch3 id=\"patchifying-methods\">07 Patchifying Methods\u003C\u002Fh3> | |\n| **Paper** | **Link** |\n| 1) **ViT**: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | [**ICLR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929), [Github](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fvision_transformer) |\n| 2) **MAE**: Masked Autoencoders Are Scalable Vision Learners| [**CVPR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.06377), [Github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmae) |\n| 3) **ViViT**: A Video Vision Transformer (-)| [**ICCV 21 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.15691v2.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fscenic) |\n| 4) **DiT**: Scalable Diffusion Models with Transformers (-) | [**ICCV 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09748), [GitHub](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FDiT), [Project](https:\u002F\u002Fwww.wpeebles.com\u002FDiT), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels?name=Dit&page=1)|\n| 5) **U-ViT**: All are Worth Words: A ViT Backbone for Diffusion Models (-) | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.12152), [GitHub](https:\u002F\u002Fgithub.com\u002Fbaofff\u002FU-ViT), [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels?name=UVit&page=1) |\n| 6) **FlexiViT**: One Model for All Patch Sizes | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08013.pdf), [Github](https:\u002F\u002Fgithub.com\u002Fbwconrad\u002Fflexivit.git) |\n| 7) **Patch n’ Pack**: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06304), [Github](https:\u002F\u002Fgithub.com\u002Fkyegomez\u002FNaViT) |\n| 8) **VQ-VAE**: Neural Discrete Representation Learning | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1711.00937), [Github](https:\u002F\u002Fgithub.com\u002FMishaLaskin\u002Fvqvae) |\n| 9) **VQ-GAN**: Taming Transformers for High-Resolution Image Synthesis 
| [**CVPR 21 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2021\u002Fhtml\u002FEsser_Taming_Transformers_for_High-Resolution_Image_Synthesis_CVPR_2021_paper.html), [Github](https:\u002F\u002Fgithub.com\u002FCompVis\u002Ftaming-transformers) |\n| 10) **LVT**: Latent Video Transformer | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.10704), [Github](https:\u002F\u002Fgithub.com\u002Frakhimovv\u002Flvt) |\n| 11) **VideoGPT**: Video Generation using VQ-VAE and Transformers (-) | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.10157), [GitHub](https:\u002F\u002Fgithub.com\u002Fwilson1yan\u002FVideoGPT) |\n| 12) Predicting Video with VQVAE | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.01950) |\n| 13) **CogVideo**: Large-scale Pretraining for Text-to-Video Generation via Transformers | [**ICLR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15868.pdf), [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVideo.git) |\n| 14) **TATS**: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | [**ECCV 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.03638), [Github](https:\u002F\u002Fbnucsy.github.io\u002FTATS\u002F) |\n| 15) **MAGVIT**: Masked Generative Video Transformer (-) | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.05199), [GitHub](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fmagvit), [Project](https:\u002F\u002Fmagvit.cs.cmu.edu\u002F), [Colab](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fmagvit\u002Fblob\u002Fmain) |\n| 16) **MagViT2**: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.05737.pdf), [Github](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fmagvit2-pytorch) |\n| 17) **VideoPoet**: A Large Language Model for Zero-Shot Video Generation (-) | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14125), [Project](http:\u002F\u002Fsites.research.google\u002Fvideopoet\u002F), [Blog](https:\u002F\u002Fblog.research.google\u002F2023\u002F12\u002Fvideopoet-large-language-model-for-zero.html) |\n| 18) **CLIP**: Learning Transferable Visual Models From Natural Language Supervision | [**ICML 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.00020), [Github](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP) |\n| 19) **BLIP**: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.12086), [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FBLIP) |\n| 20) **BLIP-2**: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.12597), [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FLAVIS\u002Ftree\u002Fmain\u002Fprojects\u002Fblip2) |\n| \u003Ch3 id=\"long-context\">08 Long-context\u003C\u002Fh3> | |\n| **Paper** | **Link** |\n| 1) World Model on Million-Length Video And Language With RingAttention | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.08268), [GitHub](https:\u002F\u002Fgithub.com\u002FLargeWorldModel\u002FLWM) |\n| 2) Ring Attention with Blockwise Transformers for Near-Infinite Context | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01889), [GitHub](https:\u002F\u002Fgithub.com\u002Flhao499\u002FRingAttention) |\n| 3) Extending LLMs' Context Window with 100 
Samples | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.07004), [GitHub](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FEntropy-ABF) |\n| 4) Efficient Streaming Language Models with Attention Sinks | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17453), [GitHub](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fstreaming-llm) |\n| 5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.07872) |\n| 6) **MovieChat**: From Dense Token to Sparse Memory for Long Video Understanding | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16449), [GitHub](https:\u002F\u002Fgithub.com\u002Frese1f\u002FMovieChat), [Project](https:\u002F\u002Frese1f.github.io\u002FMovieChat\u002F) |\n| 7) **MemoryBank**: Enhancing Large Language Models with Long-Term Memory | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.10250.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fzhongwanjun\u002FMemoryBank-SiliconFriend) |\n| \u003Ch3 id=\"audio-related-resource\">09 Audio Related Resource\u003C\u002Fh3> | |\n| **Paper**  | **Link** |\n| 1) **Stable Audio**: Fast Timing-Conditioned Latent Audio Diffusion | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.04825), [Github](https:\u002F\u002Fgithub.com\u002FStability-AI\u002Fstable-audio-tools), [Blog](https:\u002F\u002Fstability.ai\u002Fresearch\u002Fstable-audio-efficient-timing-latent-diffusion) |\n| 2) **MM-Diffusion**: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | [**CVPR 23 Paper**](http:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FRuan_MM-Diffusion_Learning_Multi-Modal_Diffusion_Models_for_Joint_Audio_and_Video_CVPR_2023_paper.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fresearchmm\u002FMM-Diffusion) |\n| 3) **Pengi**: An Audio Language Model for Audio Tasks | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002F3a2e5889b4bbef997ddb13b55d5acf77-Paper-Conference.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FPengi) |\n| 4) **Vast:** A vision-audio-subtitle-text omni-modality foundation model and dataset | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002Fe6b2b48b5ed90d07c305932729927781-Paper-Conference.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FTXH-mercury\u002FVAST) |\n| 5) **Macaw-LLM**: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.09093), [GitHub](https:\u002F\u002Fgithub.com\u002Flyuchenyang\u002FMacaw-LLM) |\n| 6) **NaturalSpeech**: End-to-End Text to Speech Synthesis with Human-Level Quality | [**TPAMI 24 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.04421v2.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fheatz123\u002Fnaturalspeech) |\n| 7) **NaturalSpeech 2**: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.09116), [GitHub](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fnaturalspeech2-pytorch) |\n| 8) **UniAudio**: An Audio Foundation Model Toward Universal Audio Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00704), [GitHub](https:\u002F\u002Fgithub.com\u002Funiaudio666\u002FUniAudio) |\n| 9) **Diffsound**: 
Discrete Diffusion Model for Text-to-sound Generation | [**TASLP 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.09983) |\n| 10) **AudioGen**: Textually Guided Audio Generation| [**ICLR 23 Paper**](https:\u002F\u002Ficlr.cc\u002Fvirtual\u002F2023\u002Fposter\u002F11521), [Project](https:\u002F\u002Ffelixkreuk.github.io\u002Faudiogen\u002F) |\n| 11) **AudioLDM**: Text-to-audio generation with latent diffusion models | [**ICML 23 Paper**](https:\u002F\u002Fproceedings.mlr.press\u002Fv202\u002Fliu23f\u002Fliu23f.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fhaoheliu\u002FAudioLDM), [Project](https:\u002F\u002Faudioldm.github.io\u002F), [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fhaoheliu\u002Faudioldm-text-to-audio-generation) |\n| 12) **AudioLDM2**: Learning Holistic Audio Generation with Self-supervised Pretraining | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.05734), [GitHub](https:\u002F\u002Fgithub.com\u002Fhaoheliu\u002Faudioldm2), [Project](https:\u002F\u002Faudioldm.github.io\u002Faudioldm2\u002F), [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fhaoheliu\u002Faudioldm2-text2audio-text2music) |\n| 13) **Make-An-Audio**: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | [**ICML 23 Paper**](https:\u002F\u002Fproceedings.mlr.press\u002Fv202\u002Fhuang23i\u002Fhuang23i.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002FText-to-Audio\u002FMake-An-Audio) |\n| 14) **Make-An-Audio 2**: Temporal-Enhanced Text-to-Audio Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18474) |\n| 15) **TANGO**: Text-to-audio generation using instruction-tuned LLM and latent diffusion model | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.13731), [GitHub](https:\u002F\u002Fgithub.com\u002Fdeclare-lab\u002Ftango), [Project](https:\u002F\u002Freplicate.com\u002Fdeclare-lab\u002Ftango), [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fdeclare-lab\u002Ftango) |\n| 16) **AudioLM**: a Language Modeling Approach to Audio Generation | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.03143) |\n| 17) **AudioGPT**: Understanding and Generating Speech, Music, Sound, and Talking Head | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.12995), [GitHub](https:\u002F\u002Fgithub.com\u002FAIGC-Audio\u002FAudioGPT) |\n| 18) **MusicGen**: Simple and Controllable Music Generation | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002F94b472a1842cd7c56dcb125fb2765fbd-Paper-Conference.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Faudiocraft) |\n| 19) **LauraGPT**: Listen, Attend, Understand, and Regenerate Audio with GPT | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.04673v3) |\n| 20) **Seeing and Hearing**: Open-domain Visual-Audio Generation with Diffusion Latent Aligners | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17723) |\n| 21) **Video-LLaMA**: An Instruction-tuned Audio-Visual Language Model for Video Understanding | [**EMNLP 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.02858) |\n| 22) Audio-Visual LLM for Video Understanding | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06720) |\n| 23) **VideoPoet**: A Large Language Model for Zero-Shot Video Generation (-) | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14125), [Project](http:\u002F\u002Fsites.research.google\u002Fvideopoet\u002F), 
[Blog](https:\u002F\u002Fblog.research.google\u002F2023\u002F12\u002Fvideopoet-large-language-model-for-zero.html) |\n| 24) **Movie Gen**: A Cast of Media Foundation Models | [**Paper**](https:\u002F\u002Fai.meta.com\u002Fstatic-resource\u002Fmovie-gen-research-paper), [Project](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fmovie-gen\u002F)|\n| \u003Ch3 id=\"consistency\">10 Consistency\u003C\u002Fh3> | |\n| **Paper**  | **Link** |\n| 1) Consistency Models | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.01469.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fconsistency_models) |\n| 2) Improved Techniques for Training Consistency Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.14189) |\n| 3) **Score-Based Diffusion**: Score-Based Generative Modeling through Stochastic Differential Equations (-) | [**ICLR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.13456), [GitHub](https:\u002F\u002Fgithub.com\u002Fyang-song\u002Fscore_sde), [Blog](https:\u002F\u002Fyang-song.net\u002Fblog\u002F2021\u002Fscore) |\n| 4) Improved Techniques for Training Score-Based Generative Models | [**NIPS 20 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2020\u002Fhash\u002F92c3b916311a5517d9290576e3ea37ad-Abstract.html), [GitHub](https:\u002F\u002Fgithub.com\u002Fermongroup\u002Fncsnv2) |\n| 4) Generative Modeling by Estimating Gradients of the Data Distribution | [**NIPS 19 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2019\u002Fhash\u002F3001ef257407d5a371a96dcd947c7d93-Abstract.html), [GitHub](https:\u002F\u002Fgithub.com\u002Fermongroup\u002Fncsn) |\n| 5) Maximum Likelihood Training of Score-Based Diffusion Models | [**NIPS 21 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2021\u002Fhash\u002F0a9fdbb17feb6ccb7ec405cfb85222c4-Abstract.html), [GitHub](https:\u002F\u002Fgithub.com\u002Fyang-song\u002Fscore_flow) |\n| 6) Layered Neural Atlases for Consistent Video Editing | [**TOG 21 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.11418.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fykasten\u002Flayered-neural-atlases), [Project](https:\u002F\u002Flayered-neural-atlases.github.io\u002F) |\n| 7) **StableVideo**: Text-driven Consistency-aware Diffusion Video Editing | [**ICCV 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.09592), [GitHub](https:\u002F\u002Fgithub.com\u002Frese1f\u002FStableVideo), [Project](https:\u002F\u002Frese1f.github.io\u002FStableVideo\u002F) |\n| 8) **CoDeF**: Content Deformation Fields for Temporally Consistent Video Processing | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.07926.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fqiuyu96\u002FCoDeF), [Project](https:\u002F\u002Fqiuyu96.github.io\u002FCoDeF\u002F) |\n| 9) Sora Generates Videos with Stunning Geometrical Consistency | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.17403.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fmeteorshowers\u002FSora-Generates-Videos-with-Stunning-Geometrical-Consistency), [Project](https:\u002F\u002Fsora-geometrical-consistency.github.io\u002F) |\n| 10) Efficient One-stage Video Object Detection by Exploiting Temporal Consistency | [**ECCV 22 Paper**](https:\u002F\u002Fwww.ecva.net\u002Fpapers\u002Feccv_2022\u002Fpapers_ECCV\u002Fpapers\u002F136950001.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fguanxiongsun\u002FEOVOD) |\n| 11) Bootstrap Motion Forecasting With Self-Consistent Constraints | [**ICCV 23 
Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10377383) |\n| 12) Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting | [**Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fbook\u002F10.5555\u002FAAI28845594) |\n| 13) Enhancing Multi-Camera People Tracking with Anchor-Guided Clustering and Spatio-Temporal Consistency ID Re-Assignment | [**CVPRW 23 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10208943), [GitHub](https:\u002F\u002Fgithub.com\u002Fipl-uw\u002FAIC23_Track1_UWIPL_ETRI\u002Ftree\u002Fmain) |\n| 14) Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2109.02281) |\n| 15) Semi-Supervised Crowd Counting With Spatial Temporal Consistency and Pseudo-Label Filter | [**TCSVT 23 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10032602) |\n| 16) Spatio-temporal Consistency and Hierarchical Matching for Multi-Target Multi-Camera Vehicle Tracking | [**CVPRW 19 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_CVPRW_2019\u002Fhtml\u002FAI_City\u002FLi_Spatio-temporal_Consistency_and_Hierarchical_Matching_for_Multi-Target_Multi-Camera_Vehicle_Tracking_CVPRW_2019_paper.html) |\n| 17) **VideoDirectorGPT**: Consistent Multi-scene Video Generation via LLM-Guided Planning (-) | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15091) |\n| 18) **VideoDrafter**: Content-Consistent Multi-Scene Video Generation with LLM (-) | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.01256) |\n| 19) **MaskDiffusion**: Boosting Text-to-Image Consistency with Conditional Mask| [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04399) |\n| \u003Ch3 id=\"prompt-engineering\">11 Prompt Engineering\u003C\u002Fh3> | |\n| **Paper**  | **Link** |\n| 1) **RealCompo**: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12908), [GitHub](https:\u002F\u002Fgithub.com\u002FYangLing0818\u002FRealCompo), [Project](https:\u002F\u002Fcominclip.github.io\u002FRealCompo_Page\u002F) |\n| 2) **Mastering Text-to-Image Diffusion**: Recaptioning, Planning, and Generating with Multimodal LLMs | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.11708), [GitHub](https:\u002F\u002Fgithub.com\u002FYangLing0818\u002FRPG-DiffusionMaster) |\n| 3) **LLM-grounded Diffusion**: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models | [**TMLR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13655), [GitHub](https:\u002F\u002Fgithub.com\u002FTonyLianLong\u002FLLM-groundedDiffusion) |\n| 4) **LLM BLUEPRINT**: ENABLING TEXT-TO-IMAGE GEN-ERATION WITH COMPLEX AND DETAILED PROMPTS | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.10640), [GitHub](https:\u002F\u002Fgithub.com\u002Fhananshafi\u002Fllmblueprint) |\n| 5) Progressive Text-to-Image Diffusion with Soft Latent Direction | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.09466) |\n| 6) Self-correcting LLM-controlled Diffusion Models | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16090), [GitHub](https:\u002F\u002Fgithub.com\u002Ftsunghan-wu\u002FSLD) |\n| 7) **LayoutLLM-T2I**: Eliciting Layout Guidance from LLM for Text-to-Image Generation | [**MM 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.05095) |\n| 8) **LayoutGPT**: Compositional Visual Planning and 
Generation with Large Language Models | [**NeurIPS 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.15393), [GitHub](https:\u002F\u002Fgithub.com\u002Fweixi-feng\u002FLayoutGPT) |\n| 9) **Gen4Gen**: Generative Data Pipeline for Generative Multi-Concept Composition | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.15504), [GitHub](https:\u002F\u002Fgithub.com\u002FlouisYen\u002FGen4Gen) |\n| 10) **InstructEdit**: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18047), [GitHub](https:\u002F\u002Fgithub.com\u002FQianWangX\u002FInstructEdit) |\n| 11) Controllable Text-to-Image Generation with GPT-4 | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18583) |\n| 12) LLM-grounded Video Diffusion Models | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17444) |\n| 13) **VideoDirectorGPT**: Consistent Multi-scene Video Generation via LLM-Guided Planning | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15091) |\n| 14) **FlowZero**: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15813), [Github](https:\u002F\u002Fgithub.com\u002Faniki-ly\u002FFlowZero), [Project](https:\u002F\u002Fflowzero-video.github.io\u002F) |\n| 15) **VideoDrafter**: Content-Consistent Multi-Scene Video Generation with LLM | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.01256) |\n| 16) **Free-Bloom**: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator | [**NeurIPS 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.14494) |\n| 17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.13812) |\n| 18) **MotionZero**: Exploiting Motion Priors for Zero-shot Text-to-Video Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16635) |\n| 19) **GPT4Motion**: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.12631) |\n| 20) Multimodal Procedural Planning via Dual Text-Image Prompting | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.01795), [Github](https:\u002F\u002Fgithub.com\u002FYujieLu10\u002FTIP) |\n| 21) **InstructCV**: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists | [**ICLR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00390), [Github](https:\u002F\u002Fgithub.com\u002FAlaaLab\u002FInstructCV) |\n| 22) **DreamSync**: Aligning Text-to-Image Generation with Image Understanding Feedback | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17946) |\n| 23) **TaleCrafter**: Interactive Story Visualization with Multiple Characters | [**SIGGRAPH Asia 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00390) |\n| 24) **Reason out Your Layout**: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17126), [Github](https:\u002F\u002Fgithub.com\u002FXiaohui9607\u002FLLM_layout_generator) |\n| 25) **COLE**: A Hierarchical Generation Framework for Graphic Design | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16974) |\n| 26) Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision | [**ArXiv 
23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08056) |\n| 27) **Vlogger**: Make Your Dream A Vlog | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09414), [Github](https:\u002F\u002Fgithub.com\u002FVchitect\u002FVlogger) |\n| 28) **GALA3D**: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting | [**Paper**](https:\u002F\u002Fgithub.com\u002FVDIGPKU\u002FGALA3D) |\n| 29) **MuLan**: Multimodal-LLM Agent for Progressive Multi-Object Diffusion | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12741) |\n| \u003Ch4 id=\"theoretical-foundations-and-model-architecture\">Recaption\u003C\u002Fh4> |  |\n| **Paper** | **Link** |\n| 1) **LAVIE**: High-Quality Video Generation with Cascaded Latent Diffusion Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15103), [GitHub](https:\u002F\u002Fgithub.com\u002FVchitect\u002FLaVie) |\n| 2) **Reuse and Diffuse**: Iterative Denoising for Text-to-Video Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.03549), [GitHub](https:\u002F\u002Fgithub.com\u002Fanonymous0x233\u002FReuseAndDiffuse) |\n| 3) **CoCa**: Contrastive Captioners are Image-Text Foundation Models | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.01917), [Github](https:\u002F\u002Fgithub.com\u002Flucidrains\u002FCoCa-pytorch) |\n| 4) **CogView3**: Finer and Faster Text-to-Image Generation via Relay Diffusion | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05121) |\n| 5) **VideoChat**: Chat-Centric Video Understanding | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06355), [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything) |\n| 6) De-Diffusion Makes Text a Strong Cross-Modal Interface | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00618) |\n| 7) **HowToCaption**: Prompting LLMs to Transform Video Annotations at Scale | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.04900) |\n| 8) **SELMA**: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.06952) |\n| 9) **LLMGA**: Multimodal Large Language Model based Generation Assistant | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.16500), [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLMGA) |\n| 10) **ELLA**: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05135), [Github](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA) |\n| 11) **MyVLM**: Personalizing VLMs for User-Specific Queries | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.14599.pdf) |\n| 12) A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.16656), [Github](https:\u002F\u002Fgithub.com\u002Fgirliemac\u002Fa-picture-is-worth-a-1000-words) |\n| 13) **Mastering Text-to-Image Diffusion**: Recaptioning, Planning, and Generating with Multimodal LLMs(-) | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2401.11708v2), [Github](https:\u002F\u002Fgithub.com\u002FYangLing0818\u002FRPG-DiffusionMaster) |\n| 14) **FlexCap**: Generating Rich, Localized, and Flexible Captions in Images | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.12026) |\n| 15) **Video ReCap**: Recursive Captioning of Hour-Long Videos | [**ArXiv 
24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.13250.pdf), [Github](https:\u002F\u002Fgithub.com\u002Fmd-mohaiminul\u002FVideoRecap) |\n| 16) **BLIP**: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | [**ICML 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.12086), [Github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FBLIP) |\n| 17) **PromptCap**: Prompt-Guided Task-Aware Image Captioning | [**ICCV 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.09699), [Github](https:\u002F\u002Fgithub.com\u002FYushi-Hu\u002FPromptCap) |\n| 18) **CIC**: A framework for Culturally-aware Image Captioning | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.05374) |\n| 19) Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.11593) |\n| 20) **FuseCap**: Leveraging Large Language Models for Enriched Fused Image Captions | [**WACV 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17718), [Github](https:\u002F\u002Fgithub.com\u002FRotsteinNoam\u002FFuseCap) |\n| \u003Ch3 id=\"security\">12 Security\u003C\u002Fh3> |  |\n| **Paper**  | **Link** |\n| 1) **BeaverTails:** Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002F4dbb61cb68671edc4ca3712d70083b9f-Paper-Datasets_and_Benchmarks.pdf), [Github](https:\u002F\u002Fgithub.com\u002FPKU-Alignment\u002Fbeavertails) |\n| 2) **LIMA:** Less Is More for Alignment | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002Fac662d74829e4407ce1d126477f4a03a-Paper-Conference.pdf) |\n| 3) **Jailbroken:** How Does LLM Safety Training Fail? | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002Ffd6613131889a4b656206c50a8bd7790-Paper-Conference.pdf) |\n| 4) **Safe Latent Diffusion:** Mitigating Inappropriate Degeneration in Diffusion Models | [**CVPR 23 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FSchramowski_Safe_Latent_Diffusion_Mitigating_Inappropriate_Degeneration_in_Diffusion_Models_CVPR_2023_paper.pdf) |\n| 5) **Stable Bias:** Evaluating Societal Representations in Diffusion Models | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002Fb01153e7112b347d8ed54f317840d8af-Paper-Datasets_and_Benchmarks.pdf) |\n| 6) Ablating concepts in text-to-image  diffusion models | **[ICCV 23 Paper](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fpapers\u002FKumari_Ablating_Concepts_in_Text-to-Image_Diffusion_Models_ICCV_2023_paper.pdf)** |\n| 7) Diffusion art or digital forgery?  
investigating data replication in diffusion models | [**ICCV 23 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FSomepalli_Diffusion_Art_or_Digital_Forgery_Investigating_Data_Replication_in_Diffusion_CVPR_2023_paper.pdf), [Project](https:\u002F\u002Fsomepago.github.io\u002Fdiffrep.html) |\n| 8) Eternal Sunshine of the Spotless Net:  Selective Forgetting in Deep Networks | **[ICCV 20 Paper](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_CVPR_2020\u002Fpapers\u002FGolatkar_Eternal_Sunshine_of_the_Spotless_Net_Selective_Forgetting_in_Deep_CVPR_2020_paper.pdf)** |\n| 9) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks | [**ICML 20 Paper**](http:\u002F\u002Fproceedings.mlr.press\u002Fv119\u002Fcroce20b\u002Fcroce20b.pdf) |\n| 10) A pilot study of query-free adversarial  attack against stable diffusion | **[ICCV 23 Paper](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023W\u002FAML\u002Fpapers\u002FZhuang_A_Pilot_Study_of_Query-Free_Adversarial_Attack_Against_Stable_Diffusion_CVPRW_2023_paper.pdf)** |\n| 11) Interpretable-Through-Prototypes Deepfake Detection for Diffusion Models | **[ICCV 23 Paper](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023W\u002FDFAD\u002Fpapers\u002FAghasanli_Interpretable-Through-Prototypes_Deepfake_Detection_for_Diffusion_Models_ICCVW_2023_paper.pdf)** |\n| 12) Erasing Concepts from Diffusion Models                   | **[ICCV 23 Paper](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fpapers\u002FGandikota_Erasing_Concepts_from_Diffusion_Models_ICCV_2023_paper.pdf)**, [Project](http:\u002F\u002Ferasing.baulab.info\u002F) |\n| 13) Ablating Concepts in Text-to-Image Diffusion Models      | **[ICCV 23 Paper](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fpapers\u002FKumari_Ablating_Concepts_in_Text-to-Image_Diffusion_Models_ICCV_2023_paper.pdf)**, [Project](https:\u002F\u002Fwww.cs.cmu.edu\u002F) |\n| 14) **BEAVERTAILS:** Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | **[NeurIPS 23 Paper](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002F4dbb61cb68671edc4ca3712d70083b9f-Paper-Datasets_and_Benchmarks.pdf)**, [Project](https:\u002F\u002Fsites.google.com\u002Fview\u002Fpku-beavertails) |\n| 15) **Stable Bias:** Evaluating Societal Representations in Diffusion Models | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002Fb01153e7112b347d8ed54f317840d8af-Paper-Datasets_and_Benchmarks.pdf) |\n| 16) Threat Model-Agnostic Adversarial Defense using Diffusion Models | **[Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.08089)**                |\n| 17) How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions? 
| [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.15230), [Github](https:\u002F\u002Fgithub.com\u002FHritikbansal\u002Fentigen_emnlp) |\n| 18) Differentially Private Diffusion Models Generate Useful Synthetic Images | **[Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.13861)**                |\n| 19) Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models | **[SIGSAC 23 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13873)**, [Github](https:\u002F\u002Fgithub.com\u002FYitingQu\u002Funsafe-diffusion) |\n| 20) Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models | **[Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.17591)**, [Github](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FForget-Me-Not) |\n| 21) Unified Concept Editing in Diffusion Models              | [**WACV 24 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FWACV2024\u002Fpapers\u002FGandikota_Unified_Concept_Editing_in_Diffusion_Models_WACV_2024_paper.pdf), [Project](https:\u002F\u002Funified.baulab.info\u002F) |\n| 22) Diffusion Model Alignment Using Direct Preference Optimization | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.12908) |\n| 23) **RAFT:** Reward rAnked FineTuning for Generative Foundation Model Alignment | [**TMLR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.06767) , [Github](https:\u002F\u002Fgithub.com\u002FOptimalScale\u002FLMFlow) |\n| 24) Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation | [**Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.05699), [Github](https:\u002F\u002Fgithub.com\u002FShuoTang123\u002FMATRIX), [Project](https:\u002F\u002Fshuotang123.github.io\u002FMATRIX\u002F) |\n| \u003Ch3 id=\"world-model\">13 World Model\u003C\u002Fh3> | |\n| **Paper**  | **Link** |\n| 1) **NExT-GPT**: Any-to-Any Multimodal LLM | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05519), [GitHub](https:\u002F\u002Fgithub.com\u002FNExT-GPT\u002FNExT-GPT) |\n| \u003Ch3 id=\"video-compression\">14 Video Compression\u003C\u002Fh3> ||\n| **Paper**  | **Link** |\n| 1) **H.261**: Video codec for audiovisual services at p x 64 kbit\u002Fs | [**Paper**](https:\u002F\u002Fwww.itu.int\u002Frec\u002FT-REC-H.261-199303-I\u002Fen) |\n| 2) **H.262**: Information technology - Generic coding of moving pictures and associated audio information: Video | [**Paper**](https:\u002F\u002Fwww.itu.int\u002Frec\u002FT-REC-H.262-201202-I\u002Fen) |\n| 3) **H.263**: Video coding for low bit rate communication | [**Paper**](https:\u002F\u002Fwww.itu.int\u002Frec\u002FT-REC-H.263-200501-I\u002Fen) |\n| 4) **H.264**: Overview of the H.264\u002FAVC video coding standard | [**Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F1218189) |\n| 5) **H.265**: Overview of the High Efficiency Video Coding (HEVC) Standard | [**Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F6316136) |\n| 6) **H.266**: Overview of the Versatile Video Coding (VVC) Standard and its Applications | [**Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F9503377) |\n| 7) **DVC**: An End-to-end Deep Video Compression Framework | [**CVPR 19 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1812.00101), [GitHub](https:\u002F\u002Fgithub.com\u002FGuoLusjtu\u002FDVC\u002Ftree\u002Fmaster) |\n| 8) **OpenDVC**: An Open Source Implementation of the DVC Video Compression Method | [**Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.15862), 
[GitHub](https:\u002F\u002Fgithub.com\u002FRenYang-home\u002FOpenDVC) |\n| 9) **HLVC**: Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement | [**CVPR 20 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2003.01966), [Github](https:\u002F\u002Fgithub.com\u002FRenYang-home\u002FHLVC) |\n| 10) **RLVC**: Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model | [**J-STSP 21 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9288876), [Github](https:\u002F\u002Fgithub.com\u002FRenYang-home\u002FRLVC) |\n| 11) **PLVC**: Perceptual Learned Video Compression with Recurrent Conditional GAN | [**IJCAI 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2109.03082), [Github](https:\u002F\u002Fgithub.com\u002FRenYang-home\u002FPLVC) |\n| 12) **ALVC**: Advancing Learned Video Compression with In-loop Frame Prediction | [**T-CSVT 22 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9950550), [Github](https:\u002F\u002Fgithub.com\u002FRenYang-home\u002FALVC) |\n| 13) **DCVC**: Deep Contextual Video Compression | [**NeurIPS 21 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2021\u002Ffile\u002F96b250a90d3cf0868c83f8c965142d2a-Paper.pdf), [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDCVC\u002Ftree\u002Fmain\u002FDCVC) |\n| 14) **DCVC-TCM**: Temporal Context Mining for Learned Video Compression | [**TM 22 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F9941493), [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDCVC\u002Ftree\u002Fmain\u002FDCVC-TCM) |\n| 15) **DCVC-HEM**: Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression | [**MM 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.05894), [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDCVC\u002Ftree\u002Fmain\u002FDCVC-HEM) |\n| 16) **DCVC-DC**: Neural Video Compression with Diverse Contexts | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.14402), [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDCVC\u002Ftree\u002Fmain\u002FDCVC-DC) |\n| 17) **DCVC-FM**: Neural Video Compression with Feature Modulation | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17414), [Github](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDCVC\u002Ftree\u002Fmain\u002FDCVC-FM) |\n| 18) **SSF**: Scale-Space Flow for End-to-End Optimized Video Compression | [**CVPR 20 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent_CVPR_2020\u002Fhtml\u002FAgustsson_Scale-Space_Flow_for_End-to-End_Optimized_Video_Compression_CVPR_2020_paper.html), [Github](https:\u002F\u002Fgithub.com\u002FInterDigitalInc\u002FCompressAI) |\n| \u003Ch3 id=\"Mamba\">15 Mamba\u003C\u002Fh3> ||\n| \u003Ch4 id=\"theoretical-foundations-and-model-architecture\">15.1 Theoretical Foundations and Model Architecture\u003C\u002Fh4> | |\n| **Paper** | **Link** |\n| 1) **Mamba**: Linear-Time Sequence Modeling with Selective State Spaces | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.00752), [Github](https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fmamba) |\n| 2) Efficiently Modeling Long Sequences with Structured State Spaces | [**ICLR 22 Paper**](https:\u002F\u002Ficlr.cc\u002Fvirtual\u002F2022\u002Fposter\u002F6959), [Github](https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fs4) |\n| 3) Modeling Sequences with Structured State Spaces | [**Paper**](https:\u002F\u002Fpurl.stanford.edu\u002Fmb976vf9362) |\n| 4) Long Range 
Language Modeling via Gated State Spaces | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.13947), [GitHub](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fgated-state-spaces-pytorch) |\n| \u003Ch4 id=\"image-generation-and-visual-applications\">15.2 Image Generation and Visual Applications\u003C\u002Fh4> | |\n| **Paper** | **Link** |\n| 1) Diffusion Models Without Attention | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.18257) |\n| 2) **Pan-Mamba**: Effective Pan-Sharpening with State Space Model  | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12192), [Github](https:\u002F\u002Fgithub.com\u002Falexhe101\u002FPan-Mamba) |\n| 3) Pretraining Without Attention | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.10544), [Github](https:\u002F\u002Fgithub.com\u002Fjxiw\u002FBiGS) |\n| 4) Block-State Transformers | [**NIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F16ccd203e9e3696a7ab0dcf568316379-Abstract-Conference.html) |\n| 5) **Vision Mamba**: Efficient Visual Representation Learning with Bidirectional State Space Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09417), [Github](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FVim) |\n| 6) VMamba: Visual State Space Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10166), [Github](https:\u002F\u002Fgithub.com\u002FMzeroMiko\u002FVMamba) |\n| 7) ZigMa: Zigzag Mamba Diffusion Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.13802), [Github](https:\u002F\u002Ftaohu.me\u002Fzigma\u002F) |\n| 8) **MambaVision**: A Hybrid Mamba-Transformer Vision Backbone | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.08083), [GitHub](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FMambaVision) |\n| \u003Ch4 id=\"video-processing-and-understanding\">15.3 Video Processing and Understanding\u003C\u002Fh4> | |\n| **Paper** | **Link** |\n| 1) Long Movie Clip Classification with State-Space Video Models | [**ECCV 22 Paper**](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-031-19833-5_6), [Github](https:\u002F\u002Fgithub.com\u002Fmd-mohaiminul\u002FViS4mer) |\n| 2) Selective Structured State-Spaces for Long-Form Video Understanding | [**CVPR 23 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fhtml\u002FWang_Selective_Structured_State-Spaces_for_Long-Form_Video_Understanding_CVPR_2023_paper.html) |\n| 3) Efficient Movie Scene Detection Using State-Space Transformers | [**CVPR 23 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fhtml\u002FIslam_Efficient_Movie_Scene_Detection_Using_State-Space_Transformers_CVPR_2023_paper.html), [Github](https:\u002F\u002Fgithub.com\u002Fmd-mohaiminul\u002FTranS4mer) |\n| 4) VideoMamba: State Space Model for Efficient Video Understanding | [**Paper**](http:\u002F\u002Farxiv.org\u002Fabs\u002F2403.06977), [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FVideoMamba) |\n| \u003Ch4 id=\"medical-image-processing\">15.4 Medical Image Processing\u003C\u002Fh4> | |\n| **Paper** | **Link** |\n| 1) **Swin-UMamba**: Mamba-based UNet with ImageNet-based pretraining | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.03302), [Github](https:\u002F\u002Fgithub.com\u002FJiarunLiu\u002FSwin-UMamba) |\n| 2) **MambaIR**: A Simple Baseline for Image Restoration with State-Space Model | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.15648), 
[Github](https:\u002F\u002Fgithub.com\u002Fcsguoh\u002FMambaIR) |\n| 3) VM-UNet: Vision Mamba UNet for Medical Image Segmentation | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.02491), [Github](https:\u002F\u002Fgithub.com\u002FJCruan519\u002FVM-UNet) |\n|  | |\n| \u003Ch3 id=\"existing-high-quality-resources\">16 Existing high-quality resources\u003C\u002Fh3> | |\n| **Resources**  | **Link** |\n| 1) Datawhale - AI视频生成学习 | [Feishu doc](https:\u002F\u002Fdatawhaler.feishu.cn\u002Fdocx\u002FG4LkdaffWopVbwxT1oHceiv9n0c) |\n| 2) A Survey on Generative Diffusion Model | [**TKDE 24 Paper**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.02646.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fchq1155\u002FA-Survey-on-Generative-Diffusion-Model) |\n| 3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.10647), [GitHub](https:\u002F\u002Fgithub.com\u002FChenHsing\u002FAwesome-Video-Diffusion-Models) |\n| 4) Awesome-Text-To-Video：A Survey on Text-to-Video Generation\u002FSynthesis  | [GitHub](https:\u002F\u002Fgithub.com\u002Fjianzhnie\u002Fawesome-text-to-video)|\n| 5) video-generation-survey: A reading list of video generation| [GitHub](https:\u002F\u002Fgithub.com\u002Fyzhang2016\u002Fvideo-generation-survey)|\n| 6) Awesome-Video-Diffusion |  [GitHub](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FAwesome-Video-Diffusion) |\n| 7) Video Generation Task in Papers With Code |  [Task](https:\u002F\u002Fpaperswithcode.com\u002Ftask\u002Fvideo-generation) |\n| 8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17177), [GitHub](https:\u002F\u002Fgithub.com\u002Flichao-sun\u002FSoraReview) |\n| 9) Open-Sora-Plan (PKU-YuanGroup) |  [GitHub](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FOpen-Sora-Plan) |\n| 10) State of the Art on Diffusion Models for Visual Computing | [**Paper**](http:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07204) |\n| 11) Diffusion Models: A Comprehensive Survey of Methods and Applications | [**CSUR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.00796), [GitHub](https:\u002F\u002Fgithub.com\u002FYangLing0818\u002FDiffusion-Models-Papers-Survey-Taxonomy) |\n| 12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable | [**Paper**](https:\u002F\u002Fwww.techrxiv.org\u002Fusers\u002F684880\u002Farticles\u002F718900-generate-impressive-videos-with-text-instructions-a-review-of-openai-sora-stable-diffusion-lumiere-and-comparable) |\n| 13) On the Design Fundamentals of Diffusion Models: A Survey | [**Paper**](http:\u002F\u002Farxiv.org\u002Fabs\u002F2306.04542) |\n| 14) Efficient Diffusion Models for Vision: A Survey | [**Paper**](http:\u002F\u002Farxiv.org\u002Fabs\u002F2210.09292) |\n| 15) Text-to-Image Diffusion Models in Generative AI: A Survey | [**Paper**](http:\u002F\u002Farxiv.org\u002Fabs\u002F2303.07909) |\n| 16) Awesome-Diffusion-Transformers | [GitHub](https:\u002F\u002Fgithub.com\u002FShoufaChen\u002FAwesome-Diffusion-Transformers), [Project](https:\u002F\u002Fwww.shoufachen.com\u002FAwesome-Diffusion-Transformers\u002F) |\n| 17) Open-Sora (HPC-AI Tech) |  [GitHub](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora), [Blog](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fopen-sora) |\n| 18) **LAVIS** - A Library for Language-Vision Intelligence | [**ACL 23 
Paper**](https:\u002F\u002Faclanthology.org\u002F2023.acl-demo.3.pdf), [GitHub](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002Flavis), [Project](https:\u002F\u002Fopensource.salesforce.com\u002FLAVIS\u002F\u002Flatest\u002Findex.html) |\n| 19) **OpenDiT**: An Easy, Fast and Memory-Efficient System for DiT Training and Inference | [GitHub](https:\u002F\u002Fgithub.com\u002FNUS-HPC-AI-Lab\u002FOpenDiT) |\n| 20) Awesome-Long-Context |[GitHub1](https:\u002F\u002Fgithub.com\u002Fzetian1025\u002Fawesome-long-context), [GitHub2](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FAwesome-Long-Context) |\n| 21) Lite-Sora |[GitHub](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Flite-sora\u002F) |\n| 22) **Mira**: A Mini-step Towards Sora-like Long Video Generation |[GitHub](https:\u002F\u002Fgithub.com\u002Fmira-space\u002FMira), [Project](https:\u002F\u002Fmira-space.github.io\u002F) |\n| \u003Ch3 id=\"train\">17 Efficient Training\u003C\u002Fh3> | |\n| \u003Ch4 id=\"train_paral\">17.1 Parallelism based Approach\u003C\u002Fh4> | |\n| \u003Ch5 id=\"train_paral_dp\">17.1.1 Data Parallelism (DP)\u003C\u002Fh5> | |\n| 1) A bridging model for parallel computation | [**Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F79173.79181)|\n| 2) PyTorch Distributed: Experiences on Accelerating Data Parallel Training | [**VLDB 20 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.15704) |\n| \u003Ch5 id=\"train_paral_mp\">17.1.2 Model Parallelism (MP)\u003C\u002Fh5> | |\n| 1) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | [**ArXiv 19 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1909.08053) |\n| 2) TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models | [**PMLR 21 Paper**](https:\u002F\u002Fproceedings.mlr.press\u002Fv139\u002Fli21y.html) |\n| \u003Ch5 id=\"train_paral_pp\">17.1.3 Pipeline Parallelism (PP)\u003C\u002Fh5> | |\n| 1) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | [**NeurIPS 19 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2019\u002Fhash\u002F093f65e080a295f8076b1c5722a46aa2-Abstract.html) |\n| 2) PipeDream: generalized pipeline parallelism for DNN training | [**SOSP 19 Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3341301.3359646) |\n| \u003Ch5 id=\"train_paral_gp\">17.1.4 Generalized Parallelism (GP)\u003C\u002Fh5> | |\n| 1) Mesh-TensorFlow: Deep Learning for Supercomputers | [**ArXiv 18 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.02084) |\n| 2) Beyond Data and Model Parallelism for Deep Neural Networks | [**MLSys 19 Paper**](https:\u002F\u002Fproceedings.mlsys.org\u002Fpaper_files\u002Fpaper\u002F2019\u002Fhash\u002Fb422680f3db0986ddd7f8f126baaf0fa-Abstract.html) |\n| \u003Ch5 id=\"train_paral_zp\">17.1.5 ZeRO Parallelism (ZP)\u003C\u002Fh5> | |\n| 1) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | [**ArXiv 20**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.02054) |\n| 2) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters | [**ACM 20 Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3394486.3406703) |\n| 3) ZeRO-Offload: Democratizing Billion-Scale Model Training | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2101.06840) |\n| 4) PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | [**ArXiv 23**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.11277) |\n| 
\u003Ch4 id=\"train_non\">17.2 Non-parallelism based Approach\u003C\u002Fh4> | |\n| \u003Ch5 id=\"train_non_reduce\">17.2.1 Reducing Activation Memory\u003C\u002Fh5> | |\n| 1) Gist: Efficient Data Encoding for Deep Neural Network Training | [**IEEE 18 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8416872) |\n| 2) Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization | [**MLSys 20 Paper**](https:\u002F\u002Fproceedings.mlsys.org\u002Fpaper_files\u002Fpaper\u002F2020\u002Fhash\u002F0b816ae8f06f8dd3543dc3d9ef196cab-Abstract.html) |\n| 3) Training Deep Nets with Sublinear Memory Cost | [**ArXiv 16 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.06174) |\n| 4) Superneurons: dynamic GPU memory management for training deep neural networks | [**ACM 18 Paper**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3178487.3178491) |\n| \u003Ch5 id=\"train_non_cpu\">17.2.2 CPU-Offloading\u003C\u002Fh5> | |\n| 1) Training Large Neural Networks with Constant Memory using a New Execution Algorithm | [**ArXiv 20 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.05645) |\n| 2) vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design | [**IEEE 16 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F7783721) |\n| \u003Ch5 id=\"train_non_mem\">17.2.3 Memory Efficient Optimizer\u003C\u002Fh5> | |\n| 1) Adafactor: Adaptive Learning Rates with Sublinear Memory Cost | [**PMLR 18 Paper**](https:\u002F\u002Fproceedings.mlr.press\u002Fv80\u002Fshazeer18a.html) |\n| 2) Memory-Efficient Adaptive Optimization for Large-Scale Learning | [**Paper**](http:\u002F\u002Fdml.mathdoc.fr\u002Fitem\u002F1901.11150\u002F) |\n| \u003Ch4 id=\"train_struct\">17.3 Novel Structure\u003C\u002Fh4> | |\n| 1) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05135), [Github](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FELLA) |\n| \u003Ch3 id=\"infer\">18 Efficient Inference\u003C\u002Fh3> | |\n| \u003Ch4 id=\"infer_reduce\">18.1 Reduce Sampling Steps\u003C\u002Fh4> | |\n| \u003Ch5 id=\"infer_reduce_continuous\">18.1.1 Continuous Steps\u003C\u002Fh5> | |\n| 1) Generative Modeling by Estimating Gradients of the Data Distribution | [**NeurIPS 19 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1907.05600) |\n| 2) WaveGrad: Estimating Gradients for Waveform Generation | [**ArXiv 20**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.00713) |\n| 3) Noise Level Limited Sub-Modeling for Diffusion Probabilistic Vocoders | [**ICASSP 21 Paper**](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F9415087) |\n| 4) Noise Estimation for Generative Diffusion Models | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.02600) |\n| \u003Ch5 id=\"infer_reduce_fast\">18.1.2 Fast Sampling\u003C\u002Fh5> | |\n| 1) Denoising Diffusion Implicit Models | [**ICLR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.02502) |\n| 2) DiffWave: A Versatile Diffusion Model for Audio Synthesis | [**ICLR 21 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.09761) |\n| 3) On Fast Sampling of Diffusion Probabilistic Models | [**ArXiv 21**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.00132) |\n| 4) DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps | [**NeurIPS 22 
Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.00927) |\n| 5) DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models | [**ArXiv 22**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.01095) |\n| 6) Fast Sampling of Diffusion Models with Exponential Integrator | [**ICLR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.13902) |\n| \u003Ch5 id=\"infer_reduce_dist\">18.1.3 Step distillation\u003C\u002Fh5> | |\n| 1) On Distillation of Guided Diffusion Models | [**CVPR 23 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.03142) |\n| 2) Progressive Distillation for Fast Sampling of Diffusion Models | [**ICLR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.00512) |\n| 3) SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F41bcc9d3bddd9c90e1f44b29e26d97ff-Abstract-Conference.html) |\n| 4) Tackling the Generative Learning Trilemma with Denoising Diffusion GANs | [**ICLR 22 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.07804) |\n| \u003Ch4 id=\"infer_opt\">18.2 Optimizing Inference\u003C\u002Fh4> | |\n| \u003Ch5 id=\"infer_opt_low\">18.2.1 Low-bit Quantization\u003C\u002Fh5> | |\n| 1) Q-Diffusion: Quantizing Diffusion Models | [**CVPR 23 Paper**](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fhtml\u002FLi_Q-Diffusion_Quantizing_Diffusion_Models_ICCV_2023_paper.html) |\n| 2) Q-DM: An Efficient Low-bit Quantized Diffusion Model | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002Ff1ee1cca0721de55bb35cf28ab95e1b4-Abstract-Conference.html) |\n| 3) Temporal Dynamic Quantization for Diffusion Models | [**NeurIPS 23 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F983591c3e9a0dc94a99134b3238bbe52-Abstract-Conference.html) |\n| \u003Ch5 id=\"infer_opt_ps\">18.2.2 Parallel\u002FSparse inference\u003C\u002Fh5> | |\n| 1) DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models | [**CVPR 24 Paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.19481) |\n| 2) Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models | [**NeurIPS 22 Paper**](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2022\u002Fhash\u002Fb9603de9e49d0838e53b6c9cf9d06556-Abstract-Conference.html) |\n| 3) PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models | [**ArXiv 24**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14430) |\n\n## 引用\n\n如果本项目对您的工作有所帮助，请使用以下格式引用：\n\n```bibtex\n@misc{minisora,\n    title={MiniSora},\n    author={MiniSora社区},\n    url={https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora},\n    year={2024}\n}\n```\n\n```bibtex\n@misc{minisora,\n    title={从DDPM到Sora的基于扩散模型的视频生成模型综述},\n    author={MiniSora社区综述论文组},\n    url={https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora},\n    year={2024}\n}\n```\n\n## Minisora社区微信群\n\n\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_d3907a8723c3.png\" width=\"200\"\u002F>\n  \u003Cdiv>&nbsp;\u003C\u002Fdiv>\n  \u003Cdiv align=\"center\">\n  \u003C\u002Fdiv>\n\u003C\u002Fdiv>\n\n## 
星标历史\n\n[![星标历史图表](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_18a48be19e7f.png)](https:\u002F\u002Fstar-history.com\u002F#mini-sora\u002Fminisora&Date)\n\n## 如何为Mini Sora社区贡献力量\n\n我们非常感谢您对Mini Sora开源社区的贡献，帮助我们让它比现在更加出色！\n\n更多详情，请参阅[贡献指南](.\u002F.github\u002FCONTRIBUTING.md)。\n\n## 社区贡献者\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_readme_a338462049c0.png\" \u002F>\n\u003C\u002Fa>\n\n[your-project-path]: mini-sora\u002Fminisora\n[contributors-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002Fmini-sora\u002Fminisora.svg?style=flat-square\n[contributors-url]: https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fgraphs\u002Fcontributors\n[forks-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fmini-sora\u002Fminisora.svg?style=flat-square\n[forks-url]: https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fnetwork\u002Fmembers\n[stars-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmini-sora\u002Fminisora.svg?style=flat-square\n[stars-url]: https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fstargazers\n[issues-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Fmini-sora\u002Fminisora.svg?style=flat-square\n[issues-url]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Fmini-sora\u002Fminisora.svg\n[license-shield]: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fmini-sora\u002Fminisora.svg?style=flat-square\n[license-url]: https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fblob\u002Fmain\u002FLICENSE","# MiniSora 快速上手指南\n\nMiniSora 是一个由社区驱动的开源项目，旨在探索 Sora 视频生成模型的实现路径。本项目目前主要聚焦于复现相关论文（如 DiT）、技术调研以及组织圆桌讨论，而非直接提供一个“一键生成视频”的成品软件。\n\n以下是参与 MiniSora 社区核心复现任务（基于 XTuner 复现 DiT）及获取学习资源的快速指南。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求，特别是如果您计划参与模型复现训练：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+)\n*   **Python**: 3.8 或更高版本\n*   **GPU 要求**:\n    *   **推理\u002F轻量实验**: 建议至少 16GB 显存 (如 RTX 3090\u002F4090)。\n    *   **训练复现 (DiT)**: 官方推荐配置为 2x A100 (80G) 或同等算力。社区目标是在 8x A100\u002FA6000 或消费级显卡集群上实现训练。\n*   **前置依赖**:\n    *   CUDA 11.7+\n    *   PyTorch 2.0+\n    *   Git\n\n## 安装步骤\n\nMiniSora 的核心复现工作主要依托于 [XTuner](https:\u002F\u002Fgithub.com\u002FinternLM\u002Fxtuner) 框架进行。请按照以下步骤搭建基础环境：\n\n1.  **克隆仓库**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora.git\n    cd minisora\n    ```\n\n2.  **安装核心依赖 (XTuner)**\n    由于 MiniSora-DiT 项目深度集成 XTuner，建议优先安装 XTuner 及其依赖。国内开发者推荐使用镜像源加速安装。\n\n    ```bash\n    # 创建虚拟环境 (推荐)\n    conda create -n minisora python=3.10 -y\n    conda activate minisora\n\n    # 安装 PyTorch (根据实际 CUDA 版本调整，此处以 CUDA 11.8 为例)\n    pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n\n    # 安装 XTuner (使用国内镜像加速)\n    pip install -U xtuner[all] -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n3.  **安装 MiniSora 特定依赖**\n    进入具体的复现子模块（如 MiniSora-DiT）安装额外需求：\n\n    ```bash\n    cd codes\u002Fminisora-DiT  # 路径可能随项目更新变动，请以实际目录结构为准\n    pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n## 基本使用\n\nMiniSora 目前主要提供**代码复现参考**、**技术文档**和**论文解读**。最直接的使用方式是运行社区复现的 DiT 模型或阅读技术综述。\n\n### 1. 
运行 DiT 复现示例 (基于 XTuner)\n\n如果您已配置好环境，可以使用 XTuner 启动 DiT 的训练或推理流程。以下是一个典型的训练配置启动命令示例：\n\n```bash\n# 使用 XTuner 启动 DiT 训练 (需替换为具体的配置文件路径)\nxtuner train .\u002Fconfigs\u002Fdit_config.py --work-dir .\u002Fwork_dirs\u002Fdit_output\n```\n\n*注意：具体的配置文件 (`dit_config.py`) 和数据集路径需要在 `minisora-DiT` 子目录中根据官方文档进行详细配置。*\n\n### 2. 获取技术资源与论文解读\n\n对于大多数开发者，快速上手的最佳方式是查阅社区整理的技术综述和论文笔记：\n\n*   **查看 Sora 技术综述**:\n    访问 `docs\u002Fsurvey_README.md` 获取从 DDPM 到 Sora 的视频生成模型全面回顾。\n    \n*   **阅读论文解读笔记**:\n    项目 `notes\u002F` 目录下包含了社区对关键论文的详细中文解读，包括：\n    *   **Sora**: [Video generation models as world simulators](https:\u002F\u002Fopenai.com\u002Fresearch\u002Fvideo-generation-models-as-world-simulators)\n    *   **DiT**: [Scalable Diffusion Models with Transformers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.09748)\n    *   **Stable Diffusion 3 (MM-DiT)**: 详见 `notes\u002FSD3_zh-CN.md`\n    *   **Latte**: [Latent Diffusion Transformer for Video Generation](https:\u002F\u002Fmaxin-cn.github.io\u002Flatte_project\u002F)\n\n### 3. 加入社区交流\n\nMiniSora 强调社区协作，建议加入微信群参与圆桌讨论和获取最新进展：\n*   **微信群二维码**: [点击此处查看](https:\u002F\u002Fcdn.vansin.top\u002Fminisora.jpg)\n*   **GitHub Discussions**: 关注 [Reproduction Group](.\u002Fcodes\u002FREADME.md) 获取最新的复现目标和招募信息。","某高校 AI 实验室的研究团队正致力于复现 Sora 的核心架构，以探索视频生成模型在有限算力下的落地路径。\n\n### 没有 minisora 时\n- **复现门槛极高**：团队需从零梳理 Sora 相关论文（如 DiT）与代码，缺乏系统性的技术路线图，导致研究方向模糊且耗时漫长。\n- **算力资源受限**：原始 Sora 架构对显存要求苛刻，实验室仅有的几张 RTX4090 或单卡 A100 无法支撑完整训练，被迫放弃实验。\n- **技术细节黑盒**：从 DDPM 到 Sora 的演进过程中，关键的技术改进点（如时空补丁化、Transformer 结构优化）缺乏详细解读，成员只能盲目试错。\n- **协作效率低下**：团队成员各自为战，缺乏统一的代码基准和定期的深度研讨，导致重复造轮子且难以沉淀通用成果。\n\n### 使用 minisora 后\n- **路径清晰指引**：minisora 提供了从理论综述到代码复现的完整路线图（如\"From DDPM to Sora\"综述），让团队迅速锁定 DiT 等关键突破口。\n- **轻量化适配**：借助 minisora-DiT 项目，团队利用 XTuner 技术在 2 张 A100 甚至消费级显卡上成功跑通了长序列训练，大幅降低了硬件门槛。\n- **深度技术拆解**：通过社区组织的圆桌会议和论文精读（如 Stable Diffusion 3 解读），成员快速掌握了 MM-DiT 等核心机制，避免了理解偏差。\n- **开源协同加速**：团队直接接入 minisora 社区的贡献者网络，复用已有的复现代码库，将原本数月的预研周期缩短至数周。\n\nminisora 通过构建低门槛、高协同的开源生态，让普通研究者也能在有限资源下触达视频生成领域的前沿技术。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmini-sora_minisora_4c0382e9.jpg","mini-sora","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmini-sora_e46a03a6.jpg",null,"https:\u002F\u002Fgithub.com\u002Fmini-sora",[79,83,87,90],{"name":80,"color":81,"percentage":82},"Python","#3572A5",97.9,{"name":84,"color":85,"percentage":86},"Jupyter Notebook","#DA5B0B",1,{"name":88,"color":89,"percentage":86},"Shell","#89e051",{"name":91,"color":92,"percentage":93},"CSS","#663399",0.2,1284,149,"2026-04-14T09:04:18","Apache-2.0",4,"未说明","训练推荐：8x A100 (80G), 8x A6000 (48G) 或 RTX4090 (24G)；MiniSora-DiT 复现项目支持：2x A100。目标为 GPU 友好型，支持低显存需求。",{"notes":102,"python":99,"dependencies":103},"该项目是一个社区驱动的开源计划，旨在探索 Sora 的实现路径。核心子项目 MiniSora-DiT 依赖 XTuner 框架复现 DiT 论文，要求贡献者熟悉 OpenMMLab MMEngine 机制。项目目标是实现低显存、高效率的训练与推理（目标分辨率 480p，时长 3-10 秒）。具体安装脚本和版本要求在提供的 README 片段中未详细列出，需参考子项目仓库。",[104,105,106,107],"XTuner","OpenMMLab MMEngine","DiT","PyTorch (隐含)",[15,61],[110,111,112],"diffusion","sora","video-generation","2026-03-27T02:49:30.150509","2026-04-19T03:05:10.289684",[116,121,126,131,136,141,146],{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},41504,"视频编码标准（如 H.264\u002FH.265）与视频生成中的压缩技术有什么关系？","视频压缩模型在编解码当前帧时会将最新解码的帧作为参考，其架构类似于变分自编码器（VAE）。如果将熵模型和字节流部分替换为生成模型（如 DiT），视频生成过程将在视频压缩编码器的潜在空间中进行，整个模型可被视为仅参考前一生成帧的自回归视频生成模型。此外，视频压缩模型作为一种典型的时空一致性模型，包含了不依赖 3D 
卷积或时空补丁的时空信息，这种结合时空信息以实现一致性的方法是视频生成技术的重要组成部分。","https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fissues\u002F274",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},41505,"该项目使用什么开源许可证？","项目已添加 Apache 2.0 许可证（通过 PR #171），贡献者可以在此许可下自由使用和贡献代码。","https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fissues\u002F169",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},41506,"论文列表中应该保留 arXiv 链接还是会议官方链接？","建议同时保留 arXiv 链接和简写的会议名称（例如 'ICCV 23 Paper'）。因为 arXiv 版本通常包含更详细的信息和补充资源，且对所有人开放访问，具有更高的参考价值。如果论文已有会议版本，可以在标记中注明会议名称，但链接仍优先指向 arXiv 以便获取完整内容。","https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fissues\u002F101",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},41507,"如何正确提交 PR 以自动关闭相关的 Issue？","在提交 Pull Request (PR) 时，请在描述中包含 'Close #Issue编号'（例如 'Close #244'），并将其填写在 PR 的 'Related Issues' 字段中。这样当 PR 被合并时，对应的 Issue 会自动关闭。","https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fissues\u002F244",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},41508,"《Taming Transformers for High-Resolution Image Synthesis》这篇论文属于 Diffusion Transformer 类别吗？","这篇文章的方法与典型的扩散模型和扩散 Transformer 有所不同。它主要是一个结合了 CNN 和 Transformer 的架构，其核心结构是 Unet 或自编码器（AE），并非专门针对视频生成设计，因此分类较为特殊，通常不被直接归类为纯粹的 Diffusion Transformer。","https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fissues\u002F255",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},41509,"在更新英文 README 时，是否需要包含中文翻译材料？","不需要。在英文版 README_EN.md 中添加中文翻译材料是没有意义的。英文版应仅包含英文内容，保持语言的一致性。如果需要多语言支持，应通过独立的中文文件（如 README_CN.md）并提供语言切换链接来实现。","https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fissues\u002F147",{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},41510,"关于 Latte 的笔记文件是中文写的，如何在英文文档中说明？","应在英文 README 中明确标注这些关于 Latte 的文件（位于 notes 文件夹）是用中文编写的。可以使用类似 'xx.pdf in Chinese' 的格式进行说明，以便英文用户了解语言情况。同时，英文版本的用词可以优化，例如用 'paper' 代替 'thesis'，并将 'Latte' 首字母大写。","https:\u002F\u002Fgithub.com\u002Fmini-sora\u002Fminisora\u002Fissues\u002F81",[]]