minisora
MiniSora is a community-driven open-source project that explores implementation paths for, and future directions of, OpenAI's Sora video generation model. Since Sora's technical details have not been fully disclosed, MiniSora aims to lower the barrier to video generation research by reproducing the core related papers (such as DiT) and mapping out the technology's evolution (from DDPM to Sora).
The project targets a real pain point: ordinary developers and researchers can rarely access or reproduce state-of-the-art video generation models. It therefore sets concrete reproduction goals: stay "GPU-friendly," so that training and inference can run on consumer GPUs (such as the RTX 4090) or a small number of professional cards, and stay efficient, generating 3-10 second, 480p videos within a reasonably short training time. The community also hosts regular roundtables with in-depth readings of cutting-edge papers such as Stable Diffusion 3 and Movie Gen.
MiniSora is a good fit for developers and researchers interested in AI video generation, and for students who want a deeper understanding of diffusion models. Its distinctive technical highlight is an efficient reproduction of the DiT architecture built on the XTuner framework, with technical support and compute resources contributed by core developers. By joining MiniSora you gain a systematic technical survey and a community of like-minded collaborators advancing open-source video generation.
Use Case
A university AI lab's research team is working to reproduce Sora's core architecture, exploring how video generation models can be built under limited compute.
Without minisora
- Steep reproduction barrier: the team must piece together Sora-related papers (such as DiT) and code from scratch, with no systematic technical roadmap, leaving the research direction vague and progress slow.
- Limited compute: the original Sora-style architecture has punishing VRAM requirements; the lab's few RTX 4090s or single A100 cannot support full training, forcing experiments to be abandoned.
- Black-box technical details: the key improvements along the path from DDPM to Sora (such as spacetime patchification and Transformer optimizations) lack detailed explanations, so members resort to blind trial and error.
- Inefficient collaboration: members work in isolation without a shared code baseline or regular in-depth discussion, reinventing wheels and failing to accumulate reusable results.
With minisora
- Clear roadmap: minisora provides a complete path from theory survey to code reproduction (e.g., the "From DDPM to Sora" survey), letting the team quickly zero in on key building blocks such as DiT.
- Lightweight adaptation: with the minisora-DiT project, the team uses XTuner to run long-sequence training on 2 A100s or even consumer GPUs, dramatically lowering the hardware bar.
- In-depth technical breakdowns: through community roundtables and close paper readings (such as the Stable Diffusion 3 walkthrough), members quickly grasp core mechanisms such as MM-DiT and avoid misreadings.
- Open-source collaboration: the team plugs directly into minisora's contributor network and reuses existing reproduction code, cutting a months-long exploratory phase down to weeks.
By building a low-barrier, highly collaborative open-source ecosystem, minisora lets ordinary researchers reach the frontier of video generation with limited resources.
Runtime Requirements
- Recommended for training: 8x A100 (80G), 8x A6000 (48G), or RTX 4090 (24G)
- MiniSora-DiT reproduction: 2x A100
- Goal: GPU-friendly, supporting low-VRAM setups

Quick Start
MiniSora Community
English | 简体中文
👋 Join us on WeChat
The MiniSora open-source community is a community-driven initiative organized spontaneously by its members. The MiniSora community aims to explore Sora's implementation path and future directions:
- Hold regular roundtable discussions with the Sora team and the community to explore possibilities.
- Dig into existing technical paths for video generation.
- Lead the reproduction of papers and research related to Sora, such as DiT (MiniSora-DiT).
- Conduct a comprehensive review of Sora-related technology and its implementations, i.e., "From DDPM to Sora: A Survey of Video Generation Models Based on Diffusion Models".
Hot News
- OpenAI Sora is coming soon!
- Movie Gen: A Cast of Media Foundation Models
- Stable Diffusion 3 (MM-DiT): Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- MiniSora-DiT: Reproducing the DiT Paper with XTuner
- An Introduction to MiniSora and the Latest Progress on Reproducing Sora
MiniSora Community Reproduction Group
MiniSora's Goals for Reproducing Sora
- GPU-friendly: ideally, low requirements on GPU memory size and GPU count, for example the ability to train and run inference on compute comparable to 8x A100 80G, 8x A6000 48G, or RTX 4090 24G cards.
- Training efficiency: good results without requiring an extended training time.
- Inference efficiency: no need for long or high-resolution output at inference time; acceptable parameters are 3-10 seconds in length and 480p resolution.
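To make the inference goal concrete, the back-of-the-envelope sketch below estimates the transformer sequence length such a clip implies for a DiT-style latent model. The frame rate, VAE downsampling factor, and patch size used here are illustrative assumptions, not MiniSora's actual settings:

```python
def dit_token_count(seconds, height, width, fps=8,
                    vae_downsample=8, patch=2):
    """Rough sequence-length estimate for a DiT-style latent video model.

    Assumed pipeline: a VAE downsamples each frame spatially by
    `vae_downsample`, then the latent is split into `patch` x `patch`
    tokens per frame. All settings are illustrative assumptions.
    """
    frames = seconds * fps
    lat_h, lat_w = height // vae_downsample, width // vae_downsample
    tokens_per_frame = (lat_h // patch) * (lat_w // patch)
    return frames * tokens_per_frame

# The goals above target 3-10 s clips at 480p (854x480):
print(dit_token_count(3, 480, 854))   # 38160 tokens
print(dit_token_count(10, 480, 854))  # 127200 tokens
```

Even under these modest assumptions, a 10-second 480p clip is on the order of 10^5 tokens per sample, which is why long-sequence training support (such as XTuner's) matters for reproduction.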
MiniSora-DiT: Reproducing the DiT Paper with XTuner
https://github.com/mini-sora/minisora-DiT
Requirements
We are recruiting MiniSora community contributors to reproduce DiT using XTuner.
We hope community members have the following background:
- Familiarity with the OpenMMLab MMEngine mechanism.
- Familiarity with DiT.
Background
- DiT shares authors with Sora.
- XTuner has the core technology to efficiently train sequences up to 1000K in length.
Support
Recent Roundtable Discussions
Stable Diffusion 3 Paper Reading: MM-DiT
Speaker: MMagic core contributors
Live: December 3, 20:00
Highlights: MMagic core contributors walk through the Stable Diffusion 3 paper, covering its architectural details and design principles.
Slides: Feishu link
Highlights from Past Discussions
Night Talk with Sora: A Video Diffusion Overview
Zhihu notes: A Survey on Generative Diffusion Models: An Overview of Generative Diffusion Models
Paper Reading Program
Technical report: Video Generation Models as World Simulators
Stable Cascade (ICLR 24 paper): Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models
Updating...
Call for Speakers
Related Work
- 01 Diffusion Models
- 02 Diffusion Transformer
- 03 Baseline Video Generation Models
- 04 Diffusion UNet
- 05 Video Generation
- 06 Dataset
- 07 Patchifying Methods
- 08 Long-context
- 09 Audio Related Resource
- 10 Consistency
- 11 Prompt Engineering
- 12 Security
- 13 World Model
- 14 Video Compression
- 15 Mamba
- 16 Existing High-quality Resources
- 17 Efficient Training
- 18 Efficient Inference
01 Diffusion Models
| Paper | Link |
|---|---|
| 1) Guided-Diffusion: Diffusion Models Beat GANs on Image Synthesis | NeurIPS 21 Paper, GitHub |
| 2) Latent Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models | CVPR 22 Paper, GitHub |
| 3) EDM: Elucidating the Design Space of Diffusion-Based Generative Models | NeurIPS 22 Paper, GitHub |
| 4) DDPM: Denoising Diffusion Probabilistic Models | NeurIPS 20 Paper, GitHub |
| 5) DDIM: Denoising Diffusion Implicit Models | ICLR 21 Paper, GitHub |
| 6) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations | ICLR 21 Paper, GitHub, Blog |
| 7) Stable Cascade: Würstchen: An efficient architecture for large-scale text-to-image diffusion models | ICLR 24 Paper, GitHub, Blog |
| 8) Diffusion Models in Vision: A Survey | TPAMI 23 Paper, GitHub |
| 9) Improved DDPM: Improved Denoising Diffusion Probabilistic Models | ICML 21 Paper, Github |
| 10) Classifier-free diffusion guidance | NIPS 21 Paper |
| 11) Glide: Towards photorealistic image generation and editing with text-guided diffusion models | Paper, Github |
| 12) VQ-DDM: Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation | CVPR 22 Paper, Github |
| 13) Diffusion Models for Medical Anomaly Detection | Paper, Github |
| 14) Generation of Anonymous Chest Radiographs Using Latent Diffusion Models for Training Thoracic Abnormality Classification Systems | Paper |
| 15) DiffusionDet: Diffusion Model for Object Detection | ICCV 23 Paper, Github |
| 16) Label-efficient semantic segmentation with diffusion models | ICLR 22 Paper, Github, Project |
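As a reminder of the mechanism the DDPM line of work above builds on, here is a minimal NumPy sketch of the closed-form forward (noising) process from the DDPM paper; the schedule values mirror the paper's linear schedule, everything else is a toy illustration:

```python
import numpy as np

def ddpm_forward(x0, t, betas, rng):
    """Closed-form sample from q(x_t | x_0) in DDPM (Ho et al., 2020):
        x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I),
    where abar_t is the cumulative product of (1 - beta_s) up to step t.
    """
    alpha_bar = np.cumprod(1.0 - betas)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Linear beta schedule with 1000 steps, as in the original paper.
betas = np.linspace(1e-4, 0.02, 1000)
rng = np.random.default_rng(0)
xt, eps = ddpm_forward(np.ones((4, 4)), 999, betas, rng)
```

By the last timestep almost all signal is destroyed (the cumulative product is tiny), which is what lets sampling start from pure Gaussian noise.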
02 Diffusion Transformer
| Paper | Link |
|---|---|
| 1) UViT: All are Worth Words: A ViT Backbone for Diffusion Models | CVPR 23 Paper, GitHub, ModelScope |
| 2) DiT: Scalable Diffusion Models with Transformers | ICCV 23 Paper, GitHub, Project, ModelScope |
| 3) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | ArXiv 23, GitHub, ModelScope |
| 4) FiT: Flexible Vision Transformer for Diffusion Model | ArXiv 24, GitHub |
| 5) k-diffusion: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers | ArXiv 24, GitHub |
| 6) Large-DiT: Large Diffusion Transformer | GitHub |
| 7) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks | ArXiv 24, GitHub |
| 8) Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis | Paper, Blog |
| 9) PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | ArXiv 24, Project |
| 10) PIXART-α: Fast Training of Diffusion Transformer for Photorealistic Text-To-Image Synthesis | ArXiv 23, GitHub, ModelScope |
| 11) PIXART-δ: Fast and Controllable Image Generation With Latent Consistency Model | ArXiv 24 |
| 12) Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers | ArXiv 24, GitHub |
| 13) DDM: Deconstructing Denoising Diffusion Models for Self-Supervised Learning | ArXiv 24 |
| 14) Autoregressive Image Generation without Vector Quantization | ArXiv 24, GitHub |
| 15) Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | ArXiv 24 |
| 16) Scaling Diffusion Language Models via Adaptation from Autoregressive Models | ArXiv 24 |
| 17) Large Language Diffusion Models | ArXiv 25 |
03 Baseline Video Generation Models
| Paper | Link |
|---|---|
| 1) ViViT: A Video Vision Transformer | ICCV 21 Paper, GitHub |
| 2) VideoLDM: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | CVPR 23 Paper |
| 3) DiT: Scalable Diffusion Models with Transformers | ICCV 23 Paper, Github, Project, ModelScope |
| 4) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators | ArXiv 23, GitHub |
| 5) Latte: Latent Diffusion Transformer for Video Generation | ArXiv 24, GitHub, Project, ModelScope |
04 Diffusion UNet
| Paper | Link |
|---|---|
| 1) Taming Transformers for High-Resolution Image Synthesis | CVPR 21 Paper, GitHub, Project |
| 2) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | ArXiv 24, Github |
05 Video Generation
| Paper | Link |
|---|---|
| 1) Animatediff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning | ICLR 24 Paper, GitHub, ModelScope |
| 2) I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models | ArXiv 23, GitHub, ModelScope |
| 3) Imagen Video: High Definition Video Generation with Diffusion Models | ArXiv 22 |
| 4) MoCoGAN: Decomposing Motion and Content for Video Generation | CVPR 18 Paper |
| 5) Adversarial Video Generation on Complex Datasets | Paper |
| 6) W.A.L.T: Photorealistic Video Generation with Diffusion Models | ArXiv 23, Project |
| 7) VideoGPT: Video Generation using VQ-VAE and Transformers | ArXiv 21, GitHub |
| 8) Video Diffusion Models | ArXiv 22, GitHub, Project |
| 9) MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | NeurIPS 22 Paper, GitHub, Project, Blog |
| 10) VideoPoet: A Large Language Model for Zero-Shot Video Generation | ArXiv 23, Project, Blog |
| 11) MAGVIT: Masked Generative Video Transformer | CVPR 23 Paper, GitHub, Project, Colab |
| 12) EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions | ArXiv 24, GitHub, Project |
| 13) SimDA: Simple Diffusion Adapter for Efficient Video Generation | Paper, GitHub, Project |
| 14) StableVideo: Text-driven Consistency-aware Diffusion Video Editing | ICCV 23 Paper, GitHub, Project |
| 15) SVD: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | Paper, GitHub |
| 16) ADD: Adversarial Diffusion Distillation | Paper, GitHub |
| 17) GenTron: Diffusion Transformers for Image and Video Generation | CVPR 24 Paper, Project |
| 18) LFDM: Conditional Image-to-Video Generation with Latent Flow Diffusion Models | CVPR 23 Paper, GitHub |
| 19) MotionDirector: Motion Customization of Text-to-Video Diffusion Models | ArXiv 23, GitHub |
| 20) TGAN-ODE: Latent Neural Differential Equations for Video Generation | Paper, GitHub |
| 21) VideoCrafter1: Open Diffusion Models for High-Quality Video Generation | ArXiv 23, GitHub |
| 22) VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models | ArXiv 24, GitHub |
| 23) LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation | ArXiv 22, GitHub |
| 24) LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models | ArXiv 23, GitHub, Project |
| 25) PYoCo: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | ICCV 23 Paper, Project |
| 26) VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | CVPR 23 Paper |
| 27) Movie Gen: A Cast of Media Foundation Models | Paper, Project |
| 28) Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model | ArXiv 25, Project |
06 Dataset
6.1 Public Datasets
| Dataset Name - Paper | Link |
|---|---|
| 1) Panda-70M - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers, 70M Clips, 720P, Downloadable | CVPR 24 Paper, Github, Project, ModelScope |
| 2) InternVid-10M - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, 10M Clips, 720P, Downloadable | ArXiv 24, Github |
| 3) CelebV-Text - CelebV-Text: A Large-Scale Facial Text-Video Dataset, 70K Clips, 720P, Downloadable | CVPR 23 Paper, Github, Project |
| 4) HD-VG-130M - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation, 130M Clips, 720P, Downloadable | ArXiv 23, Github, Tool |
| 5) HD-VILA-100M - Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions, 100M Clips, 720P, Downloadable | CVPR 22 Paper, Github |
| 6) VideoCC - Learning Audio-Video Modalities from Image Captions, 10.3M Clips, 720P, Downloadable | ECCV 22 Paper, Github |
| 7) YT-Temporal-180M - MERLOT: Multimodal Neural Script Knowledge Models, 180M Clips, 480P, Downloadable | NeurIPS 21 Paper, Github, Project |
| 8) HowTo100M - HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, 136M Clips, 240P, Downloadable | ICCV 19 Paper, Github, Project |
| 9) UCF101 - UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, 13K Clips, 240P, Downloadable | CVPR 12 Paper, Project |
| 10) MSVD - Collecting Highly Parallel Data for Paraphrase Evaluation, 122K Clips, 240P, Downloadable | ACL 11 Paper, Project |
| 11) Fashion-Text2Video - A human video dataset with rich label and text annotations, 600 Videos, 480P, Downloadable | ArXiv 23, Project |
| 12) LAION-5B - A dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M, 5B Clips, Downloadable | NeurIPS 22 Paper, Project |
| 13) ActivityNet Captions - ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time, 20k Videos, Downloadable | ArXiv 17 Paper, Project |
| 14) MSR-VTT - A large-scale video benchmark for video understanding, 10k Clips, Downloadable | CVPR 16 Paper, Project |
| 15) The Cityscapes Dataset - Benchmark suite and evaluation server for pixel-level, instance-level, and panoptic semantic labeling, Downloadable | ArXiv 16 Paper, Project |
| 16) Youku-mPLUG - First open-source large-scale Chinese video-text dataset, Downloadable | ArXiv 23, Project, ModelScope |
| 17) VidProM - VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models, 6.69M, Downloadable | ArXiv 24, Github |
| 18) Pixabay100 - A video dataset collected from Pixabay, Downloadable | Github |
| 19) WebVid - Large-scale text-video dataset, containing 10 million video-text pairs scraped from stock footage sites | ArXiv 21, Project, ModelScope |
| 20) MiraData (Mini-Sora Data) - A Large-Scale Video Dataset with Long Durations and Structured Captions, 10M video-text pairs | Github, Project |
| 21) IDForge - A video dataset featuring scenes of people speaking, 300k Clips, Downloadable | ArXiv 24, Github |
6.2 Video Augmentation Methods
6.2.1 Basic Transformations
| Paper | Link |
|---|---|
| Three-stream CNNs for action recognition | PRL 17 Paper |
| Dynamic Hand Gesture Recognition Using Multi-direction 3D Convolutional Neural Networks | EL 19 Paper |
| Intra-clip Aggregation for Video Person Re-identification | ICIP 20 Paper |
| VideoMix: Rethinking Data Augmentation for Video Classification | CVPR 20 Paper |
| mixup: Beyond Empirical Risk Minimization | ICLR 17 Paper |
| CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features | ICCV 19 Paper |
| Video Salient Object Detection via Fully Convolutional Networks | ICIP 18 Paper |
| Illumination-Based Data Augmentation for Robust Background Subtraction | SKIMA 19 Paper |
| Image editing-based data augmentation for illumination-insensitive background subtraction | EIM 20 Paper |
6.2.2 Feature Space
| Paper | Link |
|---|---|
| Feature Re-Learning with Data Augmentation for Content-based Video Recommendation | ACM 18 Paper |
| GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | Trans 21 Paper |
6.2.3 GAN-based Augmentation
| Paper | Link |
|---|---|
| Deep Video-Based Performance Cloning | CVPR 18 Paper |
| Adversarial Action Data Augmentation for Similar Gesture Action Recognition | IJCNN 19 Paper |
| Self-Paced Video Data Augmentation by Generative Adversarial Networks with Insufficient Samples | MM 20 Paper |
| GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | Trans 20 Paper |
| Dynamic Facial Expression Generation on Hilbert Hypersphere With Conditional Wasserstein Generative Adversarial Nets | TPAMI 20 Paper |
| CrowdGAN: Identity-Free Interactive Crowd Video Generation and Beyond | TPAMI 22 Paper |
6.2.4 Encoder/Decoder Based
| Paper | Link |
|---|---|
| Rotationally-Temporally Consistent Novel View Synthesis of Human Performance Video | ECCV 20 Paper |
| Autoencoder-based Data Augmentation for Deepfake Detection | ACM 23 Paper |
6.2.5 Simulation
| Paper | Link |
|---|---|
| A data augmentation methodology for training machine/deep learning gait recognition algorithms | CVPR 16 Paper |
| ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications | IEEE 21 Paper |
| Mid-Air: A Multi-Modal Dataset for Extremely Low Altitude Drone Flights | CVPR 19 Paper |
| Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models | IJCV 19 Paper |
| Using synthetic data for person tracking under adverse weather conditions | IVC 21 Paper |
| Unlimited Road-scene Synthetic Annotation (URSA) Dataset | ITSC 18 Paper |
| SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data | CVPR 21 Paper |
| Universal Semantic Segmentation for Fisheye Urban Driving Images | SMC 20 Paper |
07 Patchifying Methods
| Paper | Link |
|---|---|
| 1) ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | CVPR 21 Paper, Github |
| 2) MAE: Masked Autoencoders Are Scalable Vision Learners | CVPR 22 Paper, Github |
| 3) ViViT: A Video Vision Transformer (-) | ICCV 21 Paper, GitHub |
| 4) DiT: Scalable Diffusion Models with Transformers (-) | ICCV 23 Paper, GitHub, Project, ModelScope |
| 5) U-ViT: All are Worth Words: A ViT Backbone for Diffusion Models (-) | CVPR 23 Paper, GitHub, ModelScope |
| 6) FlexiViT: One Model for All Patch Sizes | Paper, Github |
| 7) Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | ArXiv 23, Github |
| 8) VQ-VAE: Neural Discrete Representation Learning | Paper, Github |
| 9) VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | CVPR 21 Paper, Github |
| 10) LVT: Latent Video Transformer | Paper, Github |
| 11) VideoGPT: Video Generation using VQ-VAE and Transformers (-) | ArXiv 21, GitHub |
| 12) Predicting Video with VQVAE | ArXiv 21 |
| 13) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | ICLR 23 Paper, Github |
| 14) TATS: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | ECCV 22 Paper, Github |
| 15) MAGVIT: Masked Generative Video Transformer (-) | CVPR 23 Paper, GitHub, Project, Colab |
| 16) MagViT2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | ICLR 24 Paper, Github |
| 17) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) | ArXiv 23, Project, Blog |
| 18) CLIP: Learning Transferable Visual Models From Natural Language Supervision | CVPR 21 Paper, Github |
| 19) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ArXiv 22, Github |
| 20) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ArXiv 23, Github |
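The ViT-style patchifying step that runs through the list above can be sketched in a few lines. This is a minimal NumPy illustration of splitting an image into non-overlapping flattened patches, not any specific model's implementation:

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened to a vector: the ViT "16x16 words" step
    (a small p is used below purely for illustration)."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image dims must be divisible by p"
    # (H//p, p, W//p, p, C) -> (H//p, W//p, p, p, C) -> (N, p*p*C)
    x = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * C)

img = np.arange(4 * 4 * 3).reshape(4, 4, 3)
tokens = patchify(img, 2)
print(tokens.shape)  # (4, 12)
```

Video models in this list extend the same idea to spacetime patches, adding a temporal extent to each patch.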
08 Long-context
| Paper | Link |
|---|---|
| 1) World Model on Million-Length Video And Language With RingAttention | ArXiv 24, GitHub |
| 2) Ring Attention with Blockwise Transformers for Near-Infinite Context | ArXiv 23, GitHub |
| 3) Extending LLMs' Context Window with 100 Samples | ArXiv 24, GitHub |
| 4) Efficient Streaming Language Models with Attention Sinks | ICLR 24 Paper, GitHub |
| 5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey | Paper |
| 6) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | CVPR 24 Paper, GitHub, Project |
| 7) MemoryBank: Enhancing Large Language Models with Long-Term Memory | Paper, GitHub |
09 Audio Related Resource
| Paper | Link |
|---|---|
| 1) Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion | ArXiv 24, Github, Blog |
| 2) MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | CVPR 23 Paper, GitHub |
| 3) Pengi: An Audio Language Model for Audio Tasks | NeurIPS 23 Paper, GitHub |
| 4) Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset | NeurIPS 23 Paper, GitHub |
| 5) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | ArXiv 23, GitHub |
| 6) NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality | TPAMI 24 Paper, GitHub |
| 7) NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | ICLR 24 Paper, GitHub |
| 8) UniAudio: An Audio Foundation Model Toward Universal Audio Generation | ArXiv 23, GitHub |
| 9) Diffsound: Discrete Diffusion Model for Text-to-sound Generation | TASLP 22 Paper |
| 10) AudioGen: Textually Guided Audio Generation | ICLR 23 Paper, Project |
| 11) AudioLDM: Text-to-audio generation with latent diffusion models | ICML 23 Paper, GitHub, Project, Huggingface |
| 12) AudioLDM2: Learning Holistic Audio Generation with Self-supervised Pretraining | ArXiv 23, GitHub, Project, Huggingface |
| 13) Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | ICML 23 Paper, GitHub |
| 14) Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | ArXiv 23 |
| 15) TANGO: Text-to-audio generation using instruction-tuned LLM and latent diffusion model | ArXiv 23, GitHub, Project, Huggingface |
| 16) AudioLM: a Language Modeling Approach to Audio Generation | ArXiv 22 |
| 17) AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | ArXiv 23, GitHub |
| 18) MusicGen: Simple and Controllable Music Generation | NeurIPS 23 Paper, GitHub |
| 19) LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | ArXiv 23 |
| 20) Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners | CVPR 24 Paper |
| 21) Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | EMNLP 23 Paper |
| 22) Audio-Visual LLM for Video Understanding | ArXiv 23 |
| 23) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) | ArXiv 23, Project, Blog |
| 24) Movie Gen: A Cast of Media Foundation Models | Paper, Project |
10 Consistency
| Paper | Link |
|---|---|
| 1) Consistency Models | Paper, GitHub |
| 2) Improved Techniques for Training Consistency Models | ArXiv 23 |
| 3) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations (-) | ICLR 21 Paper, GitHub, Blog |
| 4) Improved Techniques for Training Score-Based Generative Models | NIPS 20 Paper, GitHub |
| 5) Generative Modeling by Estimating Gradients of the Data Distribution | NIPS 19 Paper, GitHub |
| 6) Maximum Likelihood Training of Score-Based Diffusion Models | NIPS 21 Paper, GitHub |
| 7) Layered Neural Atlases for Consistent Video Editing | TOG 21 Paper, GitHub, Project |
| 8) StableVideo: Text-driven Consistency-aware Diffusion Video Editing | ICCV 23 Paper, GitHub, Project |
| 9) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing | Paper, GitHub, Project |
| 10) Sora Generates Videos with Stunning Geometrical Consistency | Paper, GitHub, Project |
| 11) Efficient One-stage Video Object Detection by Exploiting Temporal Consistency | ECCV 22 Paper, GitHub |
| 12) Bootstrap Motion Forecasting With Self-Consistent Constraints | ICCV 23 Paper |
| 13) Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting | Paper |
| 14) Enhancing Multi-Camera People Tracking with Anchor-Guided Clustering and Spatio-Temporal Consistency ID Re-Assignment | CVPRW 23 Paper, GitHub |
| 15) Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing | ArXiv 21 |
| 16) Semi-Supervised Crowd Counting With Spatial Temporal Consistency and Pseudo-Label Filter | TCSVT 23 Paper |
| 17) Spatio-temporal Consistency and Hierarchical Matching for Multi-Target Multi-Camera Vehicle Tracking | CVPRW 19 Paper |
| 18) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (-) | ArXiv 23 |
| 19) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM (-) | ArXiv 24 |
| 20) MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask | ArXiv 23 |
11 Prompt Engineering
| Paper | Link |
|---|---|
| 1) RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models | ArXiv 24, GitHub, Project |
| 2) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs | ArXiv 24, GitHub |
| 3) LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models | TMLR 23 Paper, GitHub |
| 4) LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts | ICLR 24 Paper, GitHub |
| 5) Progressive Text-to-Image Diffusion with Soft Latent Direction | ArXiv 23 |
| 6) Self-correcting LLM-controlled Diffusion Models | CVPR 24 Paper, GitHub |
| 7) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation | MM 23 Paper |
| 8) LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | NeurIPS 23 Paper, GitHub |
| 9) Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition | ArXiv 24, GitHub |
| 10) InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions | ArXiv 23, GitHub |
| 11) Controllable Text-to-Image Generation with GPT-4 | ArXiv 23 |
| 12) LLM-grounded Video Diffusion Models | ICLR 24 Paper |
| 13) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning | ArXiv 23 |
| 14) FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax | ArXiv 23, Github, Project |
| 15) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM | ArXiv 24 |
| 16) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator | NeurIPS 23 Paper |
| 17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models | ArXiv 23 |
| 18) MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation | ArXiv 23 |
| 19) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | ArXiv 23 |
| 20) Multimodal Procedural Planning via Dual Text-Image Prompting | ArXiv 23, Github |
| 21) InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists | ICLR 24 Paper, Github |
| 22) DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback | ArXiv 23 |
| 23) TaleCrafter: Interactive Story Visualization with Multiple Characters | SIGGRAPH Asia 23 Paper |
| 24) Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis | ArXiv 23, Github |
| 25) COLE: A Hierarchical Generation Framework for Graphic Design | ArXiv 23 |
| 26) Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision | ArXiv 23 |
| 27) Vlogger: Make Your Dream A Vlog | CVPR 24 Paper, Github |
| 28) GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting | Paper |
| 29) MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion | ArXiv 24 |
Recaption
| Paper | Link |
|---|---|
| 1) LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models | ArXiv 23, GitHub |
| 2) Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation | ArXiv 23, GitHub |
| 3) CoCa: Contrastive Captioners are Image-Text Foundation Models | ArXiv 22, Github |
| 4) CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion | ArXiv 24 |
| 5) VideoChat: Chat-Centric Video Understanding | CVPR 24 Paper, Github |
| 6) De-Diffusion Makes Text a Strong Cross-Modal Interface | ArXiv 23 |
| 7) HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ArXiv 23 |
| 8) SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data | ArXiv 24 |
| 9) LLMGA: Multimodal Large Language Model based Generation Assistant | ArXiv 23, Github |
| 10) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | ArXiv 24, Github |
| 11) MyVLM: Personalizing VLMs for User-Specific Queries | ArXiv 24 |
| 12) A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation | ArXiv 23, Github |
| 13) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs(-) | ArXiv 24, Github |
| 14) FlexCap: Generating Rich, Localized, and Flexible Captions in Images | ArXiv 24 |
| 15) Video ReCap: Recursive Captioning of Hour-Long Videos | ArXiv 24, Github |
| 16) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ICML 22, Github |
| 17) PromptCap: Prompt-Guided Task-Aware Image Captioning | ICCV 23, Github |
| 18) CIC: A framework for Culturally-aware Image Captioning | ArXiv 24 |
| 19) Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion | ArXiv 24 |
| 20) FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions | WACV 24, Github |
12 Security
| Paper | Link |
|---|---|
| 1) BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | NeurIPS 23 Paper, Github |
| 2) LIMA: Less Is More for Alignment | NeurIPS 23 Paper |
| 3) Jailbroken: How Does LLM Safety Training Fail? | NeurIPS 23 Paper |
| 4) Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | CVPR 23 Paper |
| 5) Stable Bias: Evaluating Societal Representations in Diffusion Models | NeurIPS 23 Paper |
| 6) Ablating concepts in text-to-image diffusion models | ICCV 23 Paper |
| 7) Diffusion art or digital forgery? investigating data replication in diffusion models | ICCV 23 Paper, Project |
| 8) Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks | ICCV 20 Paper |
| 9) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks | ICML 20 Paper |
| 10) A pilot study of query-free adversarial attack against stable diffusion | ICCV 23 Paper |
| 11) Interpretable-Through-Prototypes Deepfake Detection for Diffusion Models | ICCV 23 Paper |
| 12) Erasing Concepts from Diffusion Models | ICCV 23 Paper, Project |
| 13) Ablating Concepts in Text-to-Image Diffusion Models | ICCV 23 Paper, Project |
| 14) BEAVERTAILS: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | NeurIPS 23 Paper, Project |
| 15) Stable Bias: Evaluating Societal Representations in Diffusion Models | NeurIPS 23 Paper |
| 16) Threat Model-Agnostic Adversarial Defense using Diffusion Models | Paper |
| 17) How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions? | Paper, Github |
| 18) Differentially Private Diffusion Models Generate Useful Synthetic Images | Paper |
| 19) Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models | SIGSAC 23 Paper, Github |
| 20) Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models | Paper, Github |
| 21) Unified Concept Editing in Diffusion Models | WACV 24 Paper, Project |
| 22) Diffusion Model Alignment Using Direct Preference Optimization | ArXiv 23 |
| 23) RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment | TMLR 23 Paper , Github |
| 24) Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation | Paper, Github, Project |
13 World Model
| Paper | Link |
|---|---|
| 1) NExT-GPT: Any-to-Any Multimodal LLM | ArXiv 23, GitHub |
14 Video Compression
| Paper | Link |
|---|---|
| 1) H.261: Video codec for audiovisual services at p x 64 kbit/s | Paper |
| 2) H.262: Information technology - Generic coding of moving pictures and associated audio information: Video | Paper |
| 3) H.263: Video coding for low bit rate communication | Paper |
| 4) H.264: Overview of the H.264/AVC video coding standard | Paper |
| 5) H.265: Overview of the High Efficiency Video Coding (HEVC) Standard | Paper |
| 6) H.266: Overview of the Versatile Video Coding (VVC) Standard and its Applications | Paper |
| 7) DVC: An End-to-end Deep Video Compression Framework | CVPR 19 Paper, GitHub |
| 8) OpenDVC: An Open Source Implementation of the DVC Video Compression Method | Paper, GitHub |
| 9) HLVC: Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement | CVPR 20 Paper, Github |
| 10) RLVC: Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model | J-STSP 21 Paper, Github |
| 11) PLVC: Perceptual Learned Video Compression with Recurrent Conditional GAN | IJCAI 22 Paper, Github |
| 12) ALVC: Advancing Learned Video Compression with In-loop Frame Prediction | T-CSVT 22 Paper, Github |
| 13) DCVC: Deep Contextual Video Compression | NeurIPS 21 Paper, Github |
| 14) DCVC-TCM: Temporal Context Mining for Learned Video Compression | TM 22 Paper, Github |
| 15) DCVC-HEM: Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression | MM 22 Paper, Github |
| 16) DCVC-DC: Neural Video Compression with Diverse Contexts | CVPR 23 Paper, Github |
| 17) DCVC-FM: Neural Video Compression with Feature Modulation | CVPR 24 Paper, Github |
| 18) SSF: Scale-Space Flow for End-to-End Optimized Video Compression | CVPR 20 Paper, Github |
15 Mamba
15.1 Theoretical Foundations and Model Architecture

| Paper | Link |
| --- | --- |
| 1) Mamba: Linear-Time Sequence Modeling with Selective State Spaces | ArXiv 23, GitHub |
| 2) Efficiently Modeling Long Sequences with Structured State Spaces | ICLR 22 Paper, GitHub |
| 3) Modeling Sequences with Structured State Spaces | Paper |
| 4) Long Range Language Modeling via Gated State Spaces | ArXiv 22, GitHub |
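The S4/Mamba papers above all build on the same discrete linear state-space recurrence, h_k = A·h_{k-1} + B·u_k with readout y_k = C·h_k. A minimal toy sketch of that (non-selective) scan, with made-up scalar parameters rather than anything taken from the papers:

```python
# Toy discrete state-space model (SSM) scan, the recurrence behind
# S4-style models:  h_k = A*h_{k-1} + B*u_k,   y_k = C*h_k.
# Scalar state for clarity; real models use structured matrices and,
# in Mamba, input-dependent ("selective") A, B, C.

def ssm_scan(u, A, B, C, h0=0.0):
    """Run the linear recurrence over an input sequence u."""
    h, ys = h0, []
    for u_k in u:
        h = A * h + B * u_k   # state update
        ys.append(C * h)      # readout
    return ys

# With 0 < A < 1 the state is an exponentially decaying memory of the past:
ys = ssm_scan([1.0, 0.0, 0.0], A=0.5, B=1.0, C=2.0)  # [2.0, 1.0, 0.5]
```

Mamba's contribution is making these parameters input-dependent and computing the scan efficiently on GPU; the underlying recurrence stays this simple.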
15.2 Image Generation and Visual Applications

| Paper | Link |
| --- | --- |
| 1) Diffusion Models Without Attention | ArXiv 23 |
| 2) Pan-Mamba: Effective Pan-Sharpening with State Space Model | ArXiv 24, GitHub |
| 3) Pretraining Without Attention | ArXiv 22, GitHub |
| 4) Block-State Transformers | NeurIPS 23 Paper |
| 5) Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model | ArXiv 24, GitHub |
| 6) VMamba: Visual State Space Model | ArXiv 24, GitHub |
| 7) ZigMa: Zigzag Mamba Diffusion Model | ArXiv 24, GitHub |
| 8) MambaVision: A Hybrid Mamba-Transformer Vision Backbone | ArXiv 24, GitHub |
15.3 Video Processing and Understanding

| Paper | Link |
| --- | --- |
| 1) Long Movie Clip Classification with State-Space Video Models | ECCV 22 Paper, GitHub |
| 2) Selective Structured State-Spaces for Long-Form Video Understanding | CVPR 23 Paper |
| 3) Efficient Movie Scene Detection Using State-Space Transformers | CVPR 23 Paper, GitHub |
| 4) VideoMamba: State Space Model for Efficient Video Understanding | Paper, GitHub |
15.4 Medical Image Processing

| Paper | Link |
| --- | --- |
| 1) Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining | ArXiv 24, GitHub |
| 2) MambaIR: A Simple Baseline for Image Restoration with State-Space Model | ArXiv 24, GitHub |
| 3) VM-UNet: Vision Mamba UNet for Medical Image Segmentation | ArXiv 24, GitHub |
16 Existing High-Quality Resources

| Resources | Link |
| --- | --- |
| 1) Datawhale - AI Video Generation Learning | Feishu doc |
| 2) A Survey on Generative Diffusion Model | TKDE 24 Paper, GitHub |
| 3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models | ArXiv 23, GitHub |
| 4) Awesome-Text-To-Video: A Survey on Text-to-Video Generation/Synthesis | GitHub |
| 5) video-generation-survey: A reading list of video generation | GitHub |
| 6) Awesome-Video-Diffusion | GitHub |
| 7) Video Generation Task in Papers With Code | Task |
| 8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | ArXiv 24, GitHub |
| 9) Open-Sora-Plan (PKU-YuanGroup) | GitHub |
| 10) State of the Art on Diffusion Models for Visual Computing | Paper |
| 11) Diffusion Models: A Comprehensive Survey of Methods and Applications | CSUR 24 Paper, GitHub |
| 12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable | Paper |
| 13) On the Design Fundamentals of Diffusion Models: A Survey | Paper |
| 14) Efficient Diffusion Models for Vision: A Survey | Paper |
| 15) Text-to-Image Diffusion Models in Generative AI: A Survey | Paper |
| 16) Awesome-Diffusion-Transformers | GitHub, Project |
| 17) Open-Sora (HPC-AI Tech) | GitHub, Blog |
| 18) LAVIS - A Library for Language-Vision Intelligence | ACL 23 Paper, GitHub, Project |
| 19) OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference | GitHub |
| 20) Awesome-Long-Context | GitHub1, GitHub2 |
| 21) Lite-Sora | GitHub |
| 22) Mira: A Mini-step Towards Sora-like Long Video Generation | GitHub, Project |
17 Efficient Training
17.1 Parallelism-based Approach
17.1.1 Data Parallelism (DP)

| Paper | Link |
| --- | --- |
| 1) A bridging model for parallel computation | Paper |
| 2) PyTorch Distributed: Experiences on Accelerating Data Parallel Training | VLDB 20 Paper |
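A minimal sketch of the idea behind DDP-style data parallelism (toy functions, not the torch.distributed API): each replica computes the gradient on its own shard of the batch, gradients are averaged (the all-reduce step), and every replica applies the identical update.

```python
# Toy data parallelism: each "replica" computes the gradient of a linear
# least-squares loss on its own shard of the batch, gradients are averaged
# (the all-reduce), and the shared weight gets one identical update.
# Illustrative sketch only -- not a real distributed setup.

def local_grad(w, shard):
    # d/dw of mean 0.5*(w*x - y)^2 over the shard
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def dp_step(w, shards, lr=0.1):
    grads = [local_grad(w, s) for s in shards]  # per-replica gradients
    g = sum(grads) / len(grads)                 # all-reduce: average
    return w - lr * g

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
# With equal shard sizes, 2-way data parallelism matches single-replica SGD:
w_dp = dp_step(1.0, [batch[:2], batch[2:]])
w_single = dp_step(1.0, [batch])
```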
17.1.2 Model Parallelism (MP)

| Paper | Link |
| --- | --- |
| 1) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | ArXiv 19 Paper |
| 2) TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models | PMLR 21 Paper |
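Megatron-style tensor parallelism can be shown in miniature: split a linear layer's output columns across "devices", let each compute its slice independently, and concatenate the slices. A pure-Python sketch with illustrative names (real implementations pair this with row-parallel layers and communication ops):

```python
# Column parallelism: the weight columns of a linear layer are split
# across two "devices"; each computes its part of the output, and the
# parts are concatenated -- no communication needed for this layer.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def linear(cols, x):
    """One output per weight column: y_j = <w_j, x>."""
    return [dot(w, x) for w in cols]

cols = [[1, 0], [0, 1], [1, 1], [2, -1]]   # 4 output units, 2 inputs
x = [3, 5]

y_full = linear(cols, x)                            # single-device result
y_par = linear(cols[:2], x) + linear(cols[2:], x)   # 2-way column parallel
```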
17.1.3 Pipeline Parallelism (PP)

| Paper | Link |
| --- | --- |
| 1) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS 19 Paper |
| 2) PipeDream: generalized pipeline parallelism for DNN training | SOSP 19 Paper |
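The benefit of GPipe-style micro-batching is easy to quantify: with s pipeline stages and m micro-batches, a sweep takes s + m - 1 ticks instead of s * m, and the idle "bubble" fraction shrinks as m grows. A back-of-envelope sketch under a simplified timing model (equal per-stage time, communication ignored):

```python
# GPipe-style schedule arithmetic: a pipelined sweep over m micro-batches
# through s stages takes s + m - 1 ticks; the (s - 1)-tick ramp-up and
# ramp-down form the pipeline "bubble".

def pipeline_ticks(stages, micro_batches):
    return stages + micro_batches - 1

def bubble_fraction(stages, micro_batches):
    return (stages - 1) / pipeline_ticks(stages, micro_batches)

# 4 stages, 8 micro-batches: 11 ticks instead of 32; bubble = 3/11.
# More micro-batches shrink the bubble, which is GPipe's main lever.
```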
17.1.4 Generalized Parallelism (GP)

| Paper | Link |
| --- | --- |
| 1) Mesh-TensorFlow: Deep Learning for Supercomputers | ArXiv 18 Paper |
| 2) Beyond Data and Model Parallelism for Deep Neural Networks | MLSys 19 Paper |
17.1.5 ZeRO Parallelism (ZP)

| Paper | Link |
| --- | --- |
| 1) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | ArXiv 20 |
| 2) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters | ACM 20 Paper |
| 3) ZeRO-Offload: Democratizing Billion-Scale Model Training | ArXiv 21 |
| 4) PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | ArXiv 23 |
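The ZeRO paper's mixed-precision accounting makes the stages concrete: roughly 2 bytes of fp16 weights + 2 bytes of fp16 gradients + 12 bytes of fp32 optimizer state (master weights, momentum, variance) per parameter, with each stage sharding one more of those across the data-parallel group. A sketch of that arithmetic (the helper function is illustrative; the numbers follow the paper's example):

```python
# Rough per-parameter memory for mixed-precision Adam under ZeRO.

def bytes_per_param(stage, n_gpus):
    p, g, o = 2.0, 2.0, 12.0
    if stage >= 3:
        p /= n_gpus   # stage 3: shard the fp16 parameters too
    if stage >= 2:
        g /= n_gpus   # stage 2: shard the gradients
    if stage >= 1:
        o /= n_gpus   # stage 1: shard the optimizer state
    return p + g + o

# 7.5B parameters on 64 GPUs: 16 B/param -> 120 GB per GPU without ZeRO,
# ~1.9 GB per GPU with ZeRO stage 3.
def per_gpu_gb(stage, params=7.5e9, n_gpus=64):
    return params * bytes_per_param(stage, n_gpus) / 1e9
```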
17.2 Non-parallelism-based Approach
17.2.1 Reducing Activation Memory

| Paper | Link |
| --- | --- |
| 1) Gist: Efficient Data Encoding for Deep Neural Network Training | IEEE 18 Paper |
| 2) Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization | MLSys 20 Paper |
| 3) Training Deep Nets with Sublinear Memory Cost | ArXiv 16 Paper |
| 4) Superneurons: dynamic GPU memory management for training deep neural networks | ACM 18 Paper |
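Gradient checkpointing (the "Sublinear Memory Cost" paper above) trades compute for memory: keep one stored activation per segment boundary and recompute inside the active segment during backward. A toy cost model (illustrative, ignoring constant factors and the extra forward pass) shows why segments of size about sqrt(n) are the sweet spot:

```python
# Toy peak-memory model for gradient checkpointing: store one activation
# per segment boundary plus the activations of the segment currently
# being recomputed during the backward pass.

import math

def peak_stored_activations(n_layers, seg_len):
    n_segments = math.ceil(n_layers / seg_len)
    return n_segments + seg_len

def best_segment(n_layers):
    return min(range(1, n_layers + 1),
               key=lambda s: peak_stored_activations(n_layers, s))

# 100 layers: ~100 stored activations without checkpointing, but only
# 10 + 10 = 20 with sqrt(n)-sized segments -- the O(sqrt(n)) result.
```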
17.2.2 CPU-Offloading

| Paper | Link |
| --- | --- |
| 1) Training Large Neural Networks with Constant Memory using a New Execution Algorithm | ArXiv 20 Paper |
| 2) vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design | IEEE 16 Paper |
17.2.3 Memory Efficient Optimizer

| Paper | Link |
| --- | --- |
| 1) Adafactor: Adaptive Learning Rates with Sublinear Memory Cost | PMLR 18 Paper |
| 2) Memory-Efficient Adaptive Optimization for Large-Scale Learning | Paper |
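Adafactor's memory saving comes from storing only the row and column sums of the squared-gradient accumulator matrix and reconstructing it as a rank-1 outer product, cutting an n x m state to n + m numbers. A toy sketch of the factored estimate (the reconstruction only, not the full optimizer update):

```python
# Adafactor-style factored second moment: keep row sums r and column sums
# c of the accumulator V and reconstruct V ~= outer(r, c) / sum(r).

def factored_estimate(V):
    r = [sum(row) for row in V]            # row sums (length n)
    c = [sum(col) for col in zip(*V)]      # column sums (length m)
    total = sum(r)
    return [[ri * cj / total for cj in c] for ri in r]

# A rank-1 accumulator is reconstructed exactly:
V = [[1.0, 2.0], [2.0, 4.0]]
V_hat = factored_estimate(V)
```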
17.3 Novel Structure

| Paper | Link |
| --- | --- |
| 1) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | ArXiv 24, GitHub |
18 Efficient Inference
18.1 Reduce Sampling Steps
18.1.1 Continuous Steps

| Paper | Link |
| --- | --- |
| 1) Generative Modeling by Estimating Gradients of the Data Distribution | NeurIPS 19 Paper |
| 2) WaveGrad: Estimating Gradients for Waveform Generation | ArXiv 20 |
| 3) Noise Level Limited Sub-Modeling for Diffusion Probabilistic Vocoders | ICASSP 21 Paper |
| 4) Noise Estimation for Generative Diffusion Models | ArXiv 21 |
18.1.2 Fast Sampling

| Paper | Link |
| --- | --- |
| 1) Denoising Diffusion Implicit Models | ICLR 21 Paper |
| 2) DiffWave: A Versatile Diffusion Model for Audio Synthesis | ICLR 21 Paper |
| 3) On Fast Sampling of Diffusion Probabilistic Models | ArXiv 21 |
| 4) DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps | NeurIPS 22 Paper |
| 5) DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models | ArXiv 22 |
| 6) Fast Sampling of Diffusion Models with Exponential Integrator | ICLR 23 Paper |
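The fast samplers above share one mechanism: from the current noisy sample and a noise prediction, first reconstruct an estimate of x0, then jump deterministically to a much earlier timestep. A toy 1-D DDIM-style step (made-up numbers, no trained model) shows why a perfect noise prediction allows arbitrarily large jumps:

```python
# Deterministic DDIM-style update (eta = 0): reconstruct x0 from the
# noise prediction, then move to a less-noisy timestep. ab_t denotes
# alpha-bar at step t, with x_t = sqrt(ab_t)*x0 + sqrt(1-ab_t)*eps.

import math

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    x0_pred = (x_t - math.sqrt(1 - ab_t) * eps_pred) / math.sqrt(ab_t)
    return math.sqrt(ab_prev) * x0_pred + math.sqrt(1 - ab_prev) * eps_pred

x0, eps, ab_t = 0.7, -1.3, 0.25
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1 - ab_t) * eps   # forward process

# With the true eps, a single step to ab_prev = 1 recovers x0 -- the
# property that lets DDIM/DPM-Solver take few, large steps.
x_rec = ddim_step(x_t, eps, ab_t, 1.0)
```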
18.1.3 Step Distillation

| Paper | Link |
| --- | --- |
| 1) On Distillation of Guided Diffusion Models | CVPR 23 Paper |
| 2) Progressive Distillation for Fast Sampling of Diffusion Models | ICLR 22 Paper |
| 3) SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds | NeurIPS 23 Paper |
| 4) Tackling the Generative Learning Trilemma with Denoising Diffusion GANs | ICLR 22 Paper |
18.2 Optimizing Inference
18.2.1 Low-bit Quantization

| Paper | Link |
| --- | --- |
| 1) Q-Diffusion: Quantizing Diffusion Models | CVPR 23 Paper |
| 2) Q-DM: An Efficient Low-bit Quantized Diffusion Model | NeurIPS 23 Paper |
| 3) Temporal Dynamic Quantization for Diffusion Models | NeurIPS 23 Paper |
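Plain uniform post-training quantization is the baseline these papers improve on: map weights to signed 8-bit codes with a single scale, dequantize, and accept a reconstruction error of at most half a quantization step. A minimal sketch with illustrative helper names (the papers above add diffusion-aware calibration across timesteps):

```python
# Uniform symmetric quantization: one shared scale maps floats to signed
# 8-bit integer codes; dequantization error is bounded by scale / 2.

def quantize(xs, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1                  # 127 for int8
    scale = max(abs(x) for x in xs) / qmax
    q = [round(x / scale) for x in xs]            # integer codes
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

xs = [0.05, -0.4, 1.0, -1.27]
q, scale = quantize(xs)
x_hat = dequantize(q, scale)
```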
18.2.2 Parallel/Sparse Inference

| Paper | Link |
| --- | --- |
| 1) DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models | CVPR 24 Paper |
| 2) Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models | NeurIPS 22 Paper |
| 3) PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models | ArXiv 24 |
Citation

If this project is helpful to your work, please cite it using the following format:

@misc{minisora,
  title={MiniSora},
  author={MiniSora Community},
  url={https://github.com/mini-sora/minisora},
  year={2024}
}

@misc{minisora-survey,
  title={From DDPM to Sora: A Survey of Video Generation Models Based on Diffusion Models},
  author={MiniSora Community Survey Team},
  url={https://github.com/mini-sora/minisora},
  year={2024}
}
MiniSora Community WeChat Group
Star History
How to Contribute to the MiniSora Community
We greatly appreciate your contributions to the MiniSora open-source community and your help in making it even better!
For more details, please refer to the Contribution Guide.
Community Contributors
FAQ

