minisora
MiniSora is a community-driven open-source project that explores implementation paths for, and future directions of, OpenAI's Sora video generation model. Since Sora's technical details have not been fully disclosed, MiniSora aims to lower the barrier to video generation research by reproducing the core related papers (such as DiT) and mapping out the technology's evolution (from DDPM to Sora).
The project targets a real pain point: ordinary developers and researchers can rarely access or reproduce state-of-the-art video generation models. It therefore sets concrete reproduction goals: stay "GPU-friendly," so that training and inference can run on consumer GPUs (such as the RTX 4090) or a small number of professional cards, and stay efficient, generating 3-10 second, 480p videos within a reasonably short training time. The community also hosts regular roundtables with in-depth readings of cutting-edge papers such as Stable Diffusion 3 and Movie Gen.
MiniSora is a good fit for developers and researchers interested in AI video generation, and for students who want a deeper understanding of diffusion models. Its distinctive technical highlight is an efficient reproduction of the DiT architecture built on the XTuner framework, with technical support and compute resources contributed by core developers. By joining MiniSora you gain a systematic technical survey and a community of like-minded collaborators advancing open-source video generation.
Use Case
A university AI lab's research team is working to reproduce Sora's core architecture, exploring how video generation models can be built under limited compute.
Without minisora
- Steep reproduction barrier: the team must piece together Sora-related papers (such as DiT) and code from scratch, with no systematic technical roadmap, leaving the research direction vague and progress slow.
- Limited compute: the original Sora-style architecture has punishing VRAM requirements; the lab's few RTX 4090s or single A100 cannot support full training, forcing experiments to be abandoned.
- Black-box technical details: the key improvements along the path from DDPM to Sora (such as spacetime patchification and Transformer optimizations) lack detailed explanations, so members resort to blind trial and error.
- Inefficient collaboration: members work in isolation without a shared code baseline or regular in-depth discussion, reinventing wheels and failing to accumulate reusable results.
With minisora
- Clear roadmap: minisora provides a complete path from theory survey to code reproduction (e.g., the "From DDPM to Sora" survey), letting the team quickly zero in on key building blocks such as DiT.
- Lightweight adaptation: with the minisora-DiT project, the team uses XTuner to run long-sequence training on 2 A100s or even consumer GPUs, dramatically lowering the hardware bar.
- In-depth technical breakdowns: through community roundtables and close paper readings (such as the Stable Diffusion 3 walkthrough), members quickly grasp core mechanisms such as MM-DiT and avoid misreadings.
- Open-source collaboration: the team plugs directly into minisora's contributor network and reuses existing reproduction code, cutting a months-long exploratory phase down to weeks.
By building a low-barrier, highly collaborative open-source ecosystem, minisora lets ordinary researchers reach the frontier of video generation with limited resources.
Runtime Requirements
- Recommended for training: 8x A100 (80G), 8x A6000 (48G), or RTX 4090 (24G)
- MiniSora-DiT reproduction: 2x A100
- Goal: GPU-friendly, supporting low-VRAM setups

Quick Start
MiniSora Community
English | 简体中文
👋 Join us on WeChat
The MiniSora open-source community is a community-driven initiative organized spontaneously by its members. The MiniSora community aims to explore Sora's implementation path and future directions:
- Hold regular roundtable discussions with the Sora team and the community to explore possibilities.
- Dig into existing technical paths for video generation.
- Lead the reproduction of papers and research related to Sora, such as DiT (MiniSora-DiT).
- Conduct a comprehensive review of Sora-related technology and its implementations, i.e., "From DDPM to Sora: A Survey of Video Generation Models Based on Diffusion Models".
Hot News
- OpenAI Sora is coming soon!
- Movie Gen: A Cast of Media Foundation Models
- Stable Diffusion 3 (MM-DiT): Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- MiniSora-DiT: Reproducing the DiT Paper with XTuner
- An Introduction to MiniSora and the Latest Progress on Reproducing Sora
MiniSora Community Reproduction Group
MiniSora's Goals for Reproducing Sora
- GPU-friendly: ideally, low requirements on GPU memory size and GPU count, for example the ability to train and run inference on compute comparable to 8x A100 80G, 8x A6000 48G, or RTX 4090 24G cards.
- Training efficiency: good results without requiring an extended training time.
- Inference efficiency: no need for long or high-resolution output at inference time; acceptable parameters are 3-10 seconds in length and 480p resolution.
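To make the inference goal concrete, the back-of-the-envelope sketch below estimates the transformer sequence length such a clip implies for a DiT-style latent model. The frame rate, VAE downsampling factor, and patch size used here are illustrative assumptions, not MiniSora's actual settings:

```python
def dit_token_count(seconds, height, width, fps=8,
                    vae_downsample=8, patch=2):
    """Rough sequence-length estimate for a DiT-style latent video model.

    Assumed pipeline: a VAE downsamples each frame spatially by
    `vae_downsample`, then the latent is split into `patch` x `patch`
    tokens per frame. All settings are illustrative assumptions.
    """
    frames = seconds * fps
    lat_h, lat_w = height // vae_downsample, width // vae_downsample
    tokens_per_frame = (lat_h // patch) * (lat_w // patch)
    return frames * tokens_per_frame

# The goals above target 3-10 s clips at 480p (854x480):
print(dit_token_count(3, 480, 854))   # 38160 tokens
print(dit_token_count(10, 480, 854))  # 127200 tokens
```

Even under these modest assumptions, a 10-second 480p clip is on the order of 10^5 tokens per sample, which is why long-sequence training support (such as XTuner's) matters for reproduction.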
MiniSora-DiT: Reproducing the DiT Paper with XTuner
https://github.com/mini-sora/minisora-DiT
Requirements
We are recruiting MiniSora community contributors to reproduce DiT using XTuner.
We hope community members have the following background:
- Familiarity with the OpenMMLab MMEngine mechanism.
- Familiarity with DiT.
Background
- DiT shares authors with Sora.
- XTuner has the core technology to efficiently train sequences up to 1000K in length.
Support
Recent Roundtable Discussions
Stable Diffusion 3 Paper Reading: MM-DiT
Speaker: MMagic core contributors
Live: December 3, 20:00
Highlights: MMagic core contributors walk through the Stable Diffusion 3 paper, covering its architectural details and design principles.
Slides: Feishu link
Highlights from Past Discussions
Night Talk with Sora: A Video Diffusion Overview
Zhihu notes: A Survey on Generative Diffusion Models: An Overview of Generative Diffusion Models
Paper Reading Program
Technical report: Video Generation Models as World Simulators
Stable Cascade (ICLR 24 paper): Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models
Updating...
Call for Speakers
Related Work
- 01 Diffusion Models
- 02 Diffusion Transformer
- 03 Baseline Video Generation Models
- 04 Diffusion UNet
- 05 Video Generation
- 06 Dataset
- 07 Patchifying Methods
- 08 Long-context
- 09 Audio Related Resource
- 10 Consistency
- 11 Prompt Engineering
- 12 Security
- 13 World Model
- 14 Video Compression
- 15 Mamba
- 16 Existing High-quality Resources
- 17 Efficient Training
- 18 Efficient Inference
01 Diffusion Models
| Paper | Link |
|---|---|
| 1) Guided-Diffusion: Diffusion Models Beat GANs on Image Synthesis | NeurIPS 21 Paper, GitHub |
| 2) Latent Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models | CVPR 22 Paper, GitHub |
| 3) EDM: Elucidating the Design Space of Diffusion-Based Generative Models | NeurIPS 22 Paper, GitHub |
| 4) DDPM: Denoising Diffusion Probabilistic Models | NeurIPS 20 Paper, GitHub |
| 5) DDIM: Denoising Diffusion Implicit Models | ICLR 21 Paper, GitHub |
| 6) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations | ICLR 21 Paper, GitHub, Blog |
| 7) Stable Cascade: Würstchen: An efficient architecture for large-scale text-to-image diffusion models | ICLR 24 Paper, GitHub, Blog |
| 8) Diffusion Models in Vision: A Survey | TPAMI 23 Paper, GitHub |
| 9) Improved DDPM: Improved Denoising Diffusion Probabilistic Models | ICML 21 Paper, Github |
| 10) Classifier-free diffusion guidance | NIPS 21 Paper |
| 11) Glide: Towards photorealistic image generation and editing with text-guided diffusion models | Paper, Github |
| 12) VQ-DDM: Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation | CVPR 22 Paper, Github |
| 13) Diffusion Models for Medical Anomaly Detection | Paper, Github |
| 14) Generation of Anonymous Chest Radiographs Using Latent Diffusion Models for Training Thoracic Abnormality Classification Systems | Paper |
| 15) DiffusionDet: Diffusion Model for Object Detection | ICCV 23 Paper, Github |
| 16) Label-efficient semantic segmentation with diffusion models | ICLR 22 Paper, Github, Project |
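As a reminder of the mechanism the DDPM line of work above builds on, here is a minimal NumPy sketch of the closed-form forward (noising) process from the DDPM paper; the schedule values mirror the paper's linear schedule, everything else is a toy illustration:

```python
import numpy as np

def ddpm_forward(x0, t, betas, rng):
    """Closed-form sample from q(x_t | x_0) in DDPM (Ho et al., 2020):
        x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I),
    where abar_t is the cumulative product of (1 - beta_s) up to step t.
    """
    alpha_bar = np.cumprod(1.0 - betas)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Linear beta schedule with 1000 steps, as in the original paper.
betas = np.linspace(1e-4, 0.02, 1000)
rng = np.random.default_rng(0)
xt, eps = ddpm_forward(np.ones((4, 4)), 999, betas, rng)
```

By the last timestep almost all signal is destroyed (the cumulative product is tiny), which is what lets sampling start from pure Gaussian noise.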
02 Diffusion Transformer
| Paper | Link |
|---|---|
| 1) UViT: All are Worth Words: A ViT Backbone for Diffusion Models | CVPR 23 Paper, GitHub, ModelScope |
| 2) DiT: Scalable Diffusion Models with Transformers | ICCV 23 Paper, GitHub, Project, ModelScope |
| 3) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | ArXiv 23, GitHub, ModelScope |
| 4) FiT: Flexible Vision Transformer for Diffusion Model | ArXiv 24, GitHub |
| 5) k-diffusion: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers | ArXiv 24, GitHub |
| 6) Large-DiT: Large Diffusion Transformer | GitHub |
| 7) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks | ArXiv 24, GitHub |
| 8) Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis | Paper, Blog |
| 9) PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | ArXiv 24, Project |
| 10) PIXART-α: Fast Training of Diffusion Transformer for Photorealistic Text-To-Image Synthesis | ArXiv 23, GitHub, ModelScope |
| 11) PIXART-δ: Fast and Controllable Image Generation With Latent Consistency Model | ArXiv 24 |
| 12) Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers | ArXiv 24, GitHub |
| 13) DDM: Deconstructing Denoising Diffusion Models for Self-Supervised Learning | ArXiv 24 |
| 14) Autoregressive Image Generation without Vector Quantization | ArXiv 24, GitHub |
| 15) Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | ArXiv 24 |
| 16) Scaling Diffusion Language Models via Adaptation from Autoregressive Models | ArXiv 24 |
| 17) Large Language Diffusion Models | ArXiv 25 |
03 Baseline Video Generation Models
| Paper | Link |
|---|---|
| 1) ViViT: A Video Vision Transformer | ICCV 21 Paper, GitHub |
| 2) VideoLDM: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | CVPR 23 Paper |
| 3) DiT: Scalable Diffusion Models with Transformers | ICCV 23 Paper, Github, Project, ModelScope |
| 4) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators | ArXiv 23, GitHub |
| 5) Latte: Latent Diffusion Transformer for Video Generation | ArXiv 24, GitHub, Project, ModelScope |
04 Diffusion UNet
| Paper | Link |
|---|---|
| 1) Taming Transformers for High-Resolution Image Synthesis | CVPR 21 Paper, GitHub, Project |
| 2) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | ArXiv 24, Github |
05 Video Generation
| Paper | Link |
|---|---|
| 1) Animatediff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning | ICLR 24 Paper, GitHub, ModelScope |
| 2) I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models | ArXiv 23, GitHub, ModelScope |
| 3) Imagen Video: High Definition Video Generation with Diffusion Models | ArXiv 22 |
| 4) MoCoGAN: Decomposing Motion and Content for Video Generation | CVPR 18 Paper |
| 5) Adversarial Video Generation on Complex Datasets | Paper |
| 6) W.A.L.T: Photorealistic Video Generation with Diffusion Models | ArXiv 23, Project |
| 7) VideoGPT: Video Generation using VQ-VAE and Transformers | ArXiv 21, GitHub |
| 8) Video Diffusion Models | ArXiv 22, GitHub, Project |
| 9) MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | NeurIPS 22 Paper, GitHub, Project, Blog |
| 10) VideoPoet: A Large Language Model for Zero-Shot Video Generation | ArXiv 23, Project, Blog |
| 11) MAGVIT: Masked Generative Video Transformer | CVPR 23 Paper, GitHub, Project, Colab |
| 12) EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions | ArXiv 24, GitHub, Project |
| 13) SimDA: Simple Diffusion Adapter for Efficient Video Generation | Paper, GitHub, Project |
| 14) StableVideo: Text-driven Consistency-aware Diffusion Video Editing | ICCV 23 Paper, GitHub, Project |
| 15) SVD: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | Paper, GitHub |
| 16) ADD: Adversarial Diffusion Distillation | Paper, GitHub |
| 17) GenTron: Diffusion Transformers for Image and Video Generation | CVPR 24 Paper, Project |
| 18) LFDM: Conditional Image-to-Video Generation with Latent Flow Diffusion Models | CVPR 23 Paper, GitHub |
| 19) MotionDirector: Motion Customization of Text-to-Video Diffusion Models | ArXiv 23, GitHub |
| 20) TGAN-ODE: Latent Neural Differential Equations for Video Generation | Paper, GitHub |
| 21) VideoCrafter1: Open Diffusion Models for High-Quality Video Generation | ArXiv 23, GitHub |
| 22) VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models | ArXiv 24, GitHub |
| 23) LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation | ArXiv 22, GitHub |
| 24) LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models | ArXiv 23, GitHub, Project |
| 25) PYoCo: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | ICCV 23 Paper, Project |
| 26) VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | CVPR 23 Paper |
| 27) Movie Gen: A Cast of Media Foundation Models | Paper, Project |
| 28) Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model | ArXiv 25, Project |
06 Dataset
6.1 Public Datasets
| Dataset Name - Paper | Link |
|---|---|
| 1) Panda-70M - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers, 70M Clips, 720P, Downloadable | CVPR 24 Paper, Github, Project, ModelScope |
| 2) InternVid-10M - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, 10M Clips, 720P, Downloadable | ArXiv 24, Github |
| 3) CelebV-Text - CelebV-Text: A Large-Scale Facial Text-Video Dataset, 70K Clips, 720P, Downloadable | CVPR 23 Paper, Github, Project |
| 4) HD-VG-130M - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation, 130M Clips, 720P, Downloadable | ArXiv 23, Github, Tool |
| 5) HD-VILA-100M - Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions, 100M Clips, 720P, Downloadable | CVPR 22 Paper, Github |
| 6) VideoCC - Learning Audio-Video Modalities from Image Captions, 10.3M Clips, 720P, Downloadable | ECCV 22 Paper, Github |
| 7) YT-Temporal-180M - MERLOT: Multimodal Neural Script Knowledge Models, 180M Clips, 480P, Downloadable | NeurIPS 21 Paper, Github, Project |
| 8) HowTo100M - HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, 136M Clips, 240P, Downloadable | ICCV 19 Paper, Github, Project |
| 9) UCF101 - UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, 13K Clips, 240P, Downloadable | CVPR 12 Paper, Project |
| 10) MSVD - Collecting Highly Parallel Data for Paraphrase Evaluation, 122K Clips, 240P, Downloadable | ACL 11 Paper, Project |
| 11) Fashion-Text2Video - A human video dataset with rich label and text annotations, 600 Videos, 480P, Downloadable | ArXiv 23, Project |
| 12) LAION-5B - A dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M, 5B Clips, Downloadable | NeurIPS 22 Paper, Project |
| 13) ActivityNet Captions - ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time, 20k Videos, Downloadable | ArXiv 17 Paper, Project |
| 14) MSR-VTT - A large-scale video benchmark for video understanding, 10k Clips, Downloadable | CVPR 16 Paper, Project |
| 15) The Cityscapes Dataset - Benchmark suite and evaluation server for pixel-level, instance-level, and panoptic semantic labeling, Downloadable | ArXiv 16 Paper, Project |
| 16) Youku-mPLUG - First open-source large-scale Chinese video-text dataset, Downloadable | ArXiv 23, Project, ModelScope |
| 17) VidProM - VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models, 6.69M, Downloadable | ArXiv 24, Github |
| 18) Pixabay100 - A video dataset collected from Pixabay, Downloadable | Github |
| 19) WebVid - Large-scale text-video dataset, containing 10 million video-text pairs scraped from stock footage sites | ArXiv 21, Project, ModelScope |
| 20) MiraData (Mini-Sora Data) - A Large-Scale Video Dataset with Long Durations and Structured Captions, 10M video-text pairs | Github, Project |
| 21) IDForge - A video dataset featuring scenes of people speaking, 300k Clips, Downloadable | ArXiv 24, Github |
6.2 Video Augmentation Methods
6.2.1 Basic Transformations
| Paper | Link |
|---|---|
| Three-stream CNNs for action recognition | PRL 17 Paper |
| Dynamic Hand Gesture Recognition Using Multi-direction 3D Convolutional Neural Networks | EL 19 Paper |
| Intra-clip Aggregation for Video Person Re-identification | ICIP 20 Paper |
| VideoMix: Rethinking Data Augmentation for Video Classification | CVPR 20 Paper |
| mixup: Beyond Empirical Risk Minimization | ICLR 17 Paper |
| CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features | ICCV 19 Paper |
| Video Salient Object Detection via Fully Convolutional Networks | ICIP 18 Paper |
| Illumination-Based Data Augmentation for Robust Background Subtraction | SKIMA 19 Paper |
| Image editing-based data augmentation for illumination-insensitive background subtraction | EIM 20 Paper |
6.2.2 Feature Space
| Paper | Link |
|---|---|
| Feature Re-Learning with Data Augmentation for Content-based Video Recommendation | ACM 18 Paper |
| GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | Trans 21 Paper |
6.2.3 GAN-based Augmentation
| Paper | Link |
|---|---|
| Deep Video-Based Performance Cloning | CVPR 18 Paper |
| Adversarial Action Data Augmentation for Similar Gesture Action Recognition | IJCNN 19 Paper |
| Self-Paced Video Data Augmentation by Generative Adversarial Networks with Insufficient Samples | MM 20 Paper |
| GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | Trans 20 Paper |
| Dynamic Facial Expression Generation on Hilbert Hypersphere With Conditional Wasserstein Generative Adversarial Nets | TPAMI 20 Paper |
| CrowdGAN: Identity-Free Interactive Crowd Video Generation and Beyond | TPAMI 22 Paper |
6.2.4 Encoder/Decoder Based
| Paper | Link |
|---|---|
| Rotationally-Temporally Consistent Novel View Synthesis of Human Performance Video | ECCV 20 Paper |
| Autoencoder-based Data Augmentation for Deepfake Detection | ACM 23 Paper |
6.2.5 Simulation
| Paper | Link |
|---|---|
| A data augmentation methodology for training machine/deep learning gait recognition algorithms | CVPR 16 Paper |
| ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications | IEEE 21 Paper |
| Mid-Air: A Multi-Modal Dataset for Extremely Low Altitude Drone Flights | CVPR 19 Paper |
| Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models | IJCV 19 Paper |
| Using synthetic data for person tracking under adverse weather conditions | IVC 21 Paper |
| Unlimited Road-scene Synthetic Annotation (URSA) Dataset | ITSC 18 Paper |
| SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data | CVPR 21 Paper |
| Universal Semantic Segmentation for Fisheye Urban Driving Images | SMC 20 Paper |
07 Patchifying Methods
| Paper | Link |
|---|---|
| 1) ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | CVPR 21 Paper, Github |
| 2) MAE: Masked Autoencoders Are Scalable Vision Learners | CVPR 22 Paper, Github |
| 3) ViViT: A Video Vision Transformer (-) | ICCV 21 Paper, GitHub |
| 4) DiT: Scalable Diffusion Models with Transformers (-) | ICCV 23 Paper, GitHub, Project, ModelScope |
| 5) U-ViT: All are Worth Words: A ViT Backbone for Diffusion Models (-) | CVPR 23 Paper, GitHub, ModelScope |
| 6) FlexiViT: One Model for All Patch Sizes | Paper, Github |
| 7) Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | ArXiv 23, Github |
| 8) VQ-VAE: Neural Discrete Representation Learning | Paper, Github |
| 9) VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | CVPR 21 Paper, Github |
| 10) LVT: Latent Video Transformer | Paper, Github |
| 11) VideoGPT: Video Generation using VQ-VAE and Transformers (-) | ArXiv 21, GitHub |
| 12) Predicting Video with VQVAE | ArXiv 21 |
| 13) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | ICLR 23 Paper, Github |
| 14) TATS: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | ECCV 22 Paper, Github |
| 15) MAGVIT: Masked Generative Video Transformer (-) | CVPR 23 Paper, GitHub, Project, Colab |
| 16) MagViT2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | ICLR 24 Paper, Github |
| 17) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) | ArXiv 23, Project, Blog |
| 18) CLIP: Learning Transferable Visual Models From Natural Language Supervision | CVPR 21 Paper, Github |
| 19) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ArXiv 22, Github |
| 20) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ArXiv 23, Github |
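The ViT-style patchifying step that runs through the list above can be sketched in a few lines. This is a minimal NumPy illustration of splitting an image into non-overlapping flattened patches, not any specific model's implementation:

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened to a vector: the ViT "16x16 words" step
    (a small p is used below purely for illustration)."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image dims must be divisible by p"
    # (H//p, p, W//p, p, C) -> (H//p, W//p, p, p, C) -> (N, p*p*C)
    x = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * C)

img = np.arange(4 * 4 * 3).reshape(4, 4, 3)
tokens = patchify(img, 2)
print(tokens.shape)  # (4, 12)
```

Video models in this list extend the same idea to spacetime patches, adding a temporal extent to each patch.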
08 Long-context
| Paper | Link |
|---|---|
| 1) World Model on Million-Length Video And Language With RingAttention | ArXiv 24, GitHub |
| 2) Ring Attention with Blockwise Transformers for Near-Infinite Context | ArXiv 23, GitHub |
| 3) Extending LLMs' Context Window with 100 Samples | ArXiv 24, GitHub |
| 4) Efficient Streaming Language Models with Attention Sinks | ICLR 24 Paper, GitHub |
| 5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey | Paper |
| 6) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | CVPR 24 Paper, GitHub, Project |
| 7) MemoryBank: Enhancing Large Language Models with Long-Term Memory | Paper, GitHub |
09 Audio Related Resource
| Paper | Link |
|---|---|
| 1) Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion | ArXiv 24, Github, Blog |
| 2) MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | CVPR 23 Paper, GitHub |
| 3) Pengi: An Audio Language Model for Audio Tasks | NeurIPS 23 Paper, GitHub |
| 4) Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset | NeurIPS 23 Paper, GitHub |
| 5) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | ArXiv 23, GitHub |
| 6) NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality | TPAMI 24 Paper, GitHub |
| 7) NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | ICLR 24 Paper, GitHub |
| 8) UniAudio: An Audio Foundation Model Toward Universal Audio Generation | ArXiv 23, GitHub |
| 9) Diffsound: Discrete Diffusion Model for Text-to-sound Generation | TASLP 22 Paper |
| 10) AudioGen: Textually Guided Audio Generation | ICLR 23 Paper, Project |
| 11) AudioLDM: Text-to-audio generation with latent diffusion models | ICML 23 Paper, GitHub, Project, Huggingface |
| 12) AudioLDM2: Learning Holistic Audio Generation with Self-supervised Pretraining | ArXiv 23, GitHub, Project, Huggingface |
| 13) Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | ICML 23 Paper, GitHub |
| 14) Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | ArXiv 23 |
| 15) TANGO: Text-to-audio generation using instruction-tuned LLM and latent diffusion model | ArXiv 23, GitHub, Project, Huggingface |
| 16) AudioLM: a Language Modeling Approach to Audio Generation | ArXiv 22 |
| 17) AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | ArXiv 23, GitHub |
| 18) MusicGen: Simple and Controllable Music Generation | NeurIPS 23 Paper, GitHub |
| 19) LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | ArXiv 23 |
| 20) Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners | CVPR 24 Paper |
| 21) Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | EMNLP 23 Paper |
| 22) Audio-Visual LLM for Video Understanding | ArXiv 23 |
| 23) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) | ArXiv 23, Project, Blog |
| 24) Movie Gen: A Cast of Media Foundation Models | Paper, Project |
10 Consistency
| Paper | Link |
|---|---|
| 1) Consistency Models | Paper, GitHub |
| 2) Improved Techniques for Training Consistency Models | ArXiv 23 |
| 3) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations (-) | ICLR 21 Paper, GitHub, Blog |
| 4) Improved Techniques for Training Score-Based Generative Models | NIPS 20 Paper, GitHub |
| 5) Generative Modeling by Estimating Gradients of the Data Distribution | NIPS 19 Paper, GitHub |
| 6) Maximum Likelihood Training of Score-Based Diffusion Models | NIPS 21 Paper, GitHub |
| 7) Layered Neural Atlases for Consistent Video Editing | TOG 21 Paper, GitHub, Project |
| 8) StableVideo: Text-driven Consistency-aware Diffusion Video Editing | ICCV 23 Paper, GitHub, Project |
| 9) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing | Paper, GitHub, Project |
| 10) Sora Generates Videos with Stunning Geometrical Consistency | Paper, GitHub, Project |
| 11) Efficient One-stage Video Object Detection by Exploiting Temporal Consistency | ECCV 22 Paper, GitHub |
| 12) Bootstrap Motion Forecasting With Self-Consistent Constraints | ICCV 23 Paper |
| 13) Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting | Paper |
| 14) Enhancing Multi-Camera People Tracking with Anchor-Guided Clustering and Spatio-Temporal Consistency ID Re-Assignment | CVPRW 23 Paper, GitHub |
| 15) Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing | ArXiv 21 |
| 16) Semi-Supervised Crowd Counting With Spatial Temporal Consistency and Pseudo-Label Filter | TCSVT 23 Paper |
| 17) Spatio-temporal Consistency and Hierarchical Matching for Multi-Target Multi-Camera Vehicle Tracking | CVPRW 19 Paper |
| 18) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (-) | ArXiv 23 |
| 19) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM (-) | ArXiv 24 |
| 20) MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask | ArXiv 23 |
11 Prompt Engineering
| Paper | Link |
|---|---|
| 1) RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models | ArXiv 24, GitHub, Project |
| 2) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs | ArXiv 24, GitHub |
| 3) LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models | TMLR 23 Paper, GitHub |
| 4) LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts | ICLR 24 Paper, GitHub |
| 5) Progressive Text-to-Image Diffusion with Soft Latent Direction | ArXiv 23 |
| 6) Self-correcting LLM-controlled Diffusion Models | CVPR 24 Paper, GitHub |
| 7) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation | MM 23 Paper |
| 8) LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | NeurIPS 23 Paper, GitHub |
| 9) Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition | ArXiv 24, GitHub |
| 10) InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions | ArXiv 23, GitHub |
| 11) Controllable Text-to-Image Generation with GPT-4 | ArXiv 23 |
| 12) LLM-grounded Video Diffusion Models | ICLR 24 Paper |
| 13) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning | ArXiv 23 |
| 14) FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax | ArXiv 23, Github, Project |
| 15) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM | ArXiv 24 |
| 16) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator | NeurIPS 23 Paper |
| 17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models | ArXiv 23 |
| 18) MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation | ArXiv 23 |
| 19) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | ArXiv 23 |
| 20) Multimodal Procedural Planning via Dual Text-Image Prompting | ArXiv 23, Github |
| 21) InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists | ICLR 24 Paper, Github |
| 22) DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback | ArXiv 23 |
| 23) TaleCrafter: Interactive Story Visualization with Multiple Characters | SIGGRAPH Asia 23 Paper |
| 24) Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis | ArXiv 23, Github |
| 25) COLE: A Hierarchical Generation Framework for Graphic Design | ArXiv 23 |
| 26) Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision | ArXiv 23 |
| 27) Vlogger: Make Your Dream A Vlog | CVPR 24 Paper, Github |
| 28) GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting | Paper |
| 29) MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion | ArXiv 24 |
Recaption
| Paper | Link |
|---|---|
| 1) LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models | ArXiv 23, GitHub |
| 2) Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation | ArXiv 23, GitHub |
| 3) CoCa: Contrastive Captioners are Image-Text Foundation Models | ArXiv 22, Github |
| 4) CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion | ArXiv 24 |
| 5) VideoChat: Chat-Centric Video Understanding | CVPR 24 Paper, Github |
| 6) De-Diffusion Makes Text a Strong Cross-Modal Interface | ArXiv 23 |
| 7) HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ArXiv 23 |
| 8) SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data | ArXiv 24 |
| 9) LLMGA: Multimodal Large Language Model based Generation Assistant | ArXiv 23, Github |
| 10) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | ArXiv 24, Github |
| 11) MyVLM: Personalizing VLMs for User-Specific Queries | ArXiv 24 |
| 12) A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation | ArXiv 23, Github |
| 13) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs(-) | ArXiv 24, Github |
| 14) FlexCap: Generating Rich, Localized, and Flexible Captions in Images | ArXiv 24 |
| 15) Video ReCap: Recursive Captioning of Hour-Long Videos | ArXiv 24, Github |
| 16) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ICML 22, Github |
| 17) PromptCap: Prompt-Guided Task-Aware Image Captioning | ICCV 23, Github |
| 18) CIC: A framework for Culturally-aware Image Captioning | ArXiv 24 |
| 19) Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion | ArXiv 24 |
| 20) FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions | WACV 24, Github |
12 Security
| Paper | Link |
|---|---|
| 1) BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | NeurIPS 23 Paper, Github |
| 2) LIMA: Less Is More for Alignment | NeurIPS 23 Paper |
| 3) Jailbroken: How Does LLM Safety Training Fail? | NeurIPS 23 Paper |
| 4) Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | CVPR 23 Paper |
| 5) Stable Bias: Evaluating Societal Representations in Diffusion Models | NeurIPS 23 Paper |
| 6) Ablating concepts in text-to-image diffusion models | ICCV 23 Paper |
| 7) Diffusion art or digital forgery? investigating data replication in diffusion models | ICCV 23 Paper, Project |
| 8) Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks | ICCV 20 Paper |
| 9) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks | ICML 20 Paper |
| 10) A pilot study of query-free adversarial attack against stable diffusion | ICCV 23 Paper |
| 11) Interpretable-Through-Prototypes Deepfake Detection for Diffusion Models | ICCV 23 Paper |
| 12) Erasing Concepts from Diffusion Models | ICCV 23 Paper, Project |
| 13) Ablating Concepts in Text-to-Image Diffusion Models | ICCV 23 Paper, Project |
| 14) BEAVERTAILS: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | NeurIPS 23 Paper, Project |
| 15) Stable Bias: Evaluating Societal Representations in Diffusion Models | NeurIPS 23 Paper |
| 16) Threat Model-Agnostic Adversarial Defense using Diffusion Models | Paper |
| 17) How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions? | Paper, Github |
| 18) Differentially Private Diffusion Models Generate Useful Synthetic Images | Paper |
| 19) Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models | SIGSAC 23 Paper, Github |
| 20) Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models | Paper, Github |
| 21) Unified Concept Editing in Diffusion Models | WACV 24 Paper, Project |
| 22) Diffusion Model Alignment Using Direct Preference Optimization | ArXiv 23 |
| 23) RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment | TMLR 23 Paper , Github |
| 24) Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation | Paper, Github, Project |
13 World Model
| Paper | Link |
|---|---|
| 1) NExT-GPT: Any-to-Any Multimodal LLM | ArXiv 23, GitHub |
14 Video Compression
| Paper | Link |
|---|---|
| 1) H.261: Video codec for audiovisual services at p x 64 kbit/s | Paper |
| 2) H.262: Information technology - Generic coding of moving pictures and associated audio information: Video | Paper |
| 3) H.263: Video coding for low bit rate communication | Paper |
| 4) H.264: Overview of the H.264/AVC video coding standard | Paper |
| 5) H.265: Overview of the High Efficiency Video Coding (HEVC) Standard | Paper |
| 6) H.266: Overview of the Versatile Video Coding (VVC) Standard and its Applications | Paper |
| 7) DVC: An End-to-end Deep Video Compression Framework | CVPR 19 Paper, GitHub |
| 8) OpenDVC: An Open Source Implementation of the DVC Video Compression Method | Paper, GitHub |
| 9) HLVC: Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement | CVPR 20 Paper, Github |
| 10) RLVC: Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model | J-STSP 21 Paper, Github |
| 11) PLVC: Perceptual Learned Video Compression with Recurrent Conditional GAN | IJCAI 22 Paper, Github |
| 12) ALVC: Advancing Learned Video Compression with In-loop Frame Prediction | T-CSVT 22 Paper, Github |
| 13) DCVC: Deep Contextual Video Compression | NeurIPS 21 Paper, Github |
| 14) DCVC-TCM: Temporal Context Mining for Learned Video Compression | TM 22 Paper, Github |
| 15) DCVC-HEM: Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression | MM 22 Paper, Github |
| 16) DCVC-DC: Neural Video Compression with Diverse Contexts | CVPR 23 Paper, Github |
| 17) DCVC-FM: Neural Video Compression with Feature Modulation | CVPR 24 Paper, Github |
| 18) SSF: Scale-Space Flow for End-to-End Optimized Video Compression | CVPR 20 Paper, Github |
15 Mamba
15.1 Theoretical Foundations and Model Architecture

| Paper | Link |
| --- | --- |
| 1) Mamba: Linear-Time Sequence Modeling with Selective State Spaces | ArXiv 23, GitHub |
| 2) Efficiently Modeling Long Sequences with Structured State Spaces | ICLR 22 Paper, GitHub |
| 3) Modeling Sequences with Structured State Spaces | Paper |
| 4) Long Range Language Modeling via Gated State Spaces | ArXiv 22, GitHub |
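The S4/Mamba papers above all build on the same discrete linear state-space recurrence, h_k = A·h_{k-1} + B·u_k with readout y_k = C·h_k. A minimal toy sketch of that (non-selective) scan, with made-up scalar parameters rather than anything taken from the papers:

```python
# Toy discrete state-space model (SSM) scan, the recurrence behind
# S4-style models:  h_k = A*h_{k-1} + B*u_k,   y_k = C*h_k.
# Scalar state for clarity; real models use structured matrices and,
# in Mamba, input-dependent ("selective") A, B, C.

def ssm_scan(u, A, B, C, h0=0.0):
    """Run the linear recurrence over an input sequence u."""
    h, ys = h0, []
    for u_k in u:
        h = A * h + B * u_k   # state update
        ys.append(C * h)      # readout
    return ys

# With 0 < A < 1 the state is an exponentially decaying memory of the past:
ys = ssm_scan([1.0, 0.0, 0.0], A=0.5, B=1.0, C=2.0)  # [2.0, 1.0, 0.5]
```

Mamba's contribution is making these parameters input-dependent and computing the scan efficiently on GPU; the underlying recurrence stays this simple.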
15.2 Image Generation and Visual Applications

| Paper | Link |
| --- | --- |
| 1) Diffusion Models Without Attention | ArXiv 23 |
| 2) Pan-Mamba: Effective Pan-Sharpening with State Space Model | ArXiv 24, GitHub |
| 3) Pretraining Without Attention | ArXiv 22, GitHub |
| 4) Block-State Transformers | NeurIPS 23 Paper |
| 5) Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model | ArXiv 24, GitHub |
| 6) VMamba: Visual State Space Model | ArXiv 24, GitHub |
| 7) ZigMa: Zigzag Mamba Diffusion Model | ArXiv 24, GitHub |
| 8) MambaVision: A Hybrid Mamba-Transformer Vision Backbone | ArXiv 24, GitHub |
15.3 Video Processing and Understanding

| Paper | Link |
| --- | --- |
| 1) Long Movie Clip Classification with State-Space Video Models | ECCV 22 Paper, GitHub |
| 2) Selective Structured State-Spaces for Long-Form Video Understanding | CVPR 23 Paper |
| 3) Efficient Movie Scene Detection Using State-Space Transformers | CVPR 23 Paper, GitHub |
| 4) VideoMamba: State Space Model for Efficient Video Understanding | Paper, GitHub |
15.4 Medical Image Processing

| Paper | Link |
| --- | --- |
| 1) Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining | ArXiv 24, GitHub |
| 2) MambaIR: A Simple Baseline for Image Restoration with State-Space Model | ArXiv 24, GitHub |
| 3) VM-UNet: Vision Mamba UNet for Medical Image Segmentation | ArXiv 24, GitHub |
16 Existing High-Quality Resources

| Resources | Link |
| --- | --- |
| 1) Datawhale - AI Video Generation Learning | Feishu doc |
| 2) A Survey on Generative Diffusion Model | TKDE 24 Paper, GitHub |
| 3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models | ArXiv 23, GitHub |
| 4) Awesome-Text-To-Video: A Survey on Text-to-Video Generation/Synthesis | GitHub |
| 5) video-generation-survey: A reading list of video generation | GitHub |
| 6) Awesome-Video-Diffusion | GitHub |
| 7) Video Generation Task in Papers With Code | Task |
| 8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | ArXiv 24, GitHub |
| 9) Open-Sora-Plan (PKU-YuanGroup) | GitHub |
| 10) State of the Art on Diffusion Models for Visual Computing | Paper |
| 11) Diffusion Models: A Comprehensive Survey of Methods and Applications | CSUR 24 Paper, GitHub |
| 12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable | Paper |
| 13) On the Design Fundamentals of Diffusion Models: A Survey | Paper |
| 14) Efficient Diffusion Models for Vision: A Survey | Paper |
| 15) Text-to-Image Diffusion Models in Generative AI: A Survey | Paper |
| 16) Awesome-Diffusion-Transformers | GitHub, Project |
| 17) Open-Sora (HPC-AI Tech) | GitHub, Blog |
| 18) LAVIS - A Library for Language-Vision Intelligence | ACL 23 Paper, GitHub, Project |
| 19) OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference | GitHub |
| 20) Awesome-Long-Context | GitHub1, GitHub2 |
| 21) Lite-Sora | GitHub |
| 22) Mira: A Mini-step Towards Sora-like Long Video Generation | GitHub, Project |
17 Efficient Training
17.1 Parallelism-based Approach
17.1.1 Data Parallelism (DP)

| Paper | Link |
| --- | --- |
| 1) A bridging model for parallel computation | Paper |
| 2) PyTorch Distributed: Experiences on Accelerating Data Parallel Training | VLDB 20 Paper |
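A minimal sketch of the idea behind DDP-style data parallelism (toy functions, not the torch.distributed API): each replica computes the gradient on its own shard of the batch, gradients are averaged (the all-reduce step), and every replica applies the identical update.

```python
# Toy data parallelism: each "replica" computes the gradient of a linear
# least-squares loss on its own shard of the batch, gradients are averaged
# (the all-reduce), and the shared weight gets one identical update.
# Illustrative sketch only -- not a real distributed setup.

def local_grad(w, shard):
    # d/dw of mean 0.5*(w*x - y)^2 over the shard
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def dp_step(w, shards, lr=0.1):
    grads = [local_grad(w, s) for s in shards]  # per-replica gradients
    g = sum(grads) / len(grads)                 # all-reduce: average
    return w - lr * g

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
# With equal shard sizes, 2-way data parallelism matches single-replica SGD:
w_dp = dp_step(1.0, [batch[:2], batch[2:]])
w_single = dp_step(1.0, [batch])
```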
17.1.2 Model Parallelism (MP)

| Paper | Link |
| --- | --- |
| 1) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | ArXiv 19 Paper |
| 2) TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models | PMLR 21 Paper |
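Megatron-style tensor parallelism can be shown in miniature: split a linear layer's output columns across "devices", let each compute its slice independently, and concatenate the slices. A pure-Python sketch with illustrative names (real implementations pair this with row-parallel layers and communication ops):

```python
# Column parallelism: the weight columns of a linear layer are split
# across two "devices"; each computes its part of the output, and the
# parts are concatenated -- no communication needed for this layer.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def linear(cols, x):
    """One output per weight column: y_j = <w_j, x>."""
    return [dot(w, x) for w in cols]

cols = [[1, 0], [0, 1], [1, 1], [2, -1]]   # 4 output units, 2 inputs
x = [3, 5]

y_full = linear(cols, x)                            # single-device result
y_par = linear(cols[:2], x) + linear(cols[2:], x)   # 2-way column parallel
```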
17.1.3 Pipeline Parallelism (PP)

| Paper | Link |
| --- | --- |
| 1) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS 19 Paper |
| 2) PipeDream: generalized pipeline parallelism for DNN training | SOSP 19 Paper |
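The benefit of GPipe-style micro-batching is easy to quantify: with s pipeline stages and m micro-batches, a sweep takes s + m - 1 ticks instead of s * m, and the idle "bubble" fraction shrinks as m grows. A back-of-envelope sketch under a simplified timing model (equal per-stage time, communication ignored):

```python
# GPipe-style schedule arithmetic: a pipelined sweep over m micro-batches
# through s stages takes s + m - 1 ticks; the (s - 1)-tick ramp-up and
# ramp-down form the pipeline "bubble".

def pipeline_ticks(stages, micro_batches):
    return stages + micro_batches - 1

def bubble_fraction(stages, micro_batches):
    return (stages - 1) / pipeline_ticks(stages, micro_batches)

# 4 stages, 8 micro-batches: 11 ticks instead of 32; bubble = 3/11.
# More micro-batches shrink the bubble, which is GPipe's main lever.
```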
17.1.4 Generalized Parallelism (GP)

| Paper | Link |
| --- | --- |
| 1) Mesh-TensorFlow: Deep Learning for Supercomputers | ArXiv 18 Paper |
| 2) Beyond Data and Model Parallelism for Deep Neural Networks | MLSys 19 Paper |
17.1.5 ZeRO Parallelism (ZP)

| Paper | Link |
| --- | --- |
| 1) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | ArXiv 20 |
| 2) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters | ACM 20 Paper |
| 3) ZeRO-Offload: Democratizing Billion-Scale Model Training | ArXiv 21 |
| 4) PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | ArXiv 23 |
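The ZeRO paper's mixed-precision accounting makes the stages concrete: roughly 2 bytes of fp16 weights + 2 bytes of fp16 gradients + 12 bytes of fp32 optimizer state (master weights, momentum, variance) per parameter, with each stage sharding one more of those across the data-parallel group. A sketch of that arithmetic (the helper function is illustrative; the numbers follow the paper's example):

```python
# Rough per-parameter memory for mixed-precision Adam under ZeRO.

def bytes_per_param(stage, n_gpus):
    p, g, o = 2.0, 2.0, 12.0
    if stage >= 3:
        p /= n_gpus   # stage 3: shard the fp16 parameters too
    if stage >= 2:
        g /= n_gpus   # stage 2: shard the gradients
    if stage >= 1:
        o /= n_gpus   # stage 1: shard the optimizer state
    return p + g + o

# 7.5B parameters on 64 GPUs: 16 B/param -> 120 GB per GPU without ZeRO,
# ~1.9 GB per GPU with ZeRO stage 3.
def per_gpu_gb(stage, params=7.5e9, n_gpus=64):
    return params * bytes_per_param(stage, n_gpus) / 1e9
```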
17.2 Non-parallelism-based Approach
17.2.1 Reducing Activation Memory

| Paper | Link |
| --- | --- |
| 1) Gist: Efficient Data Encoding for Deep Neural Network Training | IEEE 18 Paper |
| 2) Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization | MLSys 20 Paper |
| 3) Training Deep Nets with Sublinear Memory Cost | ArXiv 16 Paper |
| 4) Superneurons: dynamic GPU memory management for training deep neural networks | ACM 18 Paper |
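Gradient checkpointing (the "Sublinear Memory Cost" paper above) trades compute for memory: keep one stored activation per segment boundary and recompute inside the active segment during backward. A toy cost model (illustrative, ignoring constant factors and the extra forward pass) shows why segments of size about sqrt(n) are the sweet spot:

```python
# Toy peak-memory model for gradient checkpointing: store one activation
# per segment boundary plus the activations of the segment currently
# being recomputed during the backward pass.

import math

def peak_stored_activations(n_layers, seg_len):
    n_segments = math.ceil(n_layers / seg_len)
    return n_segments + seg_len

def best_segment(n_layers):
    return min(range(1, n_layers + 1),
               key=lambda s: peak_stored_activations(n_layers, s))

# 100 layers: ~100 stored activations without checkpointing, but only
# 10 + 10 = 20 with sqrt(n)-sized segments -- the O(sqrt(n)) result.
```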
17.2.2 CPU-Offloading

| Paper | Link |
| --- | --- |
| 1) Training Large Neural Networks with Constant Memory using a New Execution Algorithm | ArXiv 20 Paper |
| 2) vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design | IEEE 16 Paper |
17.2.3 Memory Efficient Optimizer

| Paper | Link |
| --- | --- |
| 1) Adafactor: Adaptive Learning Rates with Sublinear Memory Cost | PMLR 18 Paper |
| 2) Memory-Efficient Adaptive Optimization for Large-Scale Learning | Paper |
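Adafactor's memory saving comes from storing only the row and column sums of the squared-gradient accumulator matrix and reconstructing it as a rank-1 outer product, cutting an n x m state to n + m numbers. A toy sketch of the factored estimate (the reconstruction only, not the full optimizer update):

```python
# Adafactor-style factored second moment: keep row sums r and column sums
# c of the accumulator V and reconstruct V ~= outer(r, c) / sum(r).

def factored_estimate(V):
    r = [sum(row) for row in V]            # row sums (length n)
    c = [sum(col) for col in zip(*V)]      # column sums (length m)
    total = sum(r)
    return [[ri * cj / total for cj in c] for ri in r]

# A rank-1 accumulator is reconstructed exactly:
V = [[1.0, 2.0], [2.0, 4.0]]
V_hat = factored_estimate(V)
```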
17.3 Novel Structure

| Paper | Link |
| --- | --- |
| 1) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | ArXiv 24, GitHub |
18 Efficient Inference
18.1 Reduce Sampling Steps
18.1.1 Continuous Steps

| Paper | Link |
| --- | --- |
| 1) Generative Modeling by Estimating Gradients of the Data Distribution | NeurIPS 19 Paper |
| 2) WaveGrad: Estimating Gradients for Waveform Generation | ArXiv 20 |
| 3) Noise Level Limited Sub-Modeling for Diffusion Probabilistic Vocoders | ICASSP 21 Paper |
| 4) Noise Estimation for Generative Diffusion Models | ArXiv 21 |
18.1.2 Fast Sampling

| Paper | Link |
| --- | --- |
| 1) Denoising Diffusion Implicit Models | ICLR 21 Paper |
| 2) DiffWave: A Versatile Diffusion Model for Audio Synthesis | ICLR 21 Paper |
| 3) On Fast Sampling of Diffusion Probabilistic Models | ArXiv 21 |
| 4) DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps | NeurIPS 22 Paper |
| 5) DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models | ArXiv 22 |
| 6) Fast Sampling of Diffusion Models with Exponential Integrator | ICLR 23 Paper |
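The fast samplers above share one mechanism: from the current noisy sample and a noise prediction, first reconstruct an estimate of x0, then jump deterministically to a much earlier timestep. A toy 1-D DDIM-style step (made-up numbers, no trained model) shows why a perfect noise prediction allows arbitrarily large jumps:

```python
# Deterministic DDIM-style update (eta = 0): reconstruct x0 from the
# noise prediction, then move to a less-noisy timestep. ab_t denotes
# alpha-bar at step t, with x_t = sqrt(ab_t)*x0 + sqrt(1-ab_t)*eps.

import math

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    x0_pred = (x_t - math.sqrt(1 - ab_t) * eps_pred) / math.sqrt(ab_t)
    return math.sqrt(ab_prev) * x0_pred + math.sqrt(1 - ab_prev) * eps_pred

x0, eps, ab_t = 0.7, -1.3, 0.25
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1 - ab_t) * eps   # forward process

# With the true eps, a single step to ab_prev = 1 recovers x0 -- the
# property that lets DDIM/DPM-Solver take few, large steps.
x_rec = ddim_step(x_t, eps, ab_t, 1.0)
```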
18.1.3 Step Distillation

| Paper | Link |
| --- | --- |
| 1) On Distillation of Guided Diffusion Models | CVPR 23 Paper |
| 2) Progressive Distillation for Fast Sampling of Diffusion Models | ICLR 22 Paper |
| 3) SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds | NeurIPS 23 Paper |
| 4) Tackling the Generative Learning Trilemma with Denoising Diffusion GANs | ICLR 22 Paper |
18.2 Optimizing Inference
18.2.1 Low-bit Quantization

| Paper | Link |
| --- | --- |
| 1) Q-Diffusion: Quantizing Diffusion Models | CVPR 23 Paper |
| 2) Q-DM: An Efficient Low-bit Quantized Diffusion Model | NeurIPS 23 Paper |
| 3) Temporal Dynamic Quantization for Diffusion Models | NeurIPS 23 Paper |
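Plain uniform post-training quantization is the baseline these papers improve on: map weights to signed 8-bit codes with a single scale, dequantize, and accept a reconstruction error of at most half a quantization step. A minimal sketch with illustrative helper names (the papers above add diffusion-aware calibration across timesteps):

```python
# Uniform symmetric quantization: one shared scale maps floats to signed
# 8-bit integer codes; dequantization error is bounded by scale / 2.

def quantize(xs, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1                  # 127 for int8
    scale = max(abs(x) for x in xs) / qmax
    q = [round(x / scale) for x in xs]            # integer codes
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

xs = [0.05, -0.4, 1.0, -1.27]
q, scale = quantize(xs)
x_hat = dequantize(q, scale)
```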
18.2.2 Parallel/Sparse Inference

| Paper | Link |
| --- | --- |
| 1) DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models | CVPR 24 Paper |
| 2) Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models | NeurIPS 22 Paper |
| 3) PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models | ArXiv 24 |
Citation

If this project is helpful to your work, please cite it using the following format:

@misc{minisora,
  title={MiniSora},
  author={MiniSora Community},
  url={https://github.com/mini-sora/minisora},
  year={2024}
}

@misc{minisora-survey,
  title={From DDPM to Sora: A Survey of Video Generation Models Based on Diffusion Models},
  author={MiniSora Community Survey Team},
  url={https://github.com/mini-sora/minisora},
  year={2024}
}
MiniSora Community WeChat Group
Star History
How to Contribute to the MiniSora Community
We greatly appreciate your contributions to the MiniSora open-source community and your help in making it even better!
For more details, please refer to the Contribution Guide.
Community Contributors
FAQ

