[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Wan-Video--Wan2.2":3,"tool-Wan-Video--Wan2.2":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,2,"2026-04-05T10:45:23",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},3833,"MoneyPrinterTurbo","harry0703\u002FMoneyPrinterTurbo","MoneyPrinterTurbo 是一款利用 AI 大模型技术，帮助用户一键生成高清短视频的开源工具。只需输入一个视频主题或关键词，它就能全自动完成从文案创作、素材匹配、字幕合成到背景音乐搭配的全过程，最终输出完整的竖屏或横屏短视频。\n\n这款工具主要解决了传统视频制作流程繁琐、门槛高以及素材版权复杂等痛点。无论是需要快速产出内容的自媒体创作者，还是希望尝试视频生成的普通用户，无需具备专业的剪辑技能或昂贵的硬件配置（普通电脑即可运行），都能轻松上手。同时，其清晰的 MVC 架构和对多种主流大模型（如 DeepSeek、Moonshot、通义千问等）的广泛支持，也使其成为开发者进行二次开发或技术研究的理想底座。\n\nMoneyPrinterTurbo 的独特亮点在于其高度的灵活性与本地化友好性。它不仅支持中英文双语及多种语音合成，允许用户精细调整字幕样式和画面比例，还特别优化了国内网络环境下的模型接入方案，让用户无需依赖 VPN 即可使用高性能国产大模型。此外，工具提供批量生成模式，可一次性产出多个版本供用户择优，极大地提升了内容创作的效率与质量。",54991,3,"2026-04-05T12:23:02",[20,19,17,15,13],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":10,"last_commit_at":38,"category_tags":39,"status":22},2179,"oh-my-openagent","code-yeongyu\u002Foh-my-openagent","oh-my-openagent（简称 omo）是一款强大的开源智能体编排框架，前身名为 oh-my-opencode。它致力于打破单一模型供应商的生态壁垒，解决开发者在构建 AI 应用时面临的“厂商锁定”难题。不同于仅依赖特定模型的封闭方案，omo 倡导开放市场理念，支持灵活调度多种主流大模型：利用 Claude、Kimi 或 GLM 进行任务编排，调用 GPT 处理复杂推理，借助 Minimax 提升响应速度，或发挥 Gemini 的创意优势。\n\n这款工具特别适合希望摆脱平台限制、追求极致性能与成本平衡的开发者及研究人员使用。通过统一接口，用户可以轻松组合不同模型的长处，构建更高效、更具适应性的智能体系统。其独特的技术亮点在于“全模型兼容”架构，让用户不再受制于某一家公司的策略变动或定价调整，真正实现对前沿模型资源的自由驾驭。无论是构建自动化编码助手，还是开发多步骤任务处理流程，oh-my-openagent 都能提供灵活且稳健的基础设施支持，助力用户在快速演进的 AI 生态中保持技术主动权。",48371,"2026-04-05T11:36:18",[15,19,20,13,17],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},2483,"onlook","onlook-dev\u002Fonlook","Onlook 是一款专为设计师打造的开源 AI 优先设计工具，被誉为“设计师版的 Cursor”。它旨在打破设计与开发之间的壁垒，让用户能够以可视化的方式直接构建、样式化和编辑 React 应用。通过 Onlook，用户无需深入编写复杂代码，即可在类似 Figma 的直观界面中完成网页原型的搭建与调整，并实时预览最终效果。\n\n这款工具主要解决了传统工作流中设计稿到代码转换效率低、沟通成本高的问题。以往，设计师使用 Figma 等工具完成设计后，需要开发人员手动将其转化为代码，过程繁琐且容易出错。Onlook 允许用户直接在浏览器 DOM 中进行可视化编辑，底层自动生成基于 Next.js 和 TailwindCSS 的高质量代码，实现了“所见即所得”的开发体验。它不仅支持从文本或图像快速生成应用，还具备分支管理、资源管理及一键部署等功能，极大地简化了从创意到成品的流程。\n\nOnlook 特别适合前端开发者、UI\u002FUX 设计师以及希望快速验证产品创意的独立开发者使用。对于设计师而言，它降低了参与前端开发的门槛；对于开发者来说，它提供了一个高效的视觉化调试和原型构建环境。其核心技术亮点在于",25006,4,"2026-04-03T01:50:49",[17,13,15,20],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":10,"last_commit_at":55,"category_tags":56,"status":22},3795,"serena","oraios\u002Fserena","Serena 是一款专为编程智能体（Coding Agent）打造的强大工具包，被誉为“智能体的集成开发环境（IDE）”。它通过模型上下文协议（MCP）与各类大语言模型及客户端无缝集成，旨在解决传统 AI 在复杂代码库中因依赖行号或简单文本搜索而导致的效率低下和准确性不足的问题。\n\n与传统方法不同，Serena 采用“智能体优先”的设计理念，提供基于语义的代码检索、编辑和重构能力。它能像资深开发者使用 IDE 一样，深入理解代码的符号层级和关联结构，从而让智能体在大型项目中运行得更快、更稳、更可靠。无论是终端用户（如 Claude Code）、IDE 插件（VSCode、Cursor）还是桌面应用，都能轻松接入 Serena 以扩展功能。\n\nSerena 
特别适合需要处理大规模代码项目的开发者、研究人员以及希望提升 AI 编码能力的技术团队。其核心技术亮点在于灵活的后端支持：既默认集成了基于语言服务器协议（LSP）的开源方案，支持超过 40 种编程语言；也可选配强大的 JetBrains 插件，利用专业 IDE 的深度分析能力。这让 Serena 成为连接人工智能与复杂软件工程的高效桥",22488,"2026-04-05T10:53:54",[17,13,20,15],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":29,"last_commit_at":63,"category_tags":64,"status":22},3856,"sam2","facebookresearch\u002Fsam2","SAM 2 是 Meta 推出的新一代基础模型，旨在解决图像与视频中的“提示式视觉分割”难题。无论是静态图片还是动态视频，用户只需提供简单的点击、框选等提示，SAM 2 就能精准识别并分割出目标对象。它将单张图像视为单帧视频进行处理，成功打破了以往模型在视频理解上的局限。\n\n这款工具特别适合计算机视觉开发者、AI 研究人员以及需要处理视频内容的设计师使用。对于希望探索多目标跟踪或构建交互式应用的技术团队，SAM 2 提供了强大的底层支持。其核心亮点在于采用了带有流式记忆机制的 Transformer 架构，能够实现实时的视频处理性能。此外，项目配套发布了迄今为止规模最大的视频分割数据集（SA-V），并通过“模型闭环数据引擎”不断自我进化。最新更新的 SAM 2.1 版本不仅提供了更优的预训练权重，还支持全模型编译加速及灵活的多目标独立追踪，让复杂场景下的视频分析变得更加高效与便捷。",18853,"2026-04-05T10:30:04",[13,15],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":81,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":98,"forks":99,"last_commit_at":100,"license":101,"difficulty_score":46,"env_os":102,"env_gpu":103,"env_ram":102,"env_deps":104,"category_tags":111,"github_topics":112,"view_count":115,"oss_zip_url":80,"oss_zip_packed_at":80,"status":22,"created_at":116,"updated_at":117,"faqs":118,"releases":148},1329,"Wan-Video\u002FWan2.2","Wan2.2","Wan: Open and Advanced Large-Scale Video Generative Models","Wan2.2 是一套开源的大规模视频生成模型，能把文字或图片一键变成 720P、24fps 的高清短片。它解决了传统模型画面模糊、动作僵硬、风格不可控等痛点，让开发者、科研人员、视频设计师乃至普通爱好者都能在消费级显卡（如 RTX 4090）上快速产出电影级质感的视频。\n\n亮点在于：采用“专家混合”架构，把去噪任务拆给不同专家，算力不变但容量更大；训练数据量相比上一代提升 65% 图像、83% 视频，复杂动作和语义理解显著增强；自带精细美学标签，可精准控制光影、色调、构图；还提供 14B 角色动画与语音驱动版本，可直接做人物替换或口播视频。","# Wan2.2\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_readme_87f80debdb90.png\" width=\"400\"\u002F>\n\u003Cp>\n\n\u003Cp align=\"center\">\n    💜 \u003Ca href=\"https:\u002F\u002Fwan.video\">\u003Cb>Wan\u003C\u002Fb>\u003C\u002Fa> &nbsp&nbsp ｜ &nbsp&nbsp 🖥️ \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2\">GitHub\u003C\u002Fa> &nbsp&nbsp  | &nbsp&nbsp🤗 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002F\">Hugging Face\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp🤖 \u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Forganization\u002FWan-AI\">ModelScope\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp 📑 \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.20314\">Paper\u003C\u002Fa> &nbsp&nbsp | &nbsp&nbsp 📑 \u003Ca href=\"https:\u002F\u002Fwan.video\u002Fwelcome?spm=a2ty_o02.30011076.0.0.6c9ee41eCcluqg\">Blog\u003C\u002Fa> &nbsp&nbsp |  &nbsp&nbsp 💬  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FAKNgpMK4Yj\">Discord\u003C\u002Fa>&nbsp&nbsp\n    \u003Cbr>\n    📕 \u003Ca href=\"https:\u002F\u002Falidocs.dingtalk.com\u002Fi\u002Fnodes\u002Fjb9Y4gmKWrx9eo4dCql9LlbYJGXn6lpz\">使用指南(中文)\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp 📘 \u003Ca href=\"https:\u002F\u002Falidocs.dingtalk.com\u002Fi\u002Fnodes\u002FEpGBa2Lm8aZxe5myC99MelA2WgN7R35y\">User Guide(English)\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp💬 \u003Ca href=\"https:\u002F\u002Fgw.alicdn.com\u002Fimgextra\u002Fi2\u002FO1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg\">WeChat(微信)\u003C\u002Fa>&nbsp&nbsp\n\u003Cbr>\n\n-----\n\n[**Wan: Open and Advanced Large-Scale Video Generative 
Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.20314) \u003Cbe>\n\n\nWe are excited to introduce **Wan2.2**, a major upgrade to our foundational video models. With **Wan2.2**, we have focused on incorporating the following innovations:\n\n- 👍 **Effective MoE Architecture**: Wan2.2 introduces a Mixture-of-Experts (MoE) architecture into video diffusion models. By separating the denoising process cross timesteps with specialized powerful expert models, this enlarges the overall model capacity while maintaining the same computational cost.\n\n- 👍 **Cinematic-level Aesthetics**: Wan2.2 incorporates meticulously curated aesthetic data, complete with detailed labels for lighting, composition, contrast, color tone, and more. This allows for more precise and controllable cinematic style generation, facilitating the creation of videos with customizable aesthetic preferences.\n\n- 👍 **Complex Motion Generation**: Compared to Wan2.1, Wan2.2 is trained on a significantly larger data, with +65.6% more images and +83.2% more videos. This expansion notably enhances the model's generalization across multiple dimensions such as motions,  semantics, and aesthetics, achieving TOP performance among all open-sourced and closed-sourced models. \n\n- 👍 **Efficient High-Definition Hybrid TI2V**:  Wan2.2 open-sources a 5B model built with our advanced Wan2.2-VAE that achieves a compression ratio of **16×16×4**. This model supports both text-to-video and image-to-video generation at 720P resolution with 24fps and can also run on consumer-grade graphics cards like 4090. It is one of the fastest **720P@24fps** models currently available, capable of serving both the industrial and academic sectors simultaneously.\n\n\n## Video Demos\n\n\u003Cdiv align=\"center\">\n  \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb63bfa58-d5d7-4de6-a1a2-98970b06d9a7\" width=\"70%\" poster=\"\"> \u003C\u002Fvideo>\n\u003C\u002Fdiv>\n\n## 🔥 Latest News!!\n* Nov 13, 2025: 👋 Wan2.2-Animate-14B has been integrated into Diffusers ([PR](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\u002Fpull\u002F12526),[Weights](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-Animate-14B-Diffusers)). Thanks to all community contributors. Enjoy!\n\n* Sep 19, 2025: 💃 We introduct **[Wan2.2-Animate-14B](https:\u002F\u002Fhumanaigc.github.io\u002Fwan-animate)**, an unified model for character animation and replacement with holistic movement and expression replication. We released the [model weights](#model-download) and [inference code](#run-wan-animate). And you can try it on [wan.video](https:\u002F\u002Fwan.video\u002F), [ModelScope Studio](https:\u002F\u002Fwww.modelscope.cn\u002Fstudios\u002FWan-AI\u002FWan2.2-Animate) or [HuggingFace Space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FWan-AI\u002FWan2.2-Animate)!\n* Aug 26, 2025: 🎵 We introduce **[Wan2.2-S2V-14B](https:\u002F\u002Fhumanaigc.github.io\u002Fwan-s2v-webpage)**, an audio-driven cinematic video generation model, including [inference code](#run-speech-to-video-generation), [model weights](#model-download), and [technical report](https:\u002F\u002Fhumanaigc.github.io\u002Fwan-s2v-webpage\u002Fcontent\u002Fwan-s2v.pdf)! 
Now you can try it on [wan.video](https:\u002F\u002Fwan.video\u002F),  [ModelScope Gradio](https:\u002F\u002Fwww.modelscope.cn\u002Fstudios\u002FWan-AI\u002FWan2.2-S2V) or [HuggingFace Gradio](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FWan-AI\u002FWan2.2-S2V)!\n* Jul 28, 2025: 👋 We have open a [HF space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FWan-AI\u002FWan-2.2-5B) using the TI2V-5B model. Enjoy!\n* Jul 28, 2025: 👋 Wan2.2 has been integrated into ComfyUI ([CN](https:\u002F\u002Fdocs.comfy.org\u002Fzh-CN\u002Ftutorials\u002Fvideo\u002Fwan\u002Fwan2_2) | [EN](https:\u002F\u002Fdocs.comfy.org\u002Ftutorials\u002Fvideo\u002Fwan\u002Fwan2_2)). Enjoy!\n* Jul 28, 2025: 👋 Wan2.2's T2V, I2V and TI2V have been integrated into Diffusers ([T2V-A14B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-T2V-A14B-Diffusers) | [I2V-A14B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-I2V-A14B-Diffusers) | [TI2V-5B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-TI2V-5B-Diffusers)). Feel free to give it a try!\n* Jul 28, 2025: 👋 We've released the inference code and model weights of **Wan2.2**.\n* Sep 5, 2025: 👋 We add text-to-speech synthesis support with [CosyVoice](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FCosyVoice) for Speech-to-Video generation task.\n\n\n## Community Works\nIf your research or project builds upon [**Wan2.1**](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.1) or [**Wan2.2**](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2), and you would like more people to see it, please inform us.\n\n- [Prompt Relay](https:\u002F\u002Fgithub.com\u002FGordonChen19\u002FPrompt-Relay), a plug-and-play, inference-time method for temporal control in video generation. Prompt Relay improves video quality and gives users precise control over what happens at each moment in the video. Visit their [webpage](https:\u002F\u002Fgordonchen19.github.io\u002FPrompt-Relay\u002F) for more details.\n- [Helios](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FHelios), a breakthrough video generation model base on **Wan2.1** that achieves minute-scale, high-quality video synthesis at 19.5 FPS on a single H100 GPU (about 10 FPS on a single Ascend NPU) —without relying on conventional long video anti-drifting strategies or standard video acceleration techniques. Visit their [webpage](https:\u002F\u002Fpku-yuangroup.github.io\u002FHelios-Page\u002F) for more details.\n- [LightX2V](https:\u002F\u002Fgithub.com\u002FModelTC\u002FLightX2V), a lightweight and efficient video generation framework that integrates **Wan2.1** and **Wan2.2**, supporting multiple engineering acceleration techniques for fast inference. [LightX2V-HuggingFace](https:\u002F\u002Fhuggingface.co\u002Flightx2v), offers a variety of Wan-based step-distillation models, quantized models, and lightweight VAE models.\n- [HuMo](https:\u002F\u002Fgithub.com\u002FPhantom-video\u002FHuMo) proposed a unified, human-centric framework based on **Wan** to produce high-quality, fine-grained, and controllable human videos from multimodal inputs—including text, images, and audio. Visit their [webpage](https:\u002F\u002Fphantom-video.github.io\u002FHuMo\u002F) for more details.\n- [FastVideo](https:\u002F\u002Fgithub.com\u002Fhao-ai-lab\u002FFastVideo) includes distilled **Wan** models with sparse attention that significanly speed up the inference time. 
\n- [Cache-dit](https:\u002F\u002Fgithub.com\u002Fvipshop\u002Fcache-dit) offers Fully Cache Acceleration support for **Wan2.2** MoE with DBCache, TaylorSeer and Cache CFG. Visit their [example](https:\u002F\u002Fgithub.com\u002Fvipshop\u002Fcache-dit\u002Fblob\u002Fmain\u002Fexamples\u002Fpipeline\u002Frun_wan_2.2.py) for more details.\n- [Kijai's ComfyUI WanVideoWrapper](https:\u002F\u002Fgithub.com\u002Fkijai\u002FComfyUI-WanVideoWrapper) is an alternative implementation of **Wan** models for ComfyUI. Thanks to its Wan-only focus, it's on the frontline of getting cutting edge optimizations and hot research features, which are often hard to integrate into ComfyUI quickly due to its more rigid structure.\n- [DiffSynth-Studio](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FDiffSynth-Studio) provides comprehensive support for **Wan 2.2**, including low-GPU-memory layer-by-layer offload, FP8 quantization, sequence parallelism, LoRA training, full training.\n\n\n## 📑 Todo List\n- Wan2.2 Text-to-Video\n    - [x] Multi-GPU Inference code of the A14B and 14B models\n    - [x] Checkpoints of the A14B and 14B models\n    - [x] ComfyUI integration\n    - [x] Diffusers integration\n- Wan2.2 Image-to-Video\n    - [x] Multi-GPU Inference code of the A14B model\n    - [x] Checkpoints of the A14B model\n    - [x] ComfyUI integration\n    - [x] Diffusers integration\n- Wan2.2 Text-Image-to-Video\n    - [x] Multi-GPU Inference code of the 5B model\n    - [x] Checkpoints of the 5B model\n    - [x] ComfyUI integration\n    - [x] Diffusers integration\n- Wan2.2-S2V Speech-to-Video\n    - [x] Inference code of Wan2.2-S2V\n    - [x] Checkpoints of Wan2.2-S2V-14B\n    - [x] ComfyUI integration\n    - [x] Diffusers integration\n- Wan2.2-Animate Character Animation and Replacement\n    - [x] Inference code of Wan2.2-Animate\n    - [x] Checkpoints of Wan2.2-Animate\n    - [x] ComfyUI integration\n    - [x] Diffusers integration\n\n## Run Wan2.2\n\n#### Installation\nClone the repo:\n```sh\ngit clone https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2.git\ncd Wan2.2\n```\n\nInstall dependencies:\n```sh\n# Ensure torch >= 2.4.0\n# If the installation of `flash_attn` fails, try installing the other packages first and install `flash_attn` last\npip install -r requirements.txt\n# If you want to use CosyVoice to synthesize speech for Speech-to-Video Generation, please install requirements_s2v.txt additionally\npip install -r requirements_s2v.txt\n```\n\n\n#### Model Download\n\n| Models              | Download Links                                                                                                                              | Description |\n|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------|-------------|\n| T2V-A14B    | 🤗 [Huggingface](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-T2V-A14B)    🤖 [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.2-T2V-A14B)    | Text-to-Video MoE model, supports 480P & 720P |\n| I2V-A14B    | 🤗 [Huggingface](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-I2V-A14B)    🤖 [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.2-I2V-A14B)    | Image-to-Video MoE model, supports 480P & 720P |\n| TI2V-5B     | 🤗 [Huggingface](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-TI2V-5B)     🤖 [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.2-TI2V-5B)   
  | High-compression VAE, T2V+I2V, supports 720P |\n| S2V-14B     | 🤗 [Huggingface](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-S2V-14B)     🤖 [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.2-S2V-14B)     | Speech-to-Video model, supports 480P & 720P |\n| Animate-14B | 🤗 [Huggingface](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-Animate-14B) 🤖 [ModelScope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.2-Animate-14B)  | Character animation and replacement | |\n\n\n\n> 💡Note: \n> The TI2V-5B model supports 720P video generation at **24 FPS**.\n\n\nDownload models using huggingface-cli:\n``` sh\npip install \"huggingface_hub[cli]\"\nhuggingface-cli download Wan-AI\u002FWan2.2-T2V-A14B --local-dir .\u002FWan2.2-T2V-A14B\n```\n\nDownload models using modelscope-cli:\n``` sh\npip install modelscope\nmodelscope download Wan-AI\u002FWan2.2-T2V-A14B --local_dir .\u002FWan2.2-T2V-A14B\n```\n\n#### Run Text-to-Video Generation\n\nThis repository supports the `Wan2.2-T2V-A14B` Text-to-Video model and can simultaneously support video generation at 480P and 720P resolutions.\n\n\n##### (1) Without Prompt Extension\n\nTo facilitate implementation, we will start with a basic version of the inference process that skips the [prompt extension](#2-using-prompt-extention) step.\n\n- Single-GPU inference\n\n``` sh\npython generate.py  --task t2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-T2V-A14B --offload_model True --convert_model_dtype --prompt \"Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.\"\n```\n\n> 💡 This command can run on a GPU with at least 80GB VRAM.\n\n> 💡If you encounter OOM (Out-of-Memory) issues, you can use the `--offload_model True`, `--convert_model_dtype` and `--t5_cpu` options to reduce GPU memory usage.\n\n\n- Multi-GPU inference using FSDP + DeepSpeed Ulysses\n\n  We use [PyTorch FSDP](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Ffsdp.html) and [DeepSpeed Ulysses](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.14509) to accelerate inference.\n\n\n``` sh\ntorchrun --nproc_per_node=8 generate.py --task t2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt \"Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.\"\n```\n\n\n##### (2) Using Prompt Extension\n\nExtending the prompts can effectively enrich the details in the generated videos, further enhancing the video quality. Therefore, we recommend enabling prompt extension. We provide the following two methods for prompt extension:\n\n- Use the Dashscope API for extension.\n  - Apply for a `dashscope.api_key` in advance ([EN](https:\u002F\u002Fwww.alibabacloud.com\u002Fhelp\u002Fen\u002Fmodel-studio\u002Fgetting-started\u002Ffirst-api-call-to-qwen) | [CN](https:\u002F\u002Fhelp.aliyun.com\u002Fzh\u002Fmodel-studio\u002Fgetting-started\u002Ffirst-api-call-to-qwen)).\n  - Configure the environment variable `DASH_API_KEY` to specify the Dashscope API key. For users of Alibaba Cloud's international site, you also need to set the environment variable `DASH_API_URL` to 'https:\u002F\u002Fdashscope-intl.aliyuncs.com\u002Fapi\u002Fv1'. 
For more detailed instructions, please refer to the [dashscope document](https:\u002F\u002Fwww.alibabacloud.com\u002Fhelp\u002Fen\u002Fmodel-studio\u002Fdeveloper-reference\u002Fuse-qwen-by-calling-api?spm=a2c63.p38356.0.i1).\n  - Use the `qwen-plus` model for text-to-video tasks and `qwen-vl-max` for image-to-video tasks.\n  - You can modify the model used for extension with the parameter `--prompt_extend_model`. For example:\n```sh\nDASH_API_KEY=your_key torchrun --nproc_per_node=8 generate.py  --task t2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt \"Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage\" --use_prompt_extend --prompt_extend_method 'dashscope' --prompt_extend_target_lang 'zh'\n```\n\n- Using a local model for extension.\n\n  - By default, the Qwen model on HuggingFace is used for this extension. Users can choose Qwen models or other models based on the available GPU memory size.\n  - For text-to-video tasks, you can use models like `Qwen\u002FQwen2.5-14B-Instruct`, `Qwen\u002FQwen2.5-7B-Instruct` and `Qwen\u002FQwen2.5-3B-Instruct`.\n  - For image-to-video tasks, you can use models like `Qwen\u002FQwen2.5-VL-7B-Instruct` and `Qwen\u002FQwen2.5-VL-3B-Instruct`.\n  - Larger models generally provide better extension results but require more GPU memory.\n  - You can modify the model used for extension with the parameter `--prompt_extend_model` , allowing you to specify either a local model path or a Hugging Face model. For example:\n\n``` sh\ntorchrun --nproc_per_node=8 generate.py  --task t2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt \"Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage\" --use_prompt_extend --prompt_extend_method 'local_qwen' --prompt_extend_target_lang 'zh'\n```\n\n\n#### Run Image-to-Video Generation\n\nThis repository supports the `Wan2.2-I2V-A14B` Image-to-Video model and can simultaneously support video generation at 480P and 720P resolutions.\n\n\n- Single-GPU inference\n```sh\npython generate.py --task i2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-I2V-A14B --offload_model True --convert_model_dtype --image examples\u002Fi2v_input.JPG --prompt \"Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside.\"\n```\n\n> This command can run on a GPU with at least 80GB VRAM.\n\n> 💡For the Image-to-Video task, the `size` parameter represents the area of the generated video, with the aspect ratio following that of the original input image.\n\n\n- Multi-GPU inference using FSDP + DeepSpeed Ulysses\n\n```sh\ntorchrun --nproc_per_node=8 generate.py --task i2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-I2V-A14B --image examples\u002Fi2v_input.JPG --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt \"Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. 
Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside.\"\n```\n\n- Image-to-Video Generation without prompt\n\n```sh\nDASH_API_KEY=your_key torchrun --nproc_per_node=8 generate.py --task i2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-I2V-A14B --prompt '' --image examples\u002Fi2v_input.JPG --dit_fsdp --t5_fsdp --ulysses_size 8 --use_prompt_extend --prompt_extend_method 'dashscope'\n```\n\n> 💡The model can generate videos solely from the input image. You can use prompt extension to generate prompt from the image.\n\n> The process of prompt extension can be referenced [here](#2-using-prompt-extention).\n\n#### Run Text-Image-to-Video Generation\n\nThis repository supports the `Wan2.2-TI2V-5B` Text-Image-to-Video model and can support video generation at 720P resolutions.\n\n\n- Single-GPU Text-to-Video inference\n```sh\npython generate.py --task ti2v-5B --size 1280*704 --ckpt_dir .\u002FWan2.2-TI2V-5B --offload_model True --convert_model_dtype --t5_cpu --prompt \"Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage\"\n```\n\n> 💡Unlike other tasks, the 720P resolution of the Text-Image-to-Video task is `1280*704` or `704*1280`.\n\n> This command can run on a GPU with at least 24GB VRAM (e.g, RTX 4090 GPU).\n\n> 💡If you are running on a GPU with at least 80GB VRAM, you can remove the `--offload_model True`, `--convert_model_dtype` and `--t5_cpu` options to speed up execution.\n\n\n- Single-GPU Image-to-Video inference\n```sh\npython generate.py --task ti2v-5B --size 1280*704 --ckpt_dir .\u002FWan2.2-TI2V-5B --offload_model True --convert_model_dtype --t5_cpu --image examples\u002Fi2v_input.JPG --prompt \"Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside.\"\n```\n\n> 💡If the image parameter is configured, it is an Image-to-Video generation; otherwise, it defaults to a Text-to-Video generation.\n\n> 💡Similar to Image-to-Video, the `size` parameter represents the area of the generated video, with the aspect ratio following that of the original input image.\n\n\n- Multi-GPU inference using FSDP + DeepSpeed Ulysses\n\n```sh\ntorchrun --nproc_per_node=8 generate.py --task ti2v-5B --size 1280*704 --ckpt_dir .\u002FWan2.2-TI2V-5B --dit_fsdp --t5_fsdp --ulysses_size 8 --image examples\u002Fi2v_input.JPG --prompt \"Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. 
A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside.\"\n```\n\n> The process of prompt extension can be referenced [here](#2-using-prompt-extention).\n\n#### Run Speech-to-Video Generation\n\nThis repository supports the `Wan2.2-S2V-14B` Speech-to-Video model and can simultaneously support video generation at 480P and 720P resolutions.\n\n- Single-GPU Speech-to-Video inference\n\n```sh\npython generate.py  --task s2v-14B --size 1024*704 --ckpt_dir .\u002FWan2.2-S2V-14B\u002F --offload_model True --convert_model_dtype --prompt \"Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard.\"  --image \"examples\u002Fi2v_input.JPG\" --audio \"examples\u002Ftalk.wav\"\n# Without setting --num_clip, the generated video length will automatically adjust based on the input audio length\n\n# You can use CosyVoice to generate audio with --enable_tts\npython generate.py  --task s2v-14B --size 1024*704 --ckpt_dir .\u002FWan2.2-S2V-14B\u002F --offload_model True --convert_model_dtype --prompt \"Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard.\"  --image \"examples\u002Fi2v_input.JPG\" --enable_tts --tts_prompt_audio \"examples\u002Fzero_shot_prompt.wav\" --tts_prompt_text \"希望你以后能够做的比我还好呦。\" --tts_text \"收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。\"\n```\n\n> 💡 This command can run on a GPU with at least 80GB VRAM.\n\n- Multi-GPU inference using FSDP + DeepSpeed Ulysses\n\n```sh\ntorchrun --nproc_per_node=8 generate.py --task s2v-14B --size 1024*704 --ckpt_dir .\u002FWan2.2-S2V-14B\u002F --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt \"Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard.\" --image \"examples\u002Fi2v_input.JPG\" --audio \"examples\u002Ftalk.wav\"\n```\n\n- Pose + Audio driven generation\n\n```sh\ntorchrun --nproc_per_node=8 generate.py --task s2v-14B --size 1024*704 --ckpt_dir .\u002FWan2.2-S2V-14B\u002F --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt \"a person is singing\" --image \"examples\u002Fpose.png\" --audio \"examples\u002Fsing.MP3\" --pose_video \".\u002Fexamples\u002Fpose.mp4\" \n```\n\n> 💡For the Speech-to-Video task, the `size` parameter represents the area of the generated video, with the aspect ratio following that of the original input image.\n\n> 💡The model can generate videos from audio input combined with reference image and optional text prompt.\n\n> 💡The `--pose_video` parameter enables pose-driven generation, allowing the model to follow specific pose sequences while generating videos synchronized with audio input.\n\n> 💡The `--num_clip` parameter controls the number of video clips generated, useful for quick preview with shorter generation time.\n\nPlease visit our project page to see more examples and learn about the scenarios suitable for this model.\n\n#### Run Wan-Animate \n\nWan-Animate takes a video and a character image as input, and generates a video in either \"animation\" or \"replacement\" mode. \n\n1. animation mode： The model generates a video of the character image that mimics the human motion in the input video.\n2. replacement mode: The model replaces the character image with the input video.\n\nPlease visit our [project page](https:\u002F\u002Fhumanaigc.github.io\u002Fwan-animate) to see more examples and learn about the scenarios suitable for this model.\n\n##### (1) Preprocessing \nThe input video should be preprocessed into several materials before be feed into the inference process.  
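\n\nAfter running the preprocessing commands below, the directory passed as `--save_path` holds the materials that `generate.py --src_root_path` consumes, such as a reference image and pose\u002Fface videos (plus background and mask videos in replacement mode). A quick sanity check might look like the following sketch; the file names are assumptions taken from the Diffusers examples later in this README, and the exact layout written by `preprocess_data.py` may differ:\n\n```python\n# Hypothetical check that Wan-Animate preprocessing produced the expected materials.\n# File names are assumed from the Diffusers examples in this README; adjust to the actual output layout.\nfrom pathlib import Path\n\ndef check_materials(save_path, replace=False):\n    required = [\"src_ref.png\", \"src_pose.mp4\", \"src_face.mp4\"]\n    if replace:\n        required += [\"src_bg.mp4\", \"src_mask.mp4\"]  # replacement mode also needs background and mask videos\n    missing = [name for name in required if not (Path(save_path) \u002F name).exists()]\n    if missing:\n        raise FileNotFoundError(f\"Preprocessing incomplete, missing: {missing}\")\n\ncheck_materials(\".\u002Fexamples\u002Fwan_animate\u002Fanimate\u002Fprocess_results\", replace=False)\n```\n\n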
Please refer to the following processing flow, and more details about preprocessing can be found in [UserGuider](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2\u002Fblob\u002Fmain\u002Fwan\u002Fmodules\u002Fanimate\u002Fpreprocess\u002FUserGuider.md).\n\n* For animation\n```bash\npython .\u002Fwan\u002Fmodules\u002Fanimate\u002Fpreprocess\u002Fpreprocess_data.py \\\n    --ckpt_path .\u002FWan2.2-Animate-14B\u002Fprocess_checkpoint \\\n    --video_path .\u002Fexamples\u002Fwan_animate\u002Fanimate\u002Fvideo.mp4 \\\n    --refer_path .\u002Fexamples\u002Fwan_animate\u002Fanimate\u002Fimage.jpeg \\\n    --save_path .\u002Fexamples\u002Fwan_animate\u002Fanimate\u002Fprocess_results \\\n    --resolution_area 1280 720 \\\n    --retarget_flag \\\n    --use_flux\n```\n* For replacement\n```bash\npython .\u002Fwan\u002Fmodules\u002Fanimate\u002Fpreprocess\u002Fpreprocess_data.py \\\n    --ckpt_path .\u002FWan2.2-Animate-14B\u002Fprocess_checkpoint \\\n    --video_path .\u002Fexamples\u002Fwan_animate\u002Freplace\u002Fvideo.mp4 \\\n    --refer_path .\u002Fexamples\u002Fwan_animate\u002Freplace\u002Fimage.jpeg \\\n    --save_path .\u002Fexamples\u002Fwan_animate\u002Freplace\u002Fprocess_results \\\n    --resolution_area 1280 720 \\\n    --iterations 3 \\\n    --k 7 \\\n    --w_len 1 \\\n    --h_len 1 \\\n    --replace_flag\n```\n##### (2) Run in animation mode \n\n* Single-GPU inference \n\n```bash\npython generate.py --task animate-14B --ckpt_dir .\u002FWan2.2-Animate-14B\u002F --src_root_path .\u002Fexamples\u002Fwan_animate\u002Fanimate\u002Fprocess_results\u002F --refert_num 1\n```\n\n* Multi-GPU inference using FSDP + DeepSpeed Ulysses\n\n```bash\npython -m torch.distributed.run --nnodes 1 --nproc_per_node 8 generate.py --task animate-14B --ckpt_dir .\u002FWan2.2-Animate-14B\u002F --src_root_path .\u002Fexamples\u002Fwan_animate\u002Fanimate\u002Fprocess_results\u002F --refert_num 1 --dit_fsdp --t5_fsdp --ulysses_size 8\n```\n\n* Diffusers Pipeline\n\n```python\nfrom diffusers import WanAnimatePipeline\nfrom diffusers.utils import export_to_video, load_image, load_video\n\ndevice = \"cuda:0\"\ndtype = torch.bfloat16\nmodel_id = \"Wan-AI\u002FWan2.2-Animate-14B-Diffusers\"\npipe = WanAnimatePipeline.from_pretrained(model_id torch_dtype=dtype)\npipe.to(device)\n\nseed = 42\nprompt = \"People in the video are doing actions.\"\n\n# Animation\nimage = load_image(\"\u002Fpath\u002Fto\u002Fanimate\u002Freference\u002Fimage\u002Fsrc_ref.png\")\npose_video = load_video(\"\u002Fpath\u002Fto\u002Fanimate\u002Fpose\u002Fvideo\u002Fsrc_pose.mp4\")\nface_video = load_video(\"\u002Fpath\u002Fto\u002Fanimate\u002Fface\u002Fvideo\u002Fsrc_face.mp4\")\n\nanimate_video = pipe(\n    image=image,\n    pose_video=pose_video,\n    face_video=face_video,\n    prompt=prompt,\n    mode=\"animate\",\n    segment_frame_length=77,  # clip_len in original code\n    prev_segment_conditioning_frames=1,  # refert_num in original code\n    guidance_scale=1.0,\n    num_inference_steps=20,\n    generator=torch.Generator(device=device).manual_seed(seed),\n).frames[0]\nexport_to_video(animate_video, \"diffusers_animate.mp4\", fps=30)\n```\n\n##### (3) Run in replacement mode \n\n* Single-GPU inference \n\n```bash\npython generate.py --task animate-14B --ckpt_dir .\u002FWan2.2-Animate-14B\u002F --src_root_path .\u002Fexamples\u002Fwan_animate\u002Freplace\u002Fprocess_results\u002F --refert_num 1 --replace_flag --use_relighting_lora \n```\n\n* Multi-GPU inference using FSDP + DeepSpeed Ulysses\n\n```bash\npython -m 
torch.distributed.run --nnodes 1 --nproc_per_node 8 generate.py --task animate-14B --ckpt_dir .\u002FWan2.2-Animate-14B\u002F --src_root_path .\u002Fexamples\u002Fwan_animate\u002Freplace\u002Fprocess_results\u002Fsrc_pose.mp4  --refert_num 1 --replace_flag --use_relighting_lora --dit_fsdp --t5_fsdp --ulysses_size 8\n```\n\n* Diffusers Pipeline\n\n```python\n# create pipeline as in the Animation code ☝️\n\n# Replacement\nimage = load_image(\"\u002Fpath\u002Fto\u002Freplace\u002Freference\u002Fimage\u002Fsrc_ref.png\")\npose_video = load_video(\"\u002Fpath\u002Fto\u002Freplace\u002Fpose\u002Fvideo\u002Fsrc_pose.mp4\")\nface_video = load_video(\"\u002Fpath\u002Fto\u002Freplace\u002Fface\u002Fvideo\u002Fsrc_face.mp4\")\nbackground_video = load_video(\"\u002Fpath\u002Fto\u002Freplace\u002Fbackground\u002Fvideo\u002Fsrc_bg.mp4\")\nmask_video = load_video(\"\u002Fpath\u002Fto\u002Freplace\u002Fmask\u002Fvideo\u002Fsrc_mask.mp4\")\n\nreplace_video = pipe(\n    image=image,\n    pose_video=pose_video,\n    face_video=face_video,\n    background_video=background_video,\n    mask_video=mask_video,\n    prompt=prompt,\n    mode=\"replace\",\n    segment_frame_length=77,  # clip_len in original code\n    prev_segment_conditioning_frames=1,  # refert_num in original code\n    guidance_scale=1.0,\n    num_inference_steps=20,\n    generator=torch.Generator(device=device).manual_seed(seed),\n).frames[0]\nexport_to_video(replace_video, \"diffusers_replace.mp4\", fps=30)\n```\n\n> 💡 If you're using **Wan-Animate**, we do not recommend using LoRA models trained on `Wan2.2`, since weight changes during training may lead to unexpected behavior.\n\n## Computational Efficiency on Different GPUs\n\nWe test the computational efficiency of different **Wan2.2** models on different GPUs in the following table. The results are presented in the format: **Total time (s) \u002F peak GPU memory (GB)**.\n\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_readme_f18a1a4d96fd.png\" alt=\"\" style=\"width: 80%;\" \u002F>\n\u003C\u002Fdiv>\n\n> The parameter settings for the tests presented in this table are as follows:\n> (1) Multi-GPU: 14B: `--ulysses_size 4\u002F8 --dit_fsdp --t5_fsdp`, 5B: `--ulysses_size 4\u002F8 --offload_model True --convert_model_dtype --t5_cpu`; Single-GPU: 14B: `--offload_model True --convert_model_dtype`, 5B: `--offload_model True --convert_model_dtype --t5_cpu`\n(--convert_model_dtype converts model parameter types to config.param_dtype);\n> (2) The distributed testing utilizes the built-in FSDP and Ulysses implementations, with FlashAttention3 deployed on Hopper architecture GPUs;\n> (3) Tests were run without the `--use_prompt_extend` flag;\n> (4) Reported results are the average of multiple samples taken after the warm-up phase.\n\n\n-------\n\n## Introduction of Wan2.2\n\n**Wan2.2** builds on the foundation of Wan2.1 with notable improvements in generation quality and model capability. This upgrade is driven by a series of key technical innovations, mainly including the Mixture-of-Experts (MoE) architecture, upgraded training data, and high-compression video generation.\n\n##### (1) Mixture-of-Experts (MoE) Architecture\n\nWan2.2 introduces Mixture-of-Experts (MoE) architecture into the video generation diffusion model. MoE has been widely validated in large language models as an efficient approach to increase total model parameters while keeping inference cost nearly unchanged. 
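\n\nThe timestep-based expert routing spelled out in the next paragraph can be sketched roughly as follows (a minimal illustration only; the function and variable names are assumptions, not part of this repository's API):\n\n```python\n# Rough sketch of the two-expert routing described in the next paragraph (not the repository's API).\n# t decreases from the high-noise start of sampling toward 0; t_moe is the boundary step derived\n# from the SNR threshold (half of SNR_min) explained below.\ndef select_expert(t, t_moe, high_noise_expert, low_noise_expert):\n    # Early, high-noise steps (t >= t_moe) shape the overall layout;\n    # later, low-noise steps (t < t_moe) refine video details.\n    return high_noise_expert if t >= t_moe else low_noise_expert\n```\n\n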
In Wan2.2, the A14B model series adopts a two-expert design tailored to the denoising process of diffusion models: a high-noise expert for the early stages, focusing on overall layout; and a low-noise expert for the later stages, refining video details. Each expert model has about 14B parameters, resulting in a total of 27B parameters but only 14B active parameters per step, keeping inference computation and GPU memory nearly unchanged.\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_readme_3a2c5479c154.png\" alt=\"\" style=\"width: 90%;\" \u002F>\n\u003C\u002Fdiv>\n\nThe transition point between the two experts is determined by the signal-to-noise ratio (SNR), a metric that decreases monotonically as the denoising step $t$ increases. At the beginning of the denoising process, $t$ is large and the noise level is high, so the SNR is at its minimum, denoted as ${SNR}_{min}$. In this stage, the high-noise expert is activated. We define a threshold step ${t}_{moe}$ corresponding to half of the ${SNR}_{min}$, and switch to the low-noise expert when $t\u003C{t}_{moe}$.\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_readme_d4782d3e9ff5.png\" alt=\"\" style=\"width: 90%;\" \u002F>\n\u003C\u002Fdiv>\n\nTo validate the effectiveness of the MoE architecture, four settings are compared based on their validation loss curves. The baseline **Wan2.1** model does not employ the MoE architecture. Among the MoE-based variants, the **Wan2.1 & High-Noise Expert** reuses the Wan2.1 model as the low-noise expert while uses the  Wan2.2's high-noise expert, while the **Wan2.1 & Low-Noise Expert** uses Wan2.1 as the high-noise expert and employ the Wan2.2's low-noise expert. The **Wan2.2 (MoE)** (our final version) achieves the lowest validation loss, indicating that its generated video distribution is closest to ground-truth and exhibits superior convergence.\n\n\n##### (2) Efficient High-Definition Hybrid TI2V\nTo enable more efficient deployment, Wan2.2 also explores a high-compression design. In addition to the 27B MoE models, a 5B dense model, i.e., TI2V-5B, is released. It is supported by a high-compression Wan2.2-VAE, which achieves a $T\\times H\\times W$ compression ratio of $4\\times16\\times16$, increasing the overall compression rate to 64 while maintaining high-quality video reconstruction. With an additional patchification layer, the total compression ratio of TI2V-5B reaches $4\\times32\\times32$. Without specific optimization, TI2V-5B can generate a 5-second 720P video in under 9 minutes on a single consumer-grade GPU, ranking among the fastest 720P@24fps video generation models. This model also natively supports both text-to-video and image-to-video tasks within a single unified framework, covering both academic research and practical applications.\n\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_readme_21563712fccd.png\" alt=\"\" style=\"width: 80%;\" \u002F>\n\u003C\u002Fdiv>\n\n\n\n##### Comparisons to SOTAs\nWe compared Wan2.2 with leading closed-source commercial models on our new Wan-Bench 2.0, evaluating performance across multiple crucial dimensions. 
The results demonstrate that Wan2.2 achieves superior performance compared to these leading models.\n\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_readme_aa4a5c84c6ab.png\" alt=\"\" style=\"width: 90%;\" \u002F>\n\u003C\u002Fdiv>\n\n## Citation\nIf you find our work helpful, please cite us.\n\n```\n@article{wan2025,\n      title={Wan: Open and Advanced Large-Scale Video Generative Models}, \n      author={Team Wan and Ang Wang and Baole Ai and Bin Wen and Chaojie Mao and Chen-Wei Xie and Di Chen and Feiwu Yu and Haiming Zhao and Jianxiao Yang and Jianyuan Zeng and Jiayu Wang and Jingfeng Zhang and Jingren Zhou and Jinkai Wang and Jixuan Chen and Kai Zhu and Kang Zhao and Keyu Yan and Lianghua Huang and Mengyang Feng and Ningyi Zhang and Pandeng Li and Pingyu Wu and Ruihang Chu and Ruili Feng and Shiwei Zhang and Siyang Sun and Tao Fang and Tianxing Wang and Tianyi Gui and Tingyu Weng and Tong Shen and Wei Lin and Wei Wang and Wei Wang and Wenmeng Zhou and Wente Wang and Wenting Shen and Wenyuan Yu and Xianzhong Shi and Xiaoming Huang and Xin Xu and Yan Kou and Yangyu Lv and Yifei Li and Yijing Liu and Yiming Wang and Yingya Zhang and Yitong Huang and Yong Li and You Wu and Yu Liu and Yulin Pan and Yun Zheng and Yuntao Hong and Yupeng Shi and Yutong Feng and Zeyinzi Jiang and Zhen Han and Zhi-Fan Wu and Ziyu Liu},\n      journal = {arXiv preprint arXiv:2503.20314},\n      year={2025}\n}\n```\n\n## License Agreement\nThe models in this repository are licensed under the Apache 2.0 License. We claim no rights over the your generated contents, granting you the freedom to use them while ensuring that your usage complies with the provisions of this license. You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations. 
For a complete list of restrictions and details regarding your rights, please refer to the full text of the [license](LICENSE.txt).\n\n\n## Acknowledgements\n\nWe would like to thank the contributors to the [SD3](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fstable-diffusion-3-medium), [Qwen](https:\u002F\u002Fhuggingface.co\u002FQwen), [umt5-xxl](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fumt5-xxl), [diffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers) and [HuggingFace](https:\u002F\u002Fhuggingface.co) repositories, for their open research.\n\n\n\n## Contact Us\nIf you would like to leave a message to our research or product teams, feel free to join our [Discord](https:\u002F\u002Fdiscord.gg\u002FAKNgpMK4Yj) or [WeChat groups](https:\u002F\u002Fgw.alicdn.com\u002Fimgextra\u002Fi2\u002FO1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg)!\n\n","# Wan2.2\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_readme_87f80debdb90.png\" width=\"400\"\u002F>\n\u003Cp>\n\n\u003Cp align=\"center\">\n    💜 \u003Ca href=\"https:\u002F\u002Fwan.video\">\u003Cb>Wan\u003C\u002Fb>\u003C\u002Fa> &nbsp&nbsp ｜ &nbsp&nbsp 🖥️ \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2\">GitHub\u003C\u002Fa> &nbsp&nbsp  | &nbsp&nbsp🤗 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002F\">Hugging Face\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp🤖 \u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Forganization\u002FWan-AI\">ModelScope\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp 📑 \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.20314\">论文\u003C\u002Fa> &nbsp&nbsp | &nbsp&nbsp 📑 \u003Ca href=\"https:\u002F\u002Fwan.video\u002Fwelcome?spm=a2ty_o02.30011076.0.0.6c9ee41eCcluqg\">博客\u003C\u002Fa> &nbsp&nbsp |  &nbsp&nbsp 💬  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FAKNgpMK4Yj\">Discord\u003C\u002Fa>&nbsp&nbsp\n    \u003Cbr>\n    📕 \u003Ca href=\"https:\u002F\u002Falidocs.dingtalk.com\u002Fi\u002Fnodes\u002Fjb9Y4gmKWrx9eo4dCql9LlbYJGXn6lpz\">使用指南(中文)\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp 📘 \u003Ca href=\"https:\u002F\u002Falidocs.dingtalk.com\u002Fi\u002Fnodes\u002FEpGBa2Lm8aZxe5myC99MelA2WgN7R35y\">User Guide(English)\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp💬 \u003Ca href=\"https:\u002F\u002Fgw.alicdn.com\u002Fimgextra\u002Fi2\u002FO1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg\">WeChat(微信)\u003C\u002Fa>&nbsp&nbsp\n\u003Cbr>\n\n-----\n\n[**Wan：开放且先进的大规模视频生成模型**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.20314) \u003Cbe>\n\n\n我们很高兴地推出**Wan2.2**，这是我们基础视频模型的一次重大升级。在**Wan2.2**中，我们重点引入了以下创新：\n\n- 👍 **高效的MoE架构**：Wan2.2将专家混合模型（MoE）架构引入视频扩散模型。通过将去噪过程按时间步分离，并采用专门的高性能专家模型，这一设计在保持相同计算成本的同时，显著提升了整体模型容量。\n\n- 👍 **电影级美学效果**：Wan2.2整合了精心筛选的美学数据，包含光照、构图、对比度、色调等详细标注。这使得电影风格的生成更加精准可控，便于创作出具有可定制美学偏好的视频。\n\n- 👍 **复杂运动生成**：与Wan2.1相比，Wan2.2在训练数据上有了大幅扩展，图像数量增加了65.6%，视频数量增加了83.2%。这种扩展显著增强了模型在运动、语义和美学等多个维度上的泛化能力，使其在所有开源及闭源模型中均处于顶尖水平。\n\n- 👍 **高效的高清混合TI2V**：Wan2.2开源了一款基于我们先进Wan2.2-VAE构建的5B模型，其压缩比达到**16×16×4**。该模型支持文本到视频和图像到视频生成，分辨率为720P、帧率24fps，并且可在消费级显卡如4090上运行。它是目前最快的**720P@24fps**模型之一，能够同时服务于工业界和学术界。\n\n\n## 视频演示\n\n\u003Cdiv align=\"center\">\n  \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb63bfa58-d5d7-4de6-a1a2-98970b06d9a7\" width=\"70%\" poster=\"\"> \u003C\u002Fvideo>\n\u003C\u002Fdiv>\n\n## 🔥 最新消息！！\n* 2025年11月13日：👋 
Wan2.2-Animate-14B已集成至Diffusers（[PR](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\u002Fpull\u002F12526),[权重](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-Animate-14B-Diffusers)）。感谢所有社区贡献者！尽情体验吧！\n\n* 2025年9月19日：💃 我们推出了**[Wan2.2-Animate-14B](https:\u002F\u002Fhumanaigc.github.io\u002Fwan-animate)**，一款用于角色动画与替换的统一模型，具备全面的动作与表情复制能力。我们已发布[模型权重](#model-download)和[推理代码](#run-wan-animate)。您可以在[wan.video](https:\u002F\u002Fwan.video\u002F)、[ModelScope Studio](https:\u002F\u002Fwww.modelscope.cn\u002Fstudios\u002FWan-AI\u002FWan2.2-Animate)或[HuggingFace Space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FWan-AI\u002FWan2.2-Animate)上试用！\n* 2025年8月26日：🎵 我们推出了**[Wan2.2-S2V-14B](https:\u002F\u002Fhumanaigc.github.io\u002Fwan-s2v-webpage)**，一款基于音频驱动的电影级视频生成模型，包括[推理代码](#run-speech-to-video-generation)、[模型权重](#model-download)以及[技术报告](https:\u002F\u002Fhumanaigc.github.io\u002Fwan-s2v-webpage\u002Fcontent\u002Fwan-s2v.pdf)！现在您可以在[wan.video](https:\u002F\u002Fwan.video\u002F)、[ModelScope Gradio](https:\u002F\u002Fwww.modelscope.cn\u002Fstudios\u002FWan-AI\u002FWan2.2-S2V)或[HuggingFace Gradio](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FWan-AI\u002FWan2.2-S2V)上试用！\n* 2025年7月28日：👋 我们已开放一个使用TI2V-5B模型的[HF空间](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FWan-AI\u002FWan-2.2-5B)。尽情体验吧！\n* 2025年7月28日：👋 Wan2.2已集成至ComfyUI（[CN](https:\u002F\u002Fdocs.comfy.org\u002Fzh-CN\u002Ftutorials\u002Fvideo\u002Fwan\u002Fwan2_2) | [EN](https:\u002F\u002Fdocs.comfy.org\u002Ftutorials\u002Fvideo\u002Fwan\u002Fwan2_2)）。尽情体验吧！\n* 2025年7月28日：👋 Wan2.2的T2V、I2V和TI2V均已集成至Diffusers（[T2V-A14B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-T2V-A14B-Diffusers) | [I2V-A14B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-I2V-A14B-Diffusers) | [TI2V-5B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-TI2V-5B-Diffusers)）。欢迎随时尝试！\n* 2025年7月28日：👋 我们已发布**Wan2.2**的推理代码和模型权重。\n* 2025年9月5日：👋 我们为语音到视频生成任务新增了基于[CosyVoice](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FCosyVoice)的文本转语音合成支持。\n\n## 社区成果\n如果您在研究或项目中基于[**Wan2.1**](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.1)或[**Wan2.2**](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2)开展工作，并希望让更多人了解您的成果，请告知我们。\n\n- [Prompt Relay](https:\u002F\u002Fgithub.com\u002FGordonChen19\u002FPrompt-Relay)，一种即插即用、推理阶段实现视频生成时间控制的方法。Prompt Relay能够提升视频质量，并让用户对视频中每一时刻的内容进行精准控制。更多详情请访问其[网页](https:\u002F\u002Fgordonchen19.github.io\u002FPrompt-Relay\u002F)。\n- [Helios](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FHelios)，一款基于**Wan2.1**的突破性视频生成模型，能够在单张H100 GPU上以19.5 FPS的速度实现分钟级高质量视频合成（而在单张Ascend NPU上约为10 FPS）——且无需依赖传统的长视频防漂移策略或标准的视频加速技术。更多详情请访问其[网页](https:\u002F\u002Fpku-yuangroup.github.io\u002FHelios-Page\u002F)。\n- [LightX2V](https:\u002F\u002Fgithub.com\u002FModelTC\u002FLightX2V)，一个轻量高效的视频生成框架，集成了**Wan2.1**和**Wan2.2**，支持多种工程加速技术以实现快速推理。[LightX2V-HuggingFace](https:\u002F\u002Fhuggingface.co\u002Flightx2v)提供了多种基于Wan的步进蒸馏模型、量化模型以及轻量级VAE模型。\n- [HuMo](https:\u002F\u002Fgithub.com\u002FPhantom-video\u002FHuMo)提出了一种基于**Wan**的统一、以人为中心的框架，能够从多模态输入——包括文本、图像和音频——生成高质量、精细可控的人体视频。更多详情请访问其[网页](https:\u002F\u002Fphantom-video.github.io\u002FHuMo\u002F)。\n- [FastVideo](https:\u002F\u002Fgithub.com\u002Fhao-ai-lab\u002FFastVideo)包含经过蒸馏的**Wan**模型，采用稀疏注意力机制，显著加快了推理速度。\n- [Cache-dit](https:\u002F\u002Fgithub.com\u002Fvipshop\u002Fcache-dit)为**Wan2.2** MoE提供全缓存加速支持，结合DBCache、TaylorSeer和Cache CFG。更多详情请访问其[示例](https:\u002F\u002Fgithub.com\u002Fvipshop\u002Fcache-dit\u002Fblob\u002Fmain\u002Fexamples\u002Fpipeline\u002Frun_wan_2.2.py)。\n- 
[Kijai's ComfyUI WanVideoWrapper](https:\u002F\u002Fgithub.com\u002Fkijai\u002FComfyUI-WanVideoWrapper)是ComfyUI的**Wan**模型替代实现。由于其专注于Wan本身，因此能够第一时间获得前沿优化与热门研究功能，而这些功能往往因ComfyUI结构较为 rigid 而难以快速集成。\n- [DiffSynth-Studio](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FDiffSynth-Studio)为**Wan 2.2**提供全面支持，包括低GPU内存逐层卸载、FP8量化、序列并行、LoRA训练以及完整训练等。\n\n\n## 📑 待办事项清单\n- Wan2.2 文本转视频\n    - [x] A14B和14B模型的多GPU推理代码\n    - [x] A14B和14B模型的检查点\n    - [x] ComfyUI集成\n    - [x] Diffusers集成\n- Wan2.2 图像转视频\n    - [x] A14B模型的多GPU推理代码\n    - [x] A14B模型的检查点\n    - [x] ComfyUI集成\n    - [x] Diffusers集成\n- Wan2.2 文本-图像转视频\n    - [x] 5B模型的多GPU推理代码\n    - [x] 5B模型的检查点\n    - [x] ComfyUI集成\n    - [x] Diffusers集成\n- Wan2.2-S2V 语音转视频\n    - [x] Wan2.2-S2V的推理代码\n    - [x] Wan2.2-S2V-14B的检查点\n    - [x] ComfyUI集成\n    - [x] Diffusers集成\n- Wan2.2-动画角色动画与替换\n    - [x] Wan2.2-Animate的推理代码\n    - [x] Wan2.2-Animate的检查点\n    - [x] ComfyUI集成\n    - [x] Diffusers集成\n\n## 运行Wan2.2\n\n#### 安装\n克隆仓库：\n```sh\ngit clone https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2.git\ncd Wan2.2\n```\n\n安装依赖：\n```sh\n# 确保torch >= 2.4.0\n# 如果`flash_attn`安装失败，可先安装其他包再最后安装`flash_attn`\npip install -r requirements.txt\n# 如果想使用CosyVoice为语音转视频生成合成语音，请额外安装requirements_s2v.txt\npip install -r requirements_s2v.txt\n```\n\n\n#### 模型下载\n\n| 模型              | 下载链接                                                                                                                              | 描述 |\n|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------|-------------|\n| T2V-A14B    | 🤗 [Huggingface](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-T2V-A14B)    🤖 [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.2-T2V-A14B)    | 文本转视频MoE模型，支持480P与720P |\n| I2V-A14B    | 🤗 [Huggingface](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-I2V-A14B)    🤖 [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.2-I2V-A14B)    | 图像转视频MoE模型，支持480P与720P |\n| TI2V-5B     | 🤗 [Huggingface](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-TI2V-5B)     🤖 [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.2-TI2V-5B)     | 高压缩VAE，T2V+I2V，支持720P |\n| S2V-14B     | 🤗 [Huggingface](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-S2V-14B)     🤖 [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.2-S2V-14B)     | 语音转视频模型，支持480P与720P |\n| Animate-14B | 🤗 [Huggingface](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-Animate-14B) 🤖 [ModelScope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FWan-AI\u002FWan2.2-Animate-14B)  | 角色动画与替换 | |\n\n\n\n> 💡注意： \n> TI2V-5B模型支持以**24 FPS**生成720P视频。\n\n\n使用huggingface-cli下载模型：\n``` sh\npip install \"huggingface_hub[cli]\"\nhuggingface-cli download Wan-AI\u002FWan2.2-T2V-A14B --local-dir .\u002FWan2.2-T2V-A14B\n```\n\n使用modelscope-cli下载模型：\n``` sh\npip install modelscope\nmodelscope download Wan-AI\u002FWan2.2-T2V-A14B --local_dir .\u002FWan2.2-T2V-A14B\n```\n\n#### 运行文本转视频生成\n\n本仓库支持`Wan2.2-T2V-A14B`文本转视频模型，并可同时支持480P与720P分辨率的视频生成。\n\n\n##### (1) 不使用提示扩展\n\n为便于实施，我们将从一个基础版本的推理流程开始，跳过[提示扩展](#2-using-prompt-extention)步骤。\n\n- 单GPU推理\n\n``` sh\npython generate.py  --task t2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-T2V-A14B --offload_model True --convert_model_dtype --prompt \"两只穿着舒适拳击装备、戴着亮色手套的人形猫在聚光灯下的舞台上激烈搏斗。\"\n```\n\n> 💡 此命令可在至少拥有80GB显存的GPU上运行。\n\n> 
💡若遇到OOM（显存不足）问题，可使用`--offload_model True`、`--convert_model_dtype`以及`--t5_cpu`选项来降低显存占用。\n\n\n- 使用FSDP + DeepSpeed Ulysses的多GPU推理\n\n我们使用 [PyTorch FSDP](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Ffsdp.html) 和 [DeepSpeed Ulysses](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.14509) 来加速推理。\n\n\n``` sh\ntorchrun --nproc_per_node=8 generate.py --task t2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt \"两只穿着舒适拳击装备、戴着亮色手套的人形猫在聚光灯下的舞台上激烈搏斗。\"\n```\n\n\n##### (2) 使用提示扩展\n\n扩展提示可以有效丰富生成视频的细节，进一步提升视频质量。因此，我们建议启用提示扩展。我们提供以下两种提示扩展方法：\n\n- 使用 Dashscope API 进行扩展。\n  - 提前申请 `dashscope.api_key`（[英文](https:\u002F\u002Fwww.alibabacloud.com\u002Fhelp\u002Fen\u002Fmodel-studio\u002Fgetting-started\u002Ffirst-api-call-to-qwen) | [中文](https:\u002F\u002Fhelp.aliyun.com\u002Fzh\u002Fmodel-studio\u002Fgetting-started\u002Ffirst-api-call-to-qwen)）。\n  - 配置环境变量 `DASH_API_KEY` 以指定 Dashscope API 密钥。对于阿里云国际站用户，还需将环境变量 `DASH_API_URL` 设置为 'https:\u002F\u002Fdashscope-intl.aliyuncs.com\u002Fapi\u002Fv1'。更多详细说明请参阅 [dashscope 文档](https:\u002F\u002Fwww.alibabacloud.com\u002Fhelp\u002Fen\u002Fmodel-studio\u002Fdeveloper-reference\u002Fuse-qwen-by-calling-api?spm=a2c63.p38356.0.i1)。\n  - 文本转视频任务使用 `qwen-plus` 模型，图像转视频任务使用 `qwen-vl-max` 模型。\n  - 可通过参数 `--prompt_extend_model` 修改用于扩展的模型。例如：\n```sh\nDASH_API_KEY=your_key torchrun --nproc_per_node=8 generate.py  --task t2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt \"两只穿着舒适拳击装备、戴着亮色手套的人形猫在聚光灯下的舞台上激烈搏斗\" --use_prompt_extend --prompt_extend_method 'dashscope' --prompt_extend_target_lang 'zh'\n```\n\n- 使用本地模型进行扩展。\n\n  - 默认情况下，此扩展使用 HuggingFace 上的 Qwen 模型。用户可根据可用显存大小选择 Qwen 模型或其他模型。\n  - 对于文本转视频任务，可使用 `Qwen\u002FQwen2.5-14B-Instruct`、`Qwen\u002FQwen2.5-7B-Instruct` 和 `Qwen\u002FQwen2.5-3B-Instruct` 等模型。\n  - 对于图像转视频任务，可使用 `Qwen\u002FQwen2.5-VL-7B-Instruct` 和 `Qwen\u002FQwen2.5-VL-3B-Instruct` 等模型。\n  - 较大的模型通常能提供更好的扩展效果，但需要更多的显存。\n  - 可通过参数 `--prompt_extend_model` 修改用于扩展的模型，允许指定本地模型路径或 HuggingFace 模型。例如：\n\n``` sh\ntorchrun --nproc_per_node=8 generate.py  --task t2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt \"两只穿着舒适拳击装备、戴着亮色手套的人形猫在聚光灯下的舞台上激烈搏斗\" --use_prompt_extend --prompt_extend_method 'local_qwen' --prompt_extend_target_lang 'zh'\n```\n\n\n#### 运行图像转视频生成\n\n本仓库支持 `Wan2.2-I2V-A14B` 图像转视频模型，并可同时支持 480P 和 720P 分辨率的视频生成。\n\n\n- 单 GPU 推理\n```sh\npython generate.py --task i2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-I2V-A14B --offload_model True --convert_model_dtype --image examples\u002Fi2v_input.JPG --prompt \"夏日海滩度假风格，一只戴着太阳镜的白猫坐在冲浪板上。毛茸茸的猫咪以轻松的表情直视镜头。模糊的海滩风景作为背景，呈现出清澈见底的海水、远处的绿丘和点缀着白云的蓝天。猫咪摆出自然放松的姿势，仿佛在享受海风与温暖的阳光。特写镜头突出了猫咪的细腻毛发和海滨的清新氛围。\"\n```\n\n> 此命令可在至少配备 80GB 显存的 GPU 上运行。\n\n> 💡对于图像转视频任务，`size` 参数表示生成视频的区域，长宽比遵循原始输入图像的比例。\n\n\n- 使用 FSDP + DeepSpeed Ulysses 的多 GPU 推理\n\n```sh\ntorchrun --nproc_per_node=8 generate.py --task i2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-I2V-A14B --image examples\u002Fi2v_input.JPG --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt \"夏日海滩度假风格，一只戴着太阳镜的白猫坐在冲浪板上。毛茸茸的猫咪以轻松的表情直视镜头。模糊的海滩风景作为背景，呈现出清澈见底的海水、远处的绿丘和点缀着白云的蓝天。猫咪摆出自然放松的姿势，仿佛在享受海风与温暖的阳光。特写镜头突出了猫咪的细腻毛发和海滨的清新氛围。\"\n```\n\n- 无提示的图像转视频生成\n\n```sh\nDASH_API_KEY=your_key torchrun --nproc_per_node=8 generate.py --task i2v-A14B --size 1280*720 --ckpt_dir .\u002FWan2.2-I2V-A14B --prompt '' --image examples\u002Fi2v_input.JPG --dit_fsdp --t5_fsdp --ulysses_size 8 --use_prompt_extend --prompt_extend_method 
'dashscope'\n```\n\n> 💡该模型仅根据输入图像即可生成视频。您可以通过提示扩展从图像中生成提示。\n\n> 提示扩展的过程可参考 [这里](#2-using-prompt-extention)。\n\n#### 运行文本-图像转视频生成\n\n本仓库支持 `Wan2.2-TI2V-5B` 文本-图像转视频模型，并可支持 720P 分辨率的视频生成。\n\n\n- 单 GPU 文本转视频推理\n```sh\npython generate.py --task ti2v-5B --size 1280*704 --ckpt_dir .\u002FWan2.2-TI2V-5B --offload_model True --convert_model_dtype --t5_cpu --prompt \"两只穿着舒适拳击装备、戴着亮色手套的人形猫在聚光灯下的舞台上激烈搏斗\"\n```\n\n> 💡与其他任务不同，文本-图像转视频任务的 720P 分辨率是 `1280*704` 或 `704*1280`。\n\n> 此命令可在至少配备 24GB 显存的 GPU 上运行（例如 RTX 4090 GPU）。\n\n> 💡如果您使用的 GPU 至少配备 80GB 显存，可以去掉 `--offload_model True`、`--convert_model_dtype` 和 `--t5_cpu` 选项以加快执行速度。\n\n\n- 单 GPU 图像转视频推理\n```sh\npython generate.py --task ti2v-5B --size 1280*704 --ckpt_dir .\u002FWan2.2-TI2V-5B --offload_model True --convert_model_dtype --t5_cpu --image examples\u002Fi2v_input.JPG --prompt \"夏日海滩度假风格，一只戴着太阳镜的白猫坐在冲浪板上。毛茸茸的猫咪以轻松的表情直视镜头。模糊的海滩风景作为背景，呈现出清澈见底的海水、远处的绿丘和点缀着白云的蓝天。猫咪摆出自然放松的姿势，仿佛在享受海风与温暖的阳光。特写镜头突出了猫咪的细腻毛发和海滨的清新氛围。\"\n```\n\n> 💡如果配置了图像参数，则为图像转视频生成；否则，默认为文本转视频生成。\n\n> 💡与图像转视频类似，`size`参数表示生成视频的区域大小，其宽高比遵循原始输入图像的宽高比。\n\n\n- 使用FSDP + DeepSpeed Ulysses进行多GPU推理\n\n```sh\ntorchrun --nproc_per_node=8 generate.py --task ti2v-5B --size 1280*704 --ckpt_dir .\u002FWan2.2-TI2V-5B --dit_fsdp --t5_fsdp --ulysses_size 8 --image examples\u002Fi2v_input.JPG --prompt \"夏日海滩度假风格，一只戴着太阳镜的白猫坐在冲浪板上。毛茸茸的猫咪以轻松的表情直视镜头。背景是模糊的海滩景色，包括清澈见底的海水、远处的绿丘和点缀着白云的蓝天。猫咪的姿态自然放松，仿佛在享受海风与温暖的阳光。特写镜头突出了猫咪的细腻毛发与海滨的清新氛围。\"\n```\n\n> 提示词扩展的过程可参考[此处](#2-using-prompt-extention)。\n\n#### 运行语音转视频生成\n\n本仓库支持“Wan2.2-S2V-14B”语音转视频模型，并可同时支持480P和720P分辨率的视频生成。\n\n- 单GPU语音转视频推理\n\n```sh\npython generate.py  --task s2v-14B --size 1024*704 --ckpt_dir .\u002FWan2.2-S2V-14B\u002F --offload_model True --convert_model_dtype --prompt \"夏日海滩度假风格，一只戴着太阳镜的白猫坐在冲浪板上。\"  --image \"examples\u002Fi2v_input.JPG\" --audio \"examples\u002Ftalk.wav\"\n\n\n# 如果不设置--num_clip，生成的视频长度将根据输入音频的长度自动调整\n\n# 可以使用CosyVoice通过--enable_tts生成音频\npython generate.py  --task s2v-14B --size 1024*704 --ckpt_dir .\u002FWan2.2-S2V-14B\u002F --offload_model True --convert_model_dtype --prompt \"夏日海滩度假风格，一只戴着太阳镜的白猫坐在冲浪板上。\"  --image \"examples\u002Fi2v_input.JPG\" --enable_tts --tts_prompt_audio \"examples\u002Fzero_shot_prompt.wav\" --tts_prompt_text \"希望你以后能够做的比我还好呦。\" --tts_text \"收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。\"\n```\n\n> 💡该命令可在至少配备80GB显存的GPU上运行。\n\n- 使用FSDP + DeepSpeed Ulysses进行多GPU推理\n\n```sh\ntorchrun --nproc_per_node=8 generate.py --task s2v-14B --size 1024*704 --ckpt_dir .\u002FWan2.2-S2V-14B\u002F --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt \"夏日海滩度假风格，一只戴着太阳镜的白猫坐在冲浪板上。\" --image \"examples\u002Fi2v_input.JPG\" --audio \"examples\u002Ftalk.wav\"\n```\n\n- 姿势+音频驱动生成\n\n```sh\ntorchrun --nproc_per_node=8 generate.py --task s2v-14B --size 1024*704 --ckpt_dir .\u002FWan2.2-S2V-14B\u002F --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt \"一个人在唱歌\" --image \"examples\u002Fpose.png\" --audio \"examples\u002Fsing.MP3\" --pose_video \".\u002Fexamples\u002Fpose.mp4\" \n```\n\n> 💡对于语音转视频任务，`size`参数表示生成视频的区域大小，其宽高比遵循原始输入图像的宽高比。\n\n> 💡该模型可以根据音频输入结合参考图像以及可选的文本提示生成视频。\n\n> 💡`--pose_video`参数启用姿势驱动生成，使模型能够在生成与音频输入同步的视频时遵循特定的姿势序列。\n\n> 💡`--num_clip`参数控制生成的视频片段数量，有助于以较短的生成时间进行快速预览。\n\n请访问我们的项目页面，查看更多示例并了解该模型适用的场景。\n\n#### 运行Wan-Animate\n\nWan-Animate以视频和人物图像作为输入，生成“动画”或“替换”模式的视频。\n\n1. 动画模式：模型生成的人物图像视频会模仿输入视频中的人类动作。\n2. 
替换模式：模型将输入视频中的人物替换为给定人物图像中的角色。\n\n请访问我们的[项目页面](https:\u002F\u002Fhumanaigc.github.io\u002Fwan-animate)，查看更多示例并了解该模型适用的场景。\n\n##### (1) 预处理 \n输入视频应在进入推理流程前被预处理成若干素材。请参考以下处理流程，更多关于预处理的细节可在[用户指南](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2\u002Fblob\u002Fmain\u002Fwan\u002Fmodules\u002Fanimate\u002Fpreprocess\u002FUserGuider.md)中找到。\n\n* 对于动画\n```bash\npython .\u002Fwan\u002Fmodules\u002Fanimate\u002Fpreprocess\u002Fpreprocess_data.py \\\n    --ckpt_path .\u002FWan2.2-Animate-14B\u002Fprocess_checkpoint \\\n    --video_path .\u002Fexamples\u002Fwan_animate\u002Fanimate\u002Fvideo.mp4 \\\n    --refer_path .\u002Fexamples\u002Fwan_animate\u002Fanimate\u002Fimage.jpeg \\\n    --save_path .\u002Fexamples\u002Fwan_animate\u002Fanimate\u002Fprocess_results \\\n    --resolution_area 1280 720 \\\n    --retarget_flag \\\n    --use_flux\n```\n* 对于替换\n```bash\npython .\u002Fwan\u002Fmodules\u002Fanimate\u002Fpreprocess\u002Fpreprocess_data.py \\\n    --ckpt_path .\u002FWan2.2-Animate-14B\u002Fprocess_checkpoint \\\n    --video_path .\u002Fexamples\u002Fwan_animate\u002Freplace\u002Fvideo.mp4 \\\n    --refer_path .\u002Fexamples\u002Fwan_animate\u002Freplace\u002Fimage.jpeg \\\n    --save_path .\u002Fexamples\u002Fwan_animate\u002Freplace\u002Fprocess_results \\\n    --resolution_area 1280 720 \\\n    --iterations 3 \\\n    --k 7 \\\n    --w_len 1 \\\n    --h_len 1 \\\n    --replace_flag\n```\n##### (2) 在动画模式下运行 \n\n* 单GPU推理 \n\n```bash\npython generate.py --task animate-14B --ckpt_dir .\u002FWan2.2-Animate-14B\u002F --src_root_path .\u002Fexamples\u002Fwan_animate\u002Fanimate\u002Fprocess_results\u002F --refert_num 1\n```\n\n* 使用FSDP + DeepSpeed Ulysses进行多GPU推理\n\n```bash\npython -m torch.distributed.run --nnodes 1 --nproc_per_node 8 generate.py --task animate-14B --ckpt_dir .\u002FWan2.2-Animate-14B\u002F --src_root_path .\u002Fexamples\u002Fwan_animate\u002Fanimate\u002Fprocess_results\u002F --refert_num 1 --dit_fsdp --t5_fsdp --ulysses_size 8\n```\n\n* Diffusers流水线\n\n```python\nimport torch  # 指定 dtype 与随机种子时需要用到 torch\nfrom diffusers import WanAnimatePipeline\nfrom diffusers.utils import export_to_video, load_image, load_video\n\ndevice = \"cuda:0\"\ndtype = torch.bfloat16\nmodel_id = \"Wan-AI\u002FWan2.2-Animate-14B-Diffusers\"\npipe = WanAnimatePipeline.from_pretrained(model_id, torch_dtype=dtype)\npipe.to(device)\n\nseed = 42\nprompt = \"视频中的人正在做各种动作。\"\n\n# 动画\nimage = load_image(\"\u002Fpath\u002Fto\u002Fanimate\u002Freference\u002Fimage\u002Fsrc_ref.png\")\npose_video = load_video(\"\u002Fpath\u002Fto\u002Fanimate\u002Fpose\u002Fvideo\u002Fsrc_pose.mp4\")\nface_video = load_video(\"\u002Fpath\u002Fto\u002Fanimate\u002Fface\u002Fvideo\u002Fsrc_face.mp4\")\n\nanimate_video = pipe(\n    image=image,\n    pose_video=pose_video,\n    face_video=face_video,\n    prompt=prompt,\n    mode=\"animate\",\n    segment_frame_length=77,  # 原代码中的 clip_len\n    prev_segment_conditioning_frames=1,  # 原代码中的 refert_num\n    guidance_scale=1.0,\n    num_inference_steps=20,\n    generator=torch.Generator(device=device).manual_seed(seed),\n).frames[0]\nexport_to_video(animate_video, \"diffusers_animate.mp4\", fps=30)\n```\n\n##### (3) 替换模式运行\n\n* 单GPU推理\n\n```bash\npython generate.py --task animate-14B --ckpt_dir .\u002FWan2.2-Animate-14B\u002F --src_root_path .\u002Fexamples\u002Fwan_animate\u002Freplace\u002Fprocess_results\u002F --refert_num 1 --replace_flag --use_relighting_lora \n```\n\n* 使用FSDP + DeepSpeed Ulysses的多GPU推理\n\n```bash\npython -m torch.distributed.run --nnodes 1 --nproc_per_node 8 generate.py --task animate-14B --ckpt_dir 
.\u002FWan2.2-Animate-14B\u002F --src_root_path .\u002Fexamples\u002Fwan_animate\u002Freplace\u002Fprocess_results\u002Fsrc_pose.mp4  --refert_num 1 --replace_flag --use_relighting_lora --dit_fsdp --t5_fsdp --ulysses_size 8\n```\n\n* Diffusers Pipeline\n\n```python\n# 创建管道，与动画代码中一致 ☝️\n\n# 替换\nimage = load_image(\"\u002Fpath\u002Fto\u002Freplace\u002Freference\u002Fimage\u002Fsrc_ref.png\")\npose_video = load_video(\"\u002Fpath\u002Fto\u002Freplace\u002Fpose\u002Fvideo\u002Fsrc_pose.mp4\")\nface_video = load_video(\"\u002Fpath\u002Fto\u002Freplace\u002Fface\u002Fvideo\u002Fsrc_face.mp4\")\nbackground_video = load_video(\"\u002Fpath\u002Fto\u002Freplace\u002Fbackground\u002Fvideo\u002Fsrc_bg.mp4\")\nmask_video = load_video(\"\u002Fpath\u002Fto\u002Freplace\u002Fmask\u002Fvideo\u002Fsrc_mask.mp4\")\n\nreplace_video = pipe(\n    image=image,\n    pose_video=pose_video,\n    face_video=face_video,\n    background_video=background_video,\n    mask_video=mask_video,\n    prompt=prompt,\n    mode=\"replace\",\n    segment_frame_length=77,  # 原代码中的 clip_len\n    prev_segment_conditioning_frames=1,  # 原代码中的 refert_num\n    guidance_scale=1.0,\n    num_inference_steps=20,\n    generator=torch.Generator(device=device).manual_seed(seed),\n).frames[0]\nexport_to_video(replace_video, \"diffusers_replace.mp4\", fps=30)\n```\n\n> 💡 如果您使用的是 **Wan-Animate**，我们不建议使用在 `Wan2.2` 上训练的 LoRA 模型，因为训练过程中权重的变化可能导致意外行为。\n\n## 不同GPU上的计算效率\n\n我们在下表中测试了不同 **Wan2.2** 模型在不同GPU上的计算效率。结果以 **总时间（秒）\u002F GPU峰值显存（GB）** 的形式呈现。\n\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_readme_f18a1a4d96fd.png\" alt=\"\" style=\"width: 80%;\" \u002F>\n\u003C\u002Fdiv>\n\n> 本表中所展示测试的参数设置如下：\n> (1) 多GPU：14B：`--ulysses_size 4\u002F8 --dit_fsdp --t5_fsdp`，5B：`--ulysses_size 4\u002F8 --offload_model True --convert_model_dtype --t5_cpu`；单GPU：14B：`--offload_model True --convert_model_dtype`，5B：`--offload_model True --convert_model_dtype --t5_cpu`（`--convert_model_dtype` 将模型参数类型转换为 config.param_dtype）；\n> (2) 分布式测试采用内置的 FSDP 和 Ulysses 实现，并在 Hopper 架构的 GPU 上部署 FlashAttention3；\n> (3) 测试未启用 `--use_prompt_extend` 标志；\n> (4) 报告的结果为预热阶段后多次采样的平均值。\n\n\n-------\n\n## Wan2.2 简介\n\n**Wan2.2** 在 Wan2.1 的基础上进行了重大升级，显著提升了生成质量和模型能力。此次升级得益于一系列关键技术革新，主要包括混合专家（MoE）架构、训练数据的优化以及高压缩率视频生成技术。\n\n##### (1) 混合专家（MoE）架构\n\nWan2.2 将混合专家（MoE）架构引入了视频生成扩散模型。MoE 在大型语言模型中已被广泛验证，是一种在保持推理成本几乎不变的同时有效提升模型总参数量的方法。在 Wan2.2 中，A14B 系列模型采用了专为扩散模型去噪过程量身定制的双专家设计：早期阶段使用高噪声专家，侧重于整体布局；后期阶段则切换为低噪声专家，专注于细化视频细节。每个专家模型约有 140 亿参数，因此总参数量达到 270 亿，但每一步仅激活 140 亿个参数，从而将推理计算量和 GPU 显存占用基本维持在原有水平。\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_readme_3a2c5479c154.png\" alt=\"\" style=\"width: 90%;\" \u002F>\n\u003C\u002Fdiv>\n\n两组专家之间的切换点由信噪比（SNR）决定，而信噪比会随着去噪步骤 $t$ 的增加而单调下降。在去噪过程的初始阶段，$t$ 较大且噪声水平较高，因此信噪比处于最低值，记为 ${SNR}_{min}$。在此阶段，高噪声专家被激活。我们定义一个对应于 ${SNR}_{min}$ 一半的阈值步数 ${t}_{moe}$，当 $t\u003C{t}_{moe}$ 时，便切换至低噪声专家。\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_readme_d4782d3e9ff5.png\" alt=\"\" style=\"width: 90%;\" \u002F>\n\u003C\u002Fdiv>\n\n为了验证 MoE 架构的有效性，我们基于其验证损失曲线对比了四种设置。基准模型 **Wan2.1** 并未采用 MoE 架构。在基于 MoE 的变体中，**Wan2.1 & 高噪声专家** 以 Wan2.1 模型作为低噪声专家，同时使用 Wan2.2 的高噪声专家；而 **Wan2.1 & 低噪声专家** 则以 Wan2.1 作为高噪声专家，并采用 Wan2.2 的低噪声专家。最终版本 **Wan2.2 (MoE)** 的验证损失最低，表明其生成的视频分布最接近真实数据，且具有更优的收敛性能。\n\n\n##### (2) 高效高清混合 TI2V\n为实现更高效的部署，Wan2.2 还探索了一种高压缩设计。除了 270 亿参数的 MoE 模型外，我们还发布了一个 50 
亿参数的稠密模型——TI2V-5B。该模型由一个高压缩比的 Wan2.2-VAE 支持，可实现 $T\times H\times W$ 压缩比为 $4\times16\times16$，使整体压缩率提升至 64，同时保持高质量的视频重建效果。此外，通过添加一个补丁化层，TI2V-5B 的总压缩比可达 $4\times32\times32$。在无需特殊优化的情况下，TI2V-5B 仅凭单张消费级 GPU 即可在不到 9 分钟内生成一段 5 秒的 720P 视频，跻身最快的 720P@24fps 视频生成模型之列。此外，该模型原生支持文本到视频与图像到视频两类任务，统一在一个框架内完成，既适用于学术研究，也满足实际应用需求。\n\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_readme_21563712fccd.png\" alt=\"\" style=\"width: 80%;\" \u002F>\n\u003C\u002Fdiv>\n\n\n\n##### 与 SOTA 的对比\n我们在全新的 Wan-Bench 2.0 上将 Wan2.2 与领先的闭源商业模型进行了对比，从多个关键维度评估了性能。结果表明，Wan2.2 的表现优于这些领先模型。\n\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_readme_aa4a5c84c6ab.png\" alt=\"\" style=\"width: 90%;\" \u002F>\n\u003C\u002Fdiv>\n\n## 引用\n如果您觉得我们的工作有所帮助，请引用我们（BibTeX 条目保留英文原文以便直接使用）。\n\n```\n@article{wan2025,\n      title={Wan: Open and Advanced Large-Scale Video Generative Models}, \n      author={Wan Team and Ang Wang and Baole Ai and Bin Wen and Chaojie Mao and Chen-Wei Xie and Di Chen and Feiwu Yu and Haiming Zhao and Jianxiao Yang and Jianyuan Zeng and Jiayu Wang and Jingfeng Zhang and Jingren Zhou and Jinkai Wang and Jixuan Chen and Kai Zhu and Kang Zhao and Keyu Yan and Lianghua Huang and Mengyang Feng and Ningyi Zhang and Pandeng Li and Pingyu Wu and Ruihang Chu and Ruili Feng and Shiwei Zhang and Siyang Sun and Tao Fang and Tianxing Wang and Tianyi Gui and Tingyu Weng and Tong Shen and Wei Lin and Wei Wang and Wei Wang and Wenmeng Zhou and Wente Wang and Wenting Shen and Wenyuan Yu and Xianzhong Shi and Xiaoming Huang and Xin Xu and Yan Kou and Yangyu Lv and Yifei Li and Yijing Liu and Yiming Wang and Yingya Zhang and Yitong Huang and Yong Li and You Wu and Yu Liu and Yulin Pan and Yun Zheng and Yuntao Hong and Yupeng Shi and Yutong Feng and Zeyinzi Jiang and Zhen Han and Zhi-Fan Wu and Ziyu Liu},\n      journal = {arXiv preprint arXiv:2503.20314},\n      year={2025}\n}\n```\n\n## 许可协议\n本仓库中的模型均采用 Apache 2.0 许可协议授权。我们对您生成的内容不主张任何权利，赋予您自由使用这些内容的权利，同时确保您的使用符合本许可协议的规定。您需对模型的使用承担全部责任，不得分享任何违反相关法律、对个人或群体造成伤害、传播用于实施伤害的个人信息、散布虚假信息或针对弱势群体的内容。有关完整限制及您的权利详情，请参阅 [许可证](LICENSE.txt) 全文。\n\n\n## 致谢\n\n我们衷心感谢 [SD3](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fstable-diffusion-3-medium)、[Qwen](https:\u002F\u002Fhuggingface.co\u002FQwen)、[umt5-xxl](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fumt5-xxl)、[diffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers) 以及 [HuggingFace](https:\u002F\u002Fhuggingface.co) 项目的贡献者们，感谢他们开放的研究成果。\n\n\n## 联系我们\n如果您想向我们的研究或产品团队留言，欢迎加入我们的 [Discord](https:\u002F\u002Fdiscord.gg\u002FAKNgpMK4Yj) 或 [微信交流群](https:\u002F\u002Fgw.alicdn.com\u002Fimgextra\u002Fi2\u002FO1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg)!","# Wan2.2 快速上手指南\n\nWan2.2 是一款先进的开源大规模视频生成模型，支持文生视频（T2V）、图生视频（I2V）、图文生视频（TI2V）及语音生视频（S2V）。其采用高效的 MoE 架构，能在消费级显卡（如 RTX 4090）上运行 720P@24fps 的高清视频生成任务。\n\n## 1. 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐) 或 Windows\n*   **Python**: 3.8 及以上版本\n*   **PyTorch**: 版本需 >= 2.4.0\n*   **GPU**: \n    *   推荐 NVIDIA GPU (支持 CUDA)。\n    *   运行 5B 模型（720P）建议显存 >= 24GB (如 RTX 4090)，配合量化或卸载技术可在更低显存运行。\n    *   运行 14B 模型建议显存 >= 80GB (如 A100\u002FH100)，或使用多卡\u002F卸载模式。\n*   **网络**: 访问 Hugging Face 或 ModelScope 下载模型（国内用户推荐使用 ModelScope 加速）。\n\n## 2. 安装步骤\n\n### 2.1 克隆项目\n首先从 GitHub 克隆仓库并进入目录：\n\n```sh\ngit clone https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2.git\ncd Wan2.2\n```\n\n### 2.2 安装依赖\n安装基础依赖包。如果 `flash_attn` 安装失败，建议先安装其他包，最后单独安装它。\n\n```sh\n# 安装核心依赖\npip install -r requirements.txt\n\n# (可选) 如果需要体验语音生视频 (S2V) 功能，需额外安装 CosyVoice 相关依赖\npip install -r requirements_s2v.txt\n```\n\n
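### 2.3 环境自检（可选）\n\n依赖安装完成后，可以用下面的小脚本粗略确认环境是否满足上文要求。检查项基于本文给出的版本与显存建议，仅作示意，并非官方脚本：\n\n```python\n# 环境自检示例：确认 PyTorch 版本与 CUDA GPU 可用性（检查项基于上文要求，仅供参考）\nimport torch\n\nprint(\"PyTorch 版本:\", torch.__version__)\nmajor, minor = (int(x) for x in torch.__version__.split(\"+\")[0].split(\".\")[:2])\nassert (major, minor) >= (2, 4), \"需要 torch >= 2.4.0\"\n\nif torch.cuda.is_available():\n    idx = torch.cuda.current_device()\n    total_gib = torch.cuda.get_device_properties(idx).total_memory >> 30  # 字节换算为 GiB\n    print(f\"GPU: {torch.cuda.get_device_name(idx)}，显存约 {total_gib} GiB\")\nelse:\n    print(\"未检测到可用的 CUDA GPU，建议在 NVIDIA GPU 环境下运行推理\")\n```\n\n## 3. 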
模型下载\n\n国内开发者强烈推荐使用 **ModelScope (魔搭社区)** 进行模型下载，速度更快且稳定。以下以 **Wan2.2-T2V-A14B** (文生视频模型) 为例：\n\n### 方式一：使用 ModelScope (推荐)\n\n```sh\n# 安装 modelscope 工具\npip install modelscope\n\n# 下载模型到本地目录 .\u002FWan2.2-T2V-A14B\nmodelscope download Wan-AI\u002FWan2.2-T2V-A14B --local_dir .\u002FWan2.2-T2V-A14B\n```\n\n### 方式二：使用 Hugging Face\n\n```sh\n# 安装 huggingface_hub 工具\npip install \"huggingface_hub[cli]\"\n\n# 下载模型\nhuggingface-cli download Wan-AI\u002FWan2.2-T2V-A14B --local-dir .\u002FWan2.2-T2V-A14B\n```\n\n> **可用模型列表**:\n> *   `Wan2.2-T2V-A14B`: 文生视频 (MoE 架构，支持 480P\u002F720P)\n> *   `Wan2.2-I2V-A14B`: 图生视频 (MoE 架构，支持 480P\u002F720P)\n> *   `Wan2.2-TI2V-5B`: 图文生视频 (高压缩 VAE，支持 720P@24fps，适合消费级显卡)\n> *   `Wan2.2-S2V-14B`: 语音生视频\n> *   `Wan2.2-Animate-14B`: 角色动画与替换\n\n## 4. 基本使用\n\n以下演示如何使用 **单张 GPU** 运行文生视频任务。\n\n### 示例：生成一段打斗场景视频\n\n此命令使用 `Wan2.2-T2V-A14B` 模型生成 720P 分辨率的视频。为了适应不同显存大小的显卡，命令中开启了模型卸载 (`--offload_model`) 和精度转换 (`--convert_model_dtype`) 选项。\n\n```sh\npython generate.py \\\n  --task t2v-A14B \\\n  --size 1280*720 \\\n  --ckpt_dir .\u002FWan2.2-T2V-A14B \\\n  --offload_model True \\\n  --convert_model_dtype \\\n  --prompt \"Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.\"\n```\n\n**参数说明：**\n*   `--task`: 指定任务类型，如 `t2v-A14B` (文生视频), `i2v-A14B` (图生视频), `ti2v-5B` (图文生视频)。\n*   `--size`: 输出分辨率，格式为 `宽*高` (例如 `1280*720`)。\n*   `--ckpt_dir`: 本地模型权重存放路径。\n*   `--offload_model True`: 启用模型卸载，将部分层移至 CPU 以节省显存（显存不足时必选）。\n*   `--convert_model_dtype`: 自动转换数据类型以优化显存占用。\n*   `--prompt`: 视频生成的提示词（英文效果更佳）。\n\n> **提示**: 如果遇到显存溢出 (OOM) 错误，请确保添加了 `--offload_model True` 和 `--convert_model_dtype` 参数。对于 5B 模型，RTX 4090 通常可直接流畅运行 720P 生成。","一家独立游戏工作室正在为新品宣传片制作一段主角在雨夜街头奔跑的高清过场动画，要求画面具有电影级光影且动作流畅自然。\n\n### 没有 Wan2.2 时\n- **画质与成本难以兼得**：生成 720P 高清视频通常需要昂贵的企业级显卡集群，普通消费级显卡（如 RTX 4090）无法运行或显存爆满，导致渲染成本极高。\n- **动作僵硬不自然**：旧模型在处理“雨中奔跑”这种复杂动态时，人物肢体容易扭曲变形，缺乏真实的物理惯性，看起来像“纸片人”在滑动。\n- **光影缺乏电影感**：生成的视频光线平淡，无法精准控制雨夜的霓虹反射、对比度和色调，后期需要花费大量时间手动调色才能达到预期效果。\n- **迭代效率低下**：由于推理速度慢且失败率高，团队一天只能尝试寥寥几次提示词调整，严重拖慢了创意验证和成片产出的进度。\n\n### 使用 Wan2.2 后\n- **平民硬件跑通高清**：凭借高效的混合专家（MoE）架构和先进的 VAE 压缩技术，Wan2.2 能在单张 RTX 4090 上流畅生成 720P@24fps 的视频，大幅降低了硬件门槛。\n- **复杂运动逼真还原**：基于扩大 83.2% 的视频数据训练，Wan2.2 精准捕捉了人物奔跑时的肌肉起伏和雨水飞溅的细节，动作连贯且符合物理规律。\n- **原生电影级美学**：利用其内置的美学标签控制，团队直接通过提示词即可锁定“赛博朋克雨夜”的特定布光与色调，直出画面即具备大片质感。\n- **快速创意迭代**：作为目前最快的开源模型之一，Wan2.2 将单次生成时间显著缩短，让团队能在半天内完成数十种不同镜头语言的测试与优选。\n\nWan2.2 通过突破性的架构优化，让中小团队也能以低成本在消费级显卡上高效产出具有电影级质感和复杂动态的高清视频内容。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWan-Video_Wan2.2_87f80deb.png","Wan-Video","Wan","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FWan-Video_a6c38a68.png","Alibaba Cloud's Large-scale Generative Models.",null,"wan.ai@alibabacloud.com","Alibaba_Wan","https:\u002F\u002Fwan.video","https:\u002F\u002Fgithub.com\u002FWan-Video",[86,90,94],{"name":87,"color":88,"percentage":89},"Python","#3572A5",99.2,{"name":91,"color":92,"percentage":93},"Shell","#89e051",0.8,{"name":95,"color":96,"percentage":97},"Makefile","#427819",0,15057,1830,"2026-04-05T09:44:52","Apache-2.0","未说明","必需 NVIDIA GPU。运行 5B 模型推荐消费级显卡（如 RTX 4090）；运行 14B 模型单卡推理至少需要 80GB 显存。若显存不足，可使用 --offload_model True、--convert_model_dtype 和 --t5_cpu 选项降低显存占用。",{"notes":105,"python":102,"dependencies":106},"安装时若 flash_attn 失败，建议先安装其他依赖最后单独安装它。若需使用语音生成视频功能，需额外安装 requirements_s2v.txt（包含 CosyVoice）。TI2V-5B 模型支持在消费级显卡上运行 720P@24fps 视频生成。",[107,108,109,110],"torch>=2.4.0","flash_attn","huggingface_hub","modelscope 
(可选)",[15],[113,114],"aigc","video-generation",12,"2026-03-27T02:49:30.150509","2026-04-06T05:27:14.076869",[119,124,129,134,139,144],{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},6066,"使用 diffusers 生成的视频质量为何比官方 generate.py 脚本差？","这通常是因为依赖库版本不匹配或缺少特定参数配置。请确保安装以下特定版本的库以获得最佳效果：\n- diffusers >= 0.35.1 (建议直接从源码安装最新开发版)\n- transformers >= 4.49.0\n- torch >= 2.6.0\n- accelerate >= 1.1.1\n\n安装命令示例：\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\n\n此外，检查是否缺少针对边界、高噪声和低噪声阶段的 guide_scale 配置，官方脚本中通常包含这些微调参数。","https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2\u002Fissues\u002F69",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},6067,"运行时报错 'TypeError: NoneType object is not subscriptable' 或 'libpng error: bad parameters to zlib' 如何解决？","该错误通常是因为 `cv2.imread` 无法正确读取某些图片格式（如带有特殊元数据的 PNG），返回了 None。\n\n解决方案是将读取图片的代码从 OpenCV 改为 PIL：\n\n原代码：\nrefer_images = cv2.imread(src_ref_path)[..., ::-1]\n\n修改为：\nfrom PIL import Image\nimport numpy as np\n\npil_img = Image.open(src_ref_path).convert(\"RGB\")  # 确保转换为 RGB 模式\nrefer_image = np.array(pil_img)\nrefer_images = refer_image[..., ::-1]  # 如果需要 BGR 格式再转换，否则直接使用 RGB","https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2\u002Fissues\u002F243",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},6068,"加载模型时报错 'ValueError: Cannot load ... expected shape ..., but got ...' 怎么办？","这是因为当前 PyPI 上的 diffusers 稳定版（如 0.34.0）尚未支持 Wan2.2 的最新模型架构。\n\n解决方法是安装 diffusers 的开发版（0.35.0.dev 或更高）：\n\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\n\n安装完成后重启环境即可正常加载模型。","https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2\u002Fissues\u002F45",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},6069,"使用自定义图片和驱动视频进行 Animate 任务时，报 'shape mismatch' 错误如何修复？","这通常是由于多 GPU 推理时 Tensor 切分不整除，导致后续 Self-Attention 计算维度不匹配。\n\n有两种解决方案：\n1. **调整帧数**：将 `clip_len` 或 `frame_num` 设置为能被 GPU 数量或并行策略整除的数值（例如尝试改为 81 帧）。\n2. **代码修复**：修改 `wan\u002Fmodules\u002Fanimate\u002Fmodel_animate.py`，在 Self-Attention 前去除填充的 zero tensor，确保 `rearrange` 操作前的形状正确。具体需检查 `self.sp_size` 相关的逻辑。","https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2\u002Fissues\u002F178",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},6070,"Wan Animate 如何实现最佳的人物一致性和生成效果？","根据社区测试，不同平台的效果排序如下：\n1. **官方在线入口**（每天赠送积分的版本）：效果最好，一致性最高。\n2. **本地部署工作流**：效果次之，但可以通过精细控制参数（如参考图尺寸、Clip Vision 设置、蒙版绘制、提示词优化）来接近官方效果。\n3. **HuggingFace Space**：效果相对最弱。\n\n建议在本地调试时，重点关注 face、pose、background 和 mask 的参数配合，并尽量使用高分辨率的参考图片。","https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2\u002Fissues\u002F161",{"id":145,"question_zh":146,"answer_zh":147,"source_url":128},6071,"生成 1 分钟视频耗时过长（如 20 帧需 7 分钟），是否有硬件或配置建议？","生成速度高度依赖硬件配置。目前反馈较快的配置包括单张 A800 或 H200 显卡。\n\n如果速度过慢，请检查：\n1. 是否开启了不必要的预处理步骤（如 `--use_flux` 或 `--retarget_flag` 会增加耗时）。\n2. 显存是否充足，避免频繁的数据交换。\n3. 尝试减少 `num_inference_steps` 或使用更高效的调度器（如 UniPCMultistepScheduler）。\n\n示例预处理命令（按需精简参数）：\npython .\u002Fwan\u002Fmodules\u002Fanimate\u002Fpreprocess\u002Fpreprocess_data.py --ckpt_path [...] --video_path [...] --resolution_area 1280 720",[]]