[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Francis-Rings--StableAvatar":3,"tool-Francis-Rings--StableAvatar":65},[4,18,32,40,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,3,"2026-04-06T03:28:53",[13,14,15,16],"开发框架","图像","Agent","视频","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":24,"last_commit_at":25,"category_tags":26,"status":17},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,2,"2026-04-10T11:13:16",[14,27,16,28,15,29,30,13,31],"数据工具","插件","其他","语言模型","音频",{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":10,"last_commit_at":38,"category_tags":39,"status":17},3833,"MoneyPrinterTurbo","harry0703\u002FMoneyPrinterTurbo","MoneyPrinterTurbo 是一款利用 AI 大模型技术，帮助用户一键生成高清短视频的开源工具。只需输入一个视频主题或关键词，它就能全自动完成从文案创作、素材匹配、字幕合成到背景音乐搭配的全过程，最终输出完整的竖屏或横屏短视频。\n\n这款工具主要解决了传统视频制作流程繁琐、门槛高以及素材版权复杂等痛点。无论是需要快速产出内容的自媒体创作者，还是希望尝试视频生成的普通用户，无需具备专业的剪辑技能或昂贵的硬件配置（普通电脑即可运行），都能轻松上手。同时，其清晰的 MVC 架构和对多种主流大模型（如 DeepSeek、Moonshot、通义千问等）的广泛支持，也使其成为开发者进行二次开发或技术研究的理想底座。\n\nMoneyPrinterTurbo 的独特亮点在于其高度的灵活性与本地化友好性。它不仅支持中英文双语及多种语音合成，允许用户精细调整字幕样式和画面比例，还特别优化了国内网络环境下的模型接入方案，让用户无需依赖 VPN 即可使用高性能国产大模型。此外，工具提供批量生成模式，可一次性产出多个版本供用户择优，极大地提升了内容创作的效率与质量。",54991,"2026-04-05T12:23:02",[13,30,15,16,14],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":24,"last_commit_at":46,"category_tags":47,"status":17},2179,"oh-my-openagent","code-yeongyu\u002Foh-my-openagent","oh-my-openagent（简称 omo）是一款强大的开源智能体编排框架，前身名为 oh-my-opencode。它致力于打破单一模型供应商的生态壁垒，解决开发者在构建 AI 应用时面临的“厂商锁定”难题。不同于仅依赖特定模型的封闭方案，omo 倡导开放市场理念，支持灵活调度多种主流大模型：利用 Claude、Kimi 或 GLM 进行任务编排，调用 GPT 处理复杂推理，借助 Minimax 提升响应速度，或发挥 Gemini 的创意优势。\n\n这款工具特别适合希望摆脱平台限制、追求极致性能与成本平衡的开发者及研究人员使用。通过统一接口，用户可以轻松组合不同模型的长处，构建更高效、更具适应性的智能体系统。其独特的技术亮点在于“全模型兼容”架构，让用户不再受制于某一家公司的策略变动或定价调整，真正实现对前沿模型资源的自由驾驭。无论是构建自动化编码助手，还是开发多步骤任务处理流程，oh-my-openagent 都能提供灵活且稳健的基础设施支持，助力用户在快速演进的 AI 生态中保持技术主动权。",51450,"2026-04-14T11:40:14",[16,30,13,14,15],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":10,"last_commit_at":54,"category_tags":55,"status":17},5295,"tabby","TabbyML\u002Ftabby","Tabby 是一款可私有化部署的开源 AI 编程助手，旨在为开发团队提供 GitHub Copilot 的安全替代方案。它核心解决了代码辅助过程中的数据隐私顾虑与云端依赖问题，让企业能够在完全掌控数据的前提下享受智能代码补全、聊天问答及上下文理解带来的效率提升。\n\n这款工具特别适合注重代码安全的企业开发团队、希望本地化运行大模型的科研机构，以及拥有消费级显卡的个人开发者。Tabby 的最大亮点在于其“开箱即用”的自包含架构，无需配置复杂的数据库或依赖云服务即可快速启动。同时，它对硬件十分友好，支持在普通的消费级 GPU 上流畅运行，大幅降低了部署门槛。此外，Tabby 提供了标准的 OpenAPI 接口，能轻松集成到现有的云 IDE 或内部开发流程中，并支持通过 REST API 
接入自定义文档以增强知识上下文。从代码自动补全到基于 Git 仓库的智能问答，Tabby 致力于成为开发者身边懂业务、守安全的智能伙伴。",33308,"2026-04-07T20:23:18",[13,30,15,14,16],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":17},6525,"generative-models","Stability-AI\u002Fgenerative-models","Generative Models 是 Stability AI 推出的开源项目，核心亮点在于最新发布的 Stable Video 4D 2.0（SV4D 2.0）。这是一个先进的视频转 4D 扩散模型，旨在解决从单一视角视频中生成高保真、多视角动态 3D 资产的技术难题。传统方法往往难以处理物体自遮挡或背景杂乱的情况，且生成的动态细节容易模糊，而 SV4D 2.0 通过改进的架构，显著提升了运动中的画面锐度与时空一致性，无需依赖额外的多视角参考图即可稳健地合成新颖视角的视频。\n\n该项目特别适合计算机视觉研究人员、AI 开发者以及从事 3D 内容创作的设计师使用。对于研究者，它提供了探索 4D 生成前沿的完整代码与训练权重；对于开发者，其支持自动回归生成长视频及低显存优化选项，便于集成与调试；对于设计师，它能将简单的物体运动视频快速转化为可用于游戏或影视的多视角 4D 素材。技术层面，SV4D 2.0 支持一次性生成 12 帧视频对应 4 个相机视角（或 5 帧对应 8 视角），分辨率达 576x576，并能更好地泛化至真实世界场景。用户只需准备一段白底或经简单抠图处理的物体运动视频，",27078,4,"2026-04-10T22:08:34",[16,29],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":81,"owner_twitter":82,"owner_website":81,"owner_url":83,"languages":84,"stars":93,"forks":94,"last_commit_at":95,"license":96,"difficulty_score":10,"env_os":97,"env_gpu":98,"env_ram":99,"env_deps":100,"category_tags":112,"github_topics":113,"view_count":24,"oss_zip_url":81,"oss_zip_packed_at":81,"status":17,"created_at":117,"updated_at":118,"faqs":119,"releases":149},7618,"Francis-Rings\u002FStableAvatar","StableAvatar","We present StableAvatar, the first end-to-end video diffusion transformer, which synthesizes infinite-length high-quality audio-driven avatar videos without any post-processing, conditioned on a reference image and audio.","StableAvatar 是一款革命性的开源 AI 工具，专注于生成由音频驱动的虚拟人视频。只需提供一张参考人物图片和一段音频，它就能端到端地合成高质量、无限时长的说话视频，且无需任何后期处理步骤。\n\n长期以来，现有的数字人生成模型在制作长视频时，往往面临画面闪烁、人物身份特征丢失（如长相变化）等问题，通常不得不依赖额外的人脸修复或换脸工具进行繁琐的后处理。StableAvatar 完美解决了这一痛点，作为首个基于视频扩散 Transformer 架构的模型，它能够直接生成连贯流畅的长视频，同时严格保持人物身份的一致性，彻底摆脱了对 FaceFusion、GFP-GAN 等第三方后处理工具的依赖。\n\n这项技术的核心亮点在于其“无限时长”生成能力和卓越的保真度，确保了虚拟人在长时间说话过程中表情自然、口型精准且形象稳定。无论是从事多模态算法研究的研究人员、希望集成数字人功能的开发者，还是需要快速制作虚拟主播内容的设计师与普通创作者，StableAvatar 都提供了高效且高质量的解决方案。目前，该项目已开放代码、模型权重及在线演示，欢迎各界人士体验与探索。","# StableAvatar\n\n\u003Ca href='https:\u002F\u002Ffrancis-rings.github.io\u002FStableAvatar'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-Green'>\u003C\u002Fa> \u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.08248'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-Arxiv-red'>\u003C\u002Fa> \u003Ca href='https:\u002F\u002Fhuggingface.co\u002FFrancisRing\u002FStableAvatar\u002Ftree\u002Fmain'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-Model-orange'>\u003C\u002Fa> \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FYinmingHuang\u002FStableAvatar'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Demo-blue'>\u003C\u002Fa> \u003Ca href='https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=6lhvmbzvv3Y'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FYouTube-Watch-red?style=flat-square&logo=youtube'>\u003C\u002Fa> \u003Ca href='https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1hUt9z4EoQ'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBilibili-Watch-blue?style=flat-square&logo=bilibili'>\u003C\u002Fa> \n\nStableAvatar: Infinite-Length Audio-Driven Avatar 
Video Generation\n\u003Cbr\u002F>\nShuyuan Tu\u003Csup>1\u003C\u002Fsup>, Yueming Pan\u003Csup>3\u003C\u002Fsup>, Yinming Huang\u003Csup>1\u003C\u002Fsup>, Xintong Han\u003Csup>4\u003C\u002Fsup>, Zhen Xing\u003Csup>1\u003C\u002Fsup>, Qi Dai\u003Csup>2\u003C\u002Fsup>, Chong Luo\u003Csup>2\u003C\u002Fsup>, Zuxuan Wu\u003Csup>1\u003C\u002Fsup>, Yu-Gang Jiang\u003Csup>1\u003C\u002Fsup>\n\u003Cbr\u002F>\n[\u003Csup>1\u003C\u002Fsup>Fudan University; \u003Csup>2\u003C\u002Fsup>Microsoft Research Asia; \u003Csup>3\u003C\u002Fsup>Xi'an Jiaotong University; \u003Csup>4\u003C\u002Fsup>Tencent Inc]\n\n\n\u003Ctable border=\"0\" style=\"width: 100%; text-align: left; margin-top: 20px;\">\n  \u003Ctr>\n      \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd7eca208-6a14-46af-b337-fb4d2b66ba8d\" width=\"320\" controls loop>\u003C\u002Fvideo>\n      \u003C\u002Ftd>\n      \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb15784b1-c013-4126-a764-10c844341a4e\" width=\"320\" controls loop>\u003C\u002Fvideo>\n      \u003C\u002Ftd>\n       \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F87faa5c1-a118-4a03-a071-45f18e87e6a0\" width=\"320\" controls loop>\u003C\u002Fvideo>\n     \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n      \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F531eb413-8993-4f8f-9804-e3c5ec5794d4\" width=\"320\" controls loop>\u003C\u002Fvideo>\n      \u003C\u002Ftd>\n      \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fcdc603e2-df46-4cf8-a14e-1575053f996f\" width=\"320\" controls loop>\u003C\u002Fvideo>\n      \u003C\u002Ftd>\n       \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F7022dc93-f705-46e5-b8fc-3a3fb755795c\" width=\"320\" controls loop>\u003C\u002Fvideo>\n     \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n      \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F0ba059eb-ff6f-4d94-80e6-f758c613b737\" width=\"320\" controls loop>\u003C\u002Fvideo>\n      \u003C\u002Ftd>\n      \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F03e6c1df-85c6-448d-b40d-aacb8add4e45\" width=\"320\" controls loop>\u003C\u002Fvideo>\n      \u003C\u002Ftd>\n       \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F90b78154-dda0-4eaa-91fd-b5485b718a7f\" width=\"320\" controls loop>\u003C\u002Fvideo>\n     \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003Cp style=\"text-align: justify;\">\n  \u003Cspan>Audio-driven avatar videos generated by StableAvatar, showing its power to synthesize \u003Cb>infinite-length\u003C\u002Fb> and \u003Cb>ID-preserving videos\u003C\u002Fb>. 
All videos are \u003Cb>directly synthesized by StableAvatar without the use of any face-related post-processing tools\u003C\u002Fb>, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer.\u003C\u002Fspan>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F90691318-311e-40b9-9bd9-62db83ab1492\" width=\"768\" autoplay loop muted playsinline>\u003C\u002Fvideo>\n  \u003Cbr\u002F>\n  \u003Cspan>Comparison results between StableAvatar and state-of-the-art (SOTA) audio-driven avatar video generation models highlight the superior performance of StableAvatar in delivering \u003Cb>infinite-length, high-fidelity, identity-preserving avatar animation\u003C\u002Fb>.\u003C\u002Fspan>\n\u003C\u002Fp>\n\n\n## Overview\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFrancis-Rings_StableAvatar_readme_4f307331d0b1.jpg\" alt=\"model architecture\" width=\"1280\"\u002F>\n  \u003C\u002Fbr>\n  \u003Ci>The overview of the framework of StableAvatar.\u003C\u002Fi>\n\u003C\u002Fp>\n\nCurrent diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. \nWe observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, leading the latent distribution of subsequent segments to drift away from the optimal distribution gradually.\nTo address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism to further enhance the audio synchronization by leveraging the diffusion’s own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of the infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latent over time. Experiments on benchmarks show the effectiveness of StableAvatar both qualitatively and quantitatively. \n\n## News\n* `[2025-10-12]`: 🔥 We fixed the inconsistency between **multi-GPU** and **single-GPU** inference. The patch has been merged—please update to the latest version and give it a try! 🙌\n* `[2025-9-8]`:🔥  We are thrilled to release an interesting 🔥🔥 brand new demo 🔥🔥! The generated videos can be seen on [YouTube](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=GH4hrxIis3Q) and [Bilibili](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1jGYPzqEux).\n* `[2025-8-29]`:🔥 StableAvatar public demo is now live on [Hugging Face Spaces](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FYinmingHuang\u002FStableAvatar). 
(Note: due to the long video generation time, the demo is currently accessible to \u003Cb>Hugging Face Pro\u003C\u002Fb> users only.)\n* `[2025-8-18]`:🔥 StableAvatar can run on [ComfyUI](https:\u002F\u002Fgithub.com\u002Fsmthemex\u002FComfyUI_StableAvatar) in just 10 steps, making it 3x faster. Thanks @[smthemex](https:\u002F\u002Fgithub.com\u002Fsmthemex) for the contribution.\n* `[2025-8-16]`:🔥 We release the finetuning codes and lora training\u002Ffinetuning codes! Other codes will be public as soon as possible. Stay tuned!\n* `[2025-8-15]`:🔥 StableAvatar can run on Gradio Interface. Thanks @[gluttony-10](https:\u002F\u002Fspace.bilibili.com\u002F893892) for the contribution!\n* `[2025-8-15]`:🔥 StableAvatar can run on [ComfyUI](https:\u002F\u002Fgithub.com\u002Fsmthemex\u002FComfyUI_StableAvatar). Thanks @[smthemex](https:\u002F\u002Fgithub.com\u002Fsmthemex) for the contribution.\n* `[2025-8-13]`:🔥 Added changes to run StableAvatar on the new Blackwell series Nvidia chips, including the RTX 6000 Pro.\n* `[2025-8-11]`:🔥 The project page, code, technical report and [a basic model checkpoint](https:\u002F\u002Fhuggingface.co\u002FFrancisRing\u002FStableAvatar\u002Ftree\u002Fmain) are released. Further lora training codes, the evaluation dataset and StableAvatar-pro will be released very soon. Stay tuned!\n\n## 🛠️ To-Do List\n- [x] StableAvatar-1.3B-basic\n- [x] Inference Code\n- [x] Data Pre-Processing Code (Audio Extraction)\n- [x] Data Pre-Processing Code (Vocal Separation)\n- [x] Training Code\n- [x] Full Finetuning Code\n- [x] Lora Training Code\n- [x] Lora Finetuning Code\n- [ ] Inference Code with Audio Native Guidance\n- [ ] StableAvatar-pro\n\n## 🔑 Quickstart\n\nFor the basic version of the model checkpoint (Wan2.1-1.3B-based), it supports generating \u003Cb>infinite-length videos at a 480x832 or 832x480 or 512x512 resolution\u003C\u002Fb>. 
If you encounter insufficient memory issues, you can reduce the number of animated frames or the resolution of the output.\n\n### 🧱 Environment setup\n\n```\npip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.1.1 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124\npip install -r requirements.txt\n# Optionally install flash_attn to accelerate attention computation\npip install flash_attn\n```\n\n### 🧱 Environment setup for Blackwell series chips\n\n```\npip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128\npip install -r requirements.txt\n# Optionally install flash_attn to accelerate attention computation\npip install flash_attn\n```\n\n### 🧱 Download weights\nIf you encounter connection issues with Hugging Face, you can utilize the mirror endpoint by setting the environment variable: `export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com`.\nPlease download weights manually as follows:\n```\npip install \"huggingface_hub[cli]\"\ncd StableAvatar\nmkdir checkpoints\nhuggingface-cli download FrancisRing\u002FStableAvatar --local-dir .\u002Fcheckpoints\n```\nAll the weights should be organized as shown below.\nThe overall file structure of this project should be organized as follows:\n```\nStableAvatar\u002F\n├── accelerate_config\n├── deepspeed_config\n├── examples\n├── wan\n├── checkpoints\n│   ├── Kim_Vocal_2.onnx\n│   ├── wav2vec2-base-960h\n│   ├── Wan2.1-Fun-V1.1-1.3B-InP\n│   └── StableAvatar-1.3B\n├── inference.py\n├── inference.sh\n├── train_1B_square.py\n├── train_1B_square.sh\n├── train_1B_vec_rec.py\n├── train_1B_vec_rec.sh\n├── audio_extractor.py\n├── vocal_seperator.py\n├── requirement.txt \n```\n\n### 🧱 Audio Extraction\nGiven the target video file (.mp4), you can use the following command to obtain the corresponding audio file (.wav):\n```\npython audio_extractor.py --video_path=\"path\u002Ftest\u002Fvideo.mp4\" --saved_audio_path=\"path\u002Ftest\u002Faudio.wav\"\n```\n\n### 🧱 Vocal Separation\nAs noisy background music may negatively impact the performance of StableAvatar to some extent, you can further separate the vocals from the audio file for better lip synchronization.\nGiven the path to an audio file (.wav), you can run the following command to extract the corresponding vocal signals:\n```\npip install audio-separator[gpu]\npython vocal_seperator.py --audio_separator_model_file=\"path\u002FStableAvatar\u002Fcheckpoints\u002FKim_Vocal_2.onnx\" --audio_file_path=\"path\u002Ftest\u002Faudio.wav\" --saved_vocal_path=\"path\u002Ftest\u002Fvocal.wav\"\n```\n\n### 🧱 Base Model Inference\nA sample configuration for testing is provided as `inference.sh`. You can easily modify the various configurations according to your needs.\n\n```\nbash inference.sh\n```\nWan2.1-1.3B-based StableAvatar supports audio-driven avatar video generation at three different resolution settings: 512x512, 480x832, and 832x480. You can modify \"--width\" and \"--height\" in `inference.sh` to set the resolution of the animation. \"--output_dir\" in `inference.sh` refers to the saved path of the generated animation. \"--validation_reference_path\", \"--validation_driven_audio_path\", and \"--validation_prompts\" in `inference.sh` refer to the path of the given reference image, the path of the given audio, and the text prompts, respectively.\nPrompts are also very important. 
It is recommended to follow the format `[Description of first frame]-[Description of human behavior]-[Description of background (optional)]`.\n\"--pretrained_model_name_or_path\", \"--pretrained_wav2vec_path\", and \"--transformer_path\" in `inference.sh` are the paths of pretrained Wan2.1-1.3B weights, pretrained Wav2Vec2.0 weights, and pretrained StableAvatar weights, respectively.\n\"--sample_steps\", \"--overlap_window_length\", and \"--clip_sample_n_frames\" refer to the total number of inference steps, the overlapping context length between two context windows, and the synthesized frame number in a batch\u002Fcontext window, respectively. \nNotably, the recommended `--sample_steps` range is [30-50]; more steps bring higher quality. The recommended `--overlap_window_length` range is [5-15], as a longer overlapping length results in higher quality but slower inference speed.\n\"--sample_text_guide_scale\" and \"--sample_audio_guide_scale\" are the Classifier-Free Guidance scales of the text prompt and the audio. The recommended range for prompt and audio cfg is `[3-6]`. You can increase the audio cfg to improve lip synchronization with the audio.\n\nAdditionally, you can also run the following command to launch a Gradio interface:\n```\npython app.py\n```\n\nWe provide 6 cases in different resolution settings in `path\u002FStableAvatar\u002Fexamples` for validation. ❤️❤️Please feel free to try it out and enjoy the endless entertainment of infinite-length avatar video generation❤️❤️!\n\n#### 💡Tips\n- Wan2.1-1.3B-based StableAvatar weights have two versions: `transformer3d-square.pt` and `transformer3d-rec-vec.pt`, which are trained on two video datasets in two different resolution settings. Both versions support generating audio-driven avatar videos at three different resolution settings: 512x512, 480x832, and 832x480. You can modify `--transformer_path` in `inference.sh` to switch between these two versions.\n\n- If you have limited GPU resources, you can change the loading mode of StableAvatar by modifying \"--GPU_memory_mode\" in `inference.sh`. The options of \"--GPU_memory_mode\" are `model_full_load`, `sequential_cpu_offload`, `model_cpu_offload_and_qfloat8`, and `model_cpu_offload`. In particular, when you set `--GPU_memory_mode` to `sequential_cpu_offload`, the total GPU memory consumption is approximately 3GB, at the cost of slower inference speed.\nSetting `--GPU_memory_mode` to `model_cpu_offload` can significantly cut GPU memory usage, reducing it by roughly half compared to `model_full_load` mode.\n\n- If you have multiple GPUs, you can run Multi-GPU inference to speed up generation by modifying \"--ulysses_degree\" and \"--ring_degree\" in `inference.sh`. For example, if you have 8 GPUs, you can set `--ulysses_degree=4` and `--ring_degree=2`. Notably, you have to ensure that ulysses_degree*ring_degree equals the total GPU number (world size). Moreover, you can also add `--fsdp_dit` in `inference.sh` to activate FSDP in DiT to further reduce GPU memory consumption.\nYou can run the following command:\n```\nbash multiple_gpu_inference.sh\n```\nIn my setting, 4 GPUs are utilized for inference.\n\nThe video synthesized by StableAvatar does not contain audio. 
If you want to obtain the high quality MP4 file with audio, we recommend you to leverage ffmpeg on the \u003Cb>output_path\u003C\u002Fb> as follows:\n```\nffmpeg -i video_without_audio.mp4 -i \u002Fpath\u002Faudio.wav -c:v copy -c:a aac -shortest \u002Fpath\u002Foutput_with_audio.mp4\n```\n\n### 🧱 Model Training\n\u003Cb>🔥🔥It’s worth noting that if you’re looking to train a conditioned Video Diffusion Transformer (DiT) model, such as Wan2.1, this training tutorial will also be helpful.🔥🔥\u003C\u002Fb>\nFor the training dataset, it has to be organized as follows:\n\n```\ntalking_face_data\u002F\n├── rec\n│   │  ├──speech\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  │  ├──frame_0.png\n│   │  │  │  │  ├──frame_1.png\n│   │  │  │  │  ├──frame_2.png\n│   │  │  │  │  ├──...\n│   │  │  │  ├──face_masks\n│   │  │  │  │  ├──frame_0.png\n│   │  │  │  │  ├──frame_1.png\n│   │  │  │  │  ├──frame_2.png\n│   │  │  │  │  ├──...\n│   │  │  │  ├──lip_masks\n│   │  │  │  │  ├──frame_0.png\n│   │  │  │  │  ├──frame_1.png\n│   │  │  │  │  ├──frame_2.png\n│   │  │  │  │  ├──...\n│   │  │  ├──00002\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n│   │  ├──singing\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n│   │  ├──dancing\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n├── vec\n│   │  ├──speech\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n│   │  ├──singing\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n│   │  ├──dancing\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n├── square\n│   │  ├──speech\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n│   │  ├──singing\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n│   │  ├──dancing\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n├── video_rec_path.txt\n├── video_square_path.txt\n└── video_vec_path.txt\n```\nStableAvatar is trained on mixed-resolution videos, with 512x512 videos stored in `talking_face_data\u002Fsquare`, 480x832 videos stored in `talking_face_data\u002Fvec`, and 832x480 videos stored in `talking_face_data\u002Frec`. Each folder in `talking_face_data\u002Fsquare` or `talking_face_data\u002Frec` or `talking_face_data\u002Fvec` contains three subfolders which contains different types of videos (speech, singing, and dancing). 
\nAll `.png` image files are named in the format `frame_i.png`, such as `frame_0.png`, `frame_1.png`, and so on.\n`00001`, `00002`, `00003` indicate individual video information.\nIn terms of three subfolders, `images`, `face_masks`, and `lip_masks` store RGB frames, corresponding human face masks, and corresponding human lip masks, respectively.\n`sub_clip.mp4` and `audio.wav` refer to the corresponding RGB video of `images` and the corresponding audio file.\n`video_square_path.txt`, `video_rec_path.txt`, and `video_vec_path.txt` record folder paths of `talking_face_data\u002Fsquare`, `talking_face_data\u002Frec`, and `talking_face_data\u002Fvec`, respectively.\nFor example, the content of `video_rec_path.txt` is shown as follows:\n```\npath\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fspeech\u002F00001\npath\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fspeech\u002F00002\n...\npath\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fsinging\u002F00003\npath\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fsinging\u002F00004\n...\npath\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fdancing\u002F00005\npath\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fdancing\u002F00006\n...\n```\nIf you only have raw videos, you can leverage `ffmpeg` to extract frames from raw videos (speech) and store them in the subfolder `images`.\n```\nffmpeg -i raw_video_1.mp4 -q:v 1 -start_number 0 path\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fspeech\u002F00001\u002Fimages\u002Fframe_%d.png\n```\nThe obtained frames are saved in `path\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fspeech\u002F00001\u002Fimages`.\n\nFor extracting the human face masks, please refer to [StableAnimator repo](https:\u002F\u002Fgithub.com\u002FFrancis-Rings\u002FStableAnimator). The Human Face Mask Extraction section in the tutorial provides off-the-shelf codes.\n\nFor extracting the human lip masks, you can run the following command:\n```\npip install mediapipe\npython lip_mask_extractor.py --folder_root=\"path\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fsinging\" --start=1 --end=500\n```\n`--folder_root` refers to the root path of training datasets.\n`--start` and `--end`  specify the starting and ending indices of the selected training dataset. For example, `--start=1 --end=500` indicates that the human lip extraction will start at `path\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fsinging\u002F00001` and end at `path\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fsinging\u002F00500`.\n\nFor extraction details of corresponding audio, please refer to the Audio Extraction section.\nWhen your dataset is organized exactly as outlined above, you can easily train your Wan2.1-1.3B-based StableAvatar by running the following command:\n```\n# Training StableAvatar on a single resolution setting (512x512) in a single machine\nbash train_1B_square.sh\n# Training StableAvatar on a single resolution setting (512x512) in multiple machines\nbash train_1B_square_64.sh\n# Training StableAvatar on a mixed resolution setting (480x832 and 832x480) in a single machine\nbash train_1B_rec_vec.sh\n# Training StableAvatar on a mixed resolution setting (480x832 and 832x480) in multiple machines\nbash train_1B_rec_vec_64.sh\n```\nFor the parameter details of `train_1B_square.sh` and `train_1B_rec_vec.sh`, `CUDA_VISIBLE_DEVICES` refers to gpu devices. 
In my setting, I use 4 NVIDIA A100 80G to train StableAvatar (`CUDA_VISIBLE_DEVICES=3,2,1,0`).\n`--pretrained_model_name_or_path`, `--pretrained_wav2vec_path`, and `--output_dir` refer to the pretrained Wan2.1-1.3B path, pretrained Wav2Vec2.0 path, and the checkpoint saved path of the trained StableAvatar.\n`--train_data_square_dir`, `--train_data_rec_dir`, and `--train_data_vec_dir` are the paths of `video_square_path.txt`, `video_rec_path.txt`, and `video_vec_path.txt`, respectively.\n`--validation_reference_path` and `--validation_driven_audio_path` are paths of the validation reference image and the validation driven audio.\n`--video_sample_n_frames` is the number of frames that StableAvatar processes in a single batch. \n`--num_train_epochs` is the training epoch number. It is worth noting that the default number of training epochs is set to infinite. You can manually terminate the training process once you observe that your StableAvatar has reached its peak performance.\nFor the parameter details of `train_1B_square_64.sh` and `train_1B_rec_vec_64.sh`, we set the GPU configuration in `path\u002FStableAvatar\u002Faccelerate_config\u002Faccelerate_config_machine_1B_multiple.yaml`. In my setting, the training setup consists of 8 nodes, each equipped with 8 NVIDIA A100 80GB GPUs, for training StableAvatar.\n\nThe overall file structure of StableAvatar at training is shown as follows:\n```\nStableAvatar\u002F\n├── accelerate_config\n├── deepspeed_config\n├── talking_face_data\n├── examples\n├── wan\n├── checkpoints\n│   ├── Kim_Vocal_2.onnx\n│   ├── wav2vec2-base-960h\n│   ├── Wan2.1-Fun-V1.1-1.3B-InP\n│   └── StableAvatar-1.3B\n├── inference.py\n├── inference.sh\n├── train_1B_square.py\n├── train_1B_square.sh\n├── train_1B_vec_rec.py\n├── train_1B_vec_rec.sh\n├── audio_extractor.py\n├── vocal_seperator.py\n├── requirement.txt \n```\n\u003Cb>It is worth noting that training StableAvatar requires approximately 50GB of VRAM due to the mixed-resolution (480x832 and 832x480) training pipeline. \nHowever, if you train StableAvatar exclusively on 512x512 videos, the VRAM requirement is reduced to approximately 40GB.\u003C\u002Fb>\nAdditionally, The backgrounds of the selected training videos should remain static, as this helps the diffusion model calculate accurate reconstruction loss.\nThe audio should be clear and free from excessive background noise.\n\nRegarding training Wan2.1-14B-based StableAvatar, you can run the following command:\n```\n# Training StableAvatar on a mixed resolution setting (480x832, 832x480, and 512x512) in multiple machines\nhuggingface-cli download Wan-AI\u002FWan2.1-I2V-14B-480P --local-dir .\u002Fcheckpoints\u002FWan2.1-I2V-14B-480P\nhuggingface-cli download Wan-AI\u002FWan2.1-I2V-14B-720P --local-dir .\u002Fcheckpoints\u002FWan2.1-I2V-14B-720P # Optional\nbash train_14B.sh\n```\nWe utilize deepspeed stage-2 to train Wan2.1-14B-based StableAvatar. The GPU configuration can be modified in `path\u002FStableAvatar\u002Faccelerate_config\u002Faccelerate_config_machine_14B_multiple.yaml`.\nThe deepspeed optimization configuration and deepspeed scheduler configuration are in `path\u002FStableAvatar\u002Fdeepspeed_config\u002Fzero_stage2_config.json`.\nNotably, we observe that Wan2.1-1.3B-based StableAvatar is already capable of synthesizing infinite-length high quality avatar videos. 
The Wan2.1-14B backbone significantly increases the inference latency and GPU memory consumption during training, indicating a limited performance-to-resource ratio.\n\nYou can also run the following commands to perform lora training:\n```\n# Training StableAvatar-1.3B on a mixed resolution setting (480x832 and 832x480) in a single machine\nbash train_1B_rec_vec_lora.sh\n# Training StableAvatar-1.3B on a mixed resolution setting (480x832 and 832x480) in multiple machines\nbash train_1B_rec_vec_lora_64.sh\n# Lora-Training StableAvatar-14B on a mixed resolution setting (480x832, 832x480, and 512x512) in multiple machines\nbash train_14B_lora.sh\n```\nYou can modify `--rank` and `--network_alpha` to control the quality of your lora training\u002Ffinetuning.\n\nIf you want to train 720P Wan2.1-1.3B-based or Wan2.1-14B-based StableAvatar, you can directly modify the height and width of the dataloader (480p-->720p) in `train_1B_square.py`\u002F`train_1B_vec_rec.py`\u002F`train_14B.py`.\n\n### 🧱 Model Finetuning\nRegarding fully finetuning StableAvatar, you can add `--transformer_path=\"path\u002FStableAvatar\u002Fcheckpoints\u002FStableAvatar-1.3B\u002Ftransformer3d-square.pt\"` to `train_1B_rec_vec.sh` or `train_1B_rec_vec_64.sh`:\n```\n# Finetuning StableAvatar on a mixed resolution setting (480x832 and 832x480) in a single machine\nbash train_1B_rec_vec.sh\n# Finetuning StableAvatar on a mixed resolution setting (480x832 and 832x480) in multiple machines\nbash train_1B_rec_vec_64.sh\n```\nFor lora finetuning of StableAvatar, you can add `--transformer_path=\"path\u002FStableAvatar\u002Fcheckpoints\u002FStableAvatar-1.3B\u002Ftransformer3d-square.pt\"` to `train_1B_rec_vec_lora.sh`:\n```\n# Lora-Finetuning StableAvatar-1.3B on a mixed resolution setting (480x832 and 832x480) in a single machine\nbash train_1B_rec_vec_lora.sh\n```\nYou can modify `--rank` and `--network_alpha` to control the quality of your lora training\u002Ffinetuning.\n\n### 🧱 VRAM Requirement and Runtime\n\nFor a 5s video (480x832, fps=25), the basic model (--GPU_memory_mode=\"model_full_load\") requires approximately 18GB of VRAM and finishes in 3 minutes on a 4090 GPU.\n\n\u003Cb>🔥🔥Theoretically, StableAvatar is capable of synthesizing hours of video without significant quality degradation; however, the 3D VAE decoder demands significant GPU memory, especially when decoding 10k+ frames. 
You have the option to run the VAE on CPU.🔥🔥\u003C\u002Fb>\n\n## Contact\nIf you have any suggestions or find our work helpful, feel free to contact me.\n\nEmail: francisshuyuan@gmail.com\n\nIf you find our work useful, \u003Cb>please consider giving a star ⭐ to this github repository and citing it ❤️\u003C\u002Fb>:\n```bib\n@article{tu2025stableavatar,\n  title={Stableavatar: Infinite-length audio-driven avatar video generation},\n  author={Tu, Shuyuan and Pan, Yueming and Huang, Yinming and Han, Xintong and Xing, Zhen and Dai, Qi and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},\n  journal={arXiv preprint arXiv:2508.08248},\n  year={2025}\n}\n```\n","# StableAvatar\n\n\u003Ca href='https:\u002F\u002Ffrancis-rings.github.io\u002FStableAvatar'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-Green'>\u003C\u002Fa> \u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.08248'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-Arxiv-red'>\u003C\u002Fa> \u003Ca href='https:\u002F\u002Fhuggingface.co\u002FFrancisRing\u002FStableAvatar\u002Ftree\u002Fmain'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-Model-orange'>\u003C\u002Fa> \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FYinmingHuang\u002FStableAvatar'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Demo-blue'>\u003C\u002Fa> \u003Ca href='https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=6lhvmbzvv3Y'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FYouTube-Watch-red?style=flat-square&logo=youtube'>\u003C\u002Fa> \u003Ca href='https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1hUt9z4EoQ'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBilibili-Watch-blue?style=flat-square&logo=bilibili'>\u003C\u002Fa> \n\nStableAvatar：无限时长音频驱动的虚拟形象视频生成\n\u003Cbr\u002F>\n涂书源\u003Csup>1\u003C\u002Fsup>、潘月明\u003Csup>3\u003C\u002Fsup>、黄寅明\u003Csup>1\u003C\u002Fsup>、韩欣彤\u003Csup>4\u003C\u002Fsup>、邢震\u003Csup>1\u003C\u002Fsup>、戴琪\u003Csup>2\u003C\u002Fsup>、罗冲\u003Csup>2\u003C\u002Fsup>、吴祖轩\u003Csup>1\u003C\u002Fsup>、蒋宇刚\u003Csup>1\u003C\u002Fsup>\n\u003Cbr\u002F>\n[\u003Csup>1\u003C\u002Fsup>复旦大学；\u003Csup>2\u003C\u002Fsup>微软亚洲研究院；\u003Csup>3\u003C\u002Fsup>西安交通大学；\u003Csup>4\u003C\u002Fsup>腾讯公司]\n\n\n\u003Ctable border=\"0\" style=\"width: 100%; text-align: left; margin-top: 20px;\">\n  \u003Ctr>\n      \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd7eca208-6a14-46af-b337-fb4d2b66ba8d\" width=\"320\" controls loop>\u003C\u002Fvideo>\n      \u003C\u002Ftd>\n      \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb15784b1-c013-4126-a764-10c844341a4e\" width=\"320\" controls loop>\u003C\u002Fvideo>\n      \u003C\u002Ftd>\n       \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F87faa5c1-a118-4a03-a071-45f18e87e6a0\" width=\"320\" controls loop>\u003C\u002Fvideo>\n     \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n      \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F531eb413-8993-4f8f-9804-e3c5ec5794d4\" width=\"320\" controls loop>\u003C\u002Fvideo>\n      \u003C\u002Ftd>\n      \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fcdc603e2-df46-4cf8-a14e-1575053f996f\" width=\"320\" controls 
loop>\u003C\u002Fvideo>\n      \u003C\u002Ftd>\n       \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F7022dc93-f705-46e5-b8fc-3a3fb755795c\" width=\"320\" controls loop>\u003C\u002Fvideo>\n     \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n      \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F0ba059eb-ff6f-4d94-80e6-f758c613b737\" width=\"320\" controls loop>\u003C\u002Fvideo>\n      \u003C\u002Ftd>\n      \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F03e6c1df-85c6-448d-b40d-aacb8add4e45\" width=\"320\" controls loop>\u003C\u002Fvideo>\n      \u003C\u002Ftd>\n       \u003Ctd>\n          \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F90b78154-dda0-4eaa-91fd-b5485b718a7f\" width=\"320\" controls loop>\u003C\u002Fvideo>\n     \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003Cp style=\"text-align: justify;\">\n  \u003Cspan>由StableAvatar生成的音频驱动虚拟形象视频，展示了其合成\u003Cb>无限时长\u003C\u002Fb>且\u003Cb>保持身份一致\u003C\u002Fb>视频的强大能力。所有视频均\u003Cb>由StableAvatar直接合成，未使用任何与人脸相关的后处理工具\u003C\u002Fb>,例如换脸工具FaceFusion或人脸修复模型GFP-GAN和CodeFormer。\u003C\u002Fspan>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F90691318-311e-40b9-9bd9-62db83ab1492\" width=\"768\" autoplay loop muted playsinline>\u003C\u002Fvideo>\n  \u003Cbr\u002F>\n  \u003Cspan>StableAvatar与当前最先进（SOTA）的音频驱动虚拟形象视频生成模型的对比结果凸显了StableAvatar在提供\u003Cb>无限时长、高保真、身份一致的虚拟形象动画\u003C\u002Fb>方面的卓越性能。\u003C\u002Fspan>\n\u003C\u002Fp>\n\n\n## 概述\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFrancis-Rings_StableAvatar_readme_4f307331d0b1.jpg\" alt=\"model architecture\" width=\"1280\"\u002F>\n  \u003C\u002Fbr>\n  \u003Ci>StableAvatar框架的概览。\u003C\u002Fi>\n\u003C\u002Fp>\n\n目前用于音频驱动虚拟形象视频生成的扩散模型难以合成具有自然音频同步和身份一致性的长视频。本文提出了StableAvatar，这是首个无需后处理即可合成无限时长高质量视频的端到端视频扩散Transformer。StableAvatar以参考图像和音频为条件，集成了定制化的训练和推理模块，从而实现无限时长视频的生成。\n我们观察到，现有模型无法生成长视频的主要原因在于其音频建模方式。这些模型通常依赖第三方现成的提取器来获取音频嵌入，然后通过交叉注意力直接将其注入扩散模型中。由于当前的扩散主干网络缺乏任何与音频相关的先验知识，这种方法会导致视频片段之间潜变量分布误差的严重累积，从而使后续片段的潜变量分布逐渐偏离最优分布。\n为此，StableAvatar引入了一种新颖的时步感知音频适配器，通过时步感知调制来防止误差累积。在推理过程中，我们提出了一种新的音频原生引导机制，利用扩散模型自身不断演进的联合音频-潜变量预测作为动态引导信号，进一步提升音频同步效果。为了增强无限时长视频的流畅性，我们还引入了一种动态加权滑动窗口策略，用于随时间融合潜变量。基准测试实验表明，StableAvatar在定性和定量两方面均表现出色。\n\n## 新闻\n* `[2025-10-12]`: 🔥 我们修复了 **多GPU** 和 **单GPU** 推理之间的不一致问题。补丁已合并——请更新到最新版本并试用！🙌\n* `[2025-9-8]`:🔥 我们很高兴发布一个有趣的 🔥🔥 全新演示 🔥🔥！生成的视频可以在 [YouTube](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=GH4hrxIis3Q) 和 [Bilibili](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1jGYPzqEux) 上观看。\n* `[2025-8-29]`:🔥 StableAvatar 公开演示现已在 [Hugging Face Spaces](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FYinmingHuang\u002FStableAvatar) 上线。（注意：由于视频生成时间较长，目前该演示仅对 \u003Cb>Hugging Face Pro\u003C\u002Fb> 用户开放。）\n* `[2025-8-18]`:🔥 StableAvatar 只需 10 步即可在 [ComfyUI](https:\u002F\u002Fgithub.com\u002Fsmthemex\u002FComfyUI_StableAvatar) 上运行，速度提升至原来的 3 倍。感谢 @[smthemex](https:\u002F\u002Fgithub.com\u002Fsmthemex) 的贡献。\n* `[2025-8-16]`:🔥 我们发布了微调代码以及 LoRA 训练\u002F微调代码！其他代码将尽快公开，请持续关注！\n* `[2025-8-15]`:🔥 StableAvatar 可以在 Gradio 界面上运行。感谢 @[gluttony-10](https:\u002F\u002Fspace.bilibili.com\u002F893892) 的贡献！\n* `[2025-8-15]`:🔥 StableAvatar 可以在 [ComfyUI](https:\u002F\u002Fgithub.com\u002Fsmthemex\u002FComfyUI_StableAvatar) 上运行。感谢 
@[smthemex](https:\u002F\u002Fgithub.com\u002Fsmthemex) 的贡献。\n* `[2025-8-13]`:🔥 添加了支持在全新 Blackwell 系列 Nvidia 芯片上运行 StableAvatar 的更改，包括 RTX 6000 Pro。\n* `[2025-8-11]`:🔥 项目页面、代码、技术报告以及 [基础模型检查点](https:\u002F\u002Fhuggingface.co\u002FFrancisRing\u002FStableAvatar\u002Ftree\u002Fmain) 已发布。进一步的 LoRA 训练代码、评估数据集和 StableAvatar-pro 将很快发布。敬请期待！\n\n## 🛠️ 待办事项\n- [x] StableAvatar-1.3B-basic\n- [x] 推理代码\n- [x] 数据预处理代码（音频提取）\n- [x] 数据预处理代码（人声分离）\n- [x] 训练代码\n- [x] 完整微调代码\n- [x] LoRA 训练代码\n- [x] LoRA 微调代码\n- [ ] 带有音频原生引导的推理代码\n- [ ] StableAvatar-pro\n\n## 🔑 快速入门\n\n对于基础版本的模型检查点（Wan2.1-1.3B 基础），它支持生成 \u003Cb>无限长度的视频，分辨率为 480×832 或 832×480 或 512×512\u003C\u002Fb>。如果遇到内存不足的问题，可以适当减少动画帧数或降低输出分辨率。\n\n### 🧱 环境搭建\n\n```\npip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.1.1 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124\npip install -r requirements.txt\n# 可选安装 flash_attn 以加速注意力计算\npip install flash_attn\n```\n\n### 🧱 Blackwell 系列芯片的环境搭建\n\n```\npip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128\npip install -r requirements.txt\n# 可选安装 flash_attn 以加速注意力计算\npip install flash_attn\n```\n\n### 🧱 下载权重\n如果遇到与 Hugging Face 连接问题，可以通过设置环境变量来使用镜像端点：`export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com`。\n请按以下步骤手动下载权重：\n```\npip install \"huggingface_hub[cli]\"\ncd StableAvatar\nmkdir checkpoints\nhuggingface-cli download FrancisRing\u002FStableAvatar --local-dir .\u002Fcheckpoints\n```\n所有权重应按如下方式组织在 models 文件夹中：\n该项目的整体文件结构应如下所示：\n```\nStableAvatar\u002F\n├── accelerate_config\n├── deepspeed_config\n├── examples\n├── wan\n├── checkpoints\n│   ├── Kim_Vocal_2.onnx\n│   ├── wav2vec2-base-960h\n│   ├── Wan2.1-Fun-V1.1-1.3B-InP\n│   └── StableAvatar-1.3B\n├── inference.py\n├── inference.sh\n├── train_1B_square.py\n├── train_1B_square.sh\n├── train_1B_vec_rec.py\n├── train_1B_vec_rec.sh\n├── audio_extractor.py\n├── vocal_seperator.py\n├── requirement.txt \n```\n\n### 🧱 音频提取\n给定目标视频文件 (.mp4)，可以使用以下命令获取对应的音频文件 (.wav)：\n```\npython audio_extractor.py --video_path=\"path\u002Ftest\u002Fvideo.mp4\" --saved_audio_path=\"path\u002Ftest\u002Faudio.wav\"\n```\n\n### 🧱 人声分离\n由于嘈杂的背景音乐可能会在一定程度上影响 StableAvatar 的表现，可以进一步从音频文件中分离出人声，以获得更好的唇形同步效果。\n给定音频文件 (.wav) 的路径，可以运行以下命令提取相应的人声信号：\n```\npip install audio-separator[gpu]\npython vocal_seperator.py --audio_separator_model_file=\"path\u002FStableAvatar\u002Fcheckpoints\u002FKim_Vocal_2.onnx\" --audio_file_path=\"path\u002Ftest\u002Faudio.wav\" --saved_vocal_path=\"path\u002Ftest\u002Fvocal.wav\"\n```\n\n### 🧱 基础模型推理\n我们提供了一个用于测试的示例配置文件 `inference.sh`。您也可以根据自己的需求轻松修改其中的各项配置。\n\n```\nbash inference.sh\n```\n基于 Wan2.1-1.3B 的 StableAvatar 支持三种不同分辨率设置下的音频驱动虚拟人视频生成：512×512、480×832 和 832×480。您可以在 `inference.sh` 中修改 `--width` 和 `--height` 来设置动画的分辨率。`--output_dir` 指的是生成动画的保存路径。`--validation_reference_path`、`--validation_driven_audio_path` 和 `--validation_prompts` 分别指代提供的参考图像路径、音频文件路径以及文本提示。\n\n提示词同样非常重要。建议按照以下格式填写：“[第一帧描述]-[人物行为描述]-[背景描述（可选）]”。\n\n`--pretrained_model_name_or_path`、`--pretrained_wav2vec_path` 和 `--transformer_path` 分别是预训练的 Wan2.1-1.3B 权重、Wav2Vec2.0 预训练权重以及 StableAvatar 预训练权重的路径。\n\n`--sample_steps`、`--overlap_window_length` 和 `--clip_sample_n_frames` 分别表示总的推理步数、两个上下文窗口之间的重叠长度，以及每个批次\u002F上下文窗口中合成的帧数。\n\n值得注意的是，推荐的 `--sample_steps` 范围为 [30-50]，步数越多，生成质量越高。而 `--overlap_window_length` 的推荐范围则是 [5-15]，较长的重叠长度会带来更高的质量，但也会降低推理速度。\n\n`--sample_text_guide_scale` 和 `--sample_audio_guide_scale` 分别是文本提示和音频的无分类指导尺度。对于提示和音频的 CFG 值，推荐范围为 [3-6]。您可以适当提高音频的 CFG 
值，以更好地实现唇形同步。\n\n此外，您还可以运行以下命令来启动 Gradio 界面：\n```\npython app.py\n```\n\n我们在 `path\u002FStableAvatar\u002Fexamples` 中提供了 6 个不同分辨率设置下的案例供您验证。❤️❤️欢迎尝试体验，并尽情享受无限长度虚拟人视频生成带来的无穷乐趣❤️❤️！\n\n#### 💡 小贴士\n- 基于 Wan2.1-1.3B 的 StableAvatar 权重有两个版本：`transformer3d-square.pt` 和 `transformer3d-rec-vec.pt`，它们分别在两种不同分辨率设置的视频数据集上进行训练。这两个版本都支持在三种不同分辨率下生成音频驱动的虚拟人视频：512×512、480×832 和 832×480。您可以通过修改 `inference.sh` 中的 `--transformer_path` 来切换这两个版本。\n\n- 如果您的 GPU 资源有限，可以通过修改 `inference.sh` 中的 `--GPU_memory_mode` 来调整 StableAvatar 的加载模式。`--GPU_memory_mode` 的选项包括 `model_full_load`、`sequential_cpu_offload`、`model_cpu_offload_and_qfloat8` 以及 `model_cpu_offload`。特别地，当您将 `--GPU_memory_mode` 设置为 `sequential_cpu_offload` 时，总的 GPU 内存消耗约为 3GB，但推理速度会较慢。\n将 `--GPU_memory_mode` 设置为 `model_cpu_offload` 可以显著减少 GPU 内存占用，相比 `model_full_load` 模式大约能节省一半的内存。\n\n- 如果您有多块 GPU，可以通过修改 `inference.sh` 中的 `--ulysses_degree` 和 `--ring_degree` 来进行多 GPU 推理以加速处理。例如，如果您有 8 张 GPU 卡，可以设置 `--ulysses_degree=4` 和 `--ring_degree=2`。需要注意的是，`ulysses_degree * ring_degree` 必须等于 GPU 总数或世界尺寸。此外，您还可以在 `inference.sh` 中添加 `--fsdp_dit` 来激活 DiT 中的 FSDP，进一步降低 GPU 内存占用。\n您可以运行以下命令：\n```\nbash multiple_gpu_inference.sh\n```\n在我的设置中，使用了 4 张 GPU 进行推理。\n\nStableAvatar 合成的视频不包含音频。如果您希望获得带有音频的高质量 MP4 文件，我们建议您在 `\u003Cb>output_path\u003C\u002Fb>` 上使用 ffmpeg 进行处理，具体命令如下：\n```\nffmpeg -i video_without_audio.mp4 -i \u002Fpath\u002Faudio.wav -c:v copy -c:a aac -shortest \u002Fpath\u002Foutput_with_audio.mp4\n```\n\n### 🧱 模型训练\n\u003Cb>🔥🔥需要注意的是，如果你打算训练一个条件化的视频扩散Transformer（DiT）模型，比如Wan2.1，那么本训练教程同样会对你有所帮助。🔥🔥\u003C\u002Fb>\n训练数据集需要按照以下结构进行组织：\n\n```\ntalking_face_data\u002F\n├── rec\n│   │  ├──speech\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  │  ├──frame_0.png\n│   │  │  │  │  ├──frame_1.png\n│   │  │  │  │  ├──frame_2.png\n│   │  │  │  │  ├──...\n│   │  │  │  ├──face_masks\n│   │  │  │  │  ├──frame_0.png\n│   │  │  │  │  ├──frame_1.png\n│   │  │  │  │  ├──frame_2.png\n│   │  │  │  │  ├──...\n│   │  │  │  ├──lip_masks\n│   │  │  │  │  ├──frame_0.png\n│   │  │  │  │  ├──frame_1.png\n│   │  │  │  │  ├──frame_2.png\n│   │  │  │  │  ├──...\n│   │  │  ├──00002\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n│   │  ├──singing\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n│   │  ├──dancing\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n├── vec\n│   │  ├──speech\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n│   │  ├──singing\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n│   │  ├──dancing\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n├── square\n│   │  ├──speech\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n│   │  ├──singing\n│   │  │  ├──00001\n│   │  
│  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n│   │  ├──dancing\n│   │  │  ├──00001\n│   │  │  │  ├──sub_clip.mp4\n│   │  │  │  ├──audio.wav\n│   │  │  │  ├──images\n│   │  │  │  ├──face_masks\n│   │  │  │  ├──lip_masks\n│   │  │  └──...\n├── video_rec_path.txt\n├── video_square_path.txt\n└── video_vec_path.txt\n```\nStableAvatar是在混合分辨率的视频上进行训练的，其中512x512分辨率的视频存储在`talking_face_data\u002Fsquare`中，480x832分辨率的视频存储在`talking_face_data\u002Fvec`中，而832x480分辨率的视频则存储在`talking_face_data\u002Frec`中。在`talking_face_data\u002Fsquare`、`talking_face_data\u002Frec`和`talking_face_data\u002Fvec`中的每个文件夹都包含三个子文件夹，分别存放不同类型的视频（演讲、歌唱和舞蹈）。\n所有`.png`格式的图像文件均以`frame_i.png`的形式命名，例如`frame_0.png`、`frame_1.png`等。\n`00001`、`00002`、`00003`等编号用于标识不同的视频信息。\n在三个子文件夹中，`images`用于存放RGB帧，`face_masks`用于存放相应的人脸掩码，而`lip_masks`则用于存放相应的嘴唇掩码。\n`sub_clip.mp4`和`audio.wav`分别对应于`images`中的RGB视频以及对应的音频文件。\n`video_square_path.txt`、`video_rec_path.txt`和`video_vec_path.txt`分别记录了`talking_face_data\u002Fsquare`、`talking_face_data\u002Frec`和`talking_face_data\u002Fvec`的文件夹路径。\n例如，`video_rec_path.txt`的内容如下：\n```\npath\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fspeech\u002F00001\npath\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fspeech\u002F00002\n...\npath\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fsinging\u002F00003\npath\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fsinging\u002F00004\n...\npath\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fdancing\u002F00005\npath\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fdancing\u002F00006\n...\n```\n如果你只有原始视频，可以使用`ffmpeg`从原始视频（如演讲视频）中提取帧，并将其存储在`images`子文件夹中。\n```\nffmpeg -i raw_video_1.mp4 -q:v 1 -start_number 0 path\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fspeech\u002F00001\u002Fimages\u002Fframe_%d.png\n```\n提取出的帧将被保存到`path\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fspeech\u002F00001\u002Fimages`目录下。\n关于人脸掩码的提取，请参考[StableAnimator仓库](https:\u002F\u002Fgithub.com\u002FFrancis-Rings\u002FStableAnimator)。该教程中的“人脸掩码提取”部分提供了现成的代码。\n对于嘴唇掩码的提取，可以运行以下命令：\n```\npip install mediapipe\npython lip_mask_extractor.py --folder_root=\"path\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fsinging\" --start=1 --end=500\n```\n`--folder_root`指定了训练数据集的根路径。\n`--start`和`--end`则分别指定了所选训练数据集的起始和结束索引。例如，`--start=1 --end=500`表示嘴唇提取将从`path\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fsinging\u002F00001`开始，到`path\u002FStableAvatar\u002Ftalking_face_data\u002Frec\u002Fsinging\u002F00500`结束。\n有关音频提取的详细信息，请参阅“音频提取”部分。\n当你的数据集完全按照上述结构组织好后，你就可以通过运行以下命令轻松地训练基于Wan2.1-1.3B的StableAvatar模型：\n```\n# 在单台机器上以单一分辨率（512x512）训练StableAvatar\nbash train_1B_square.sh\n# 在多台机器上以单一分辨率（512x512）训练StableAvatar\nbash train_1B_square_64.sh\n# 在单台机器上以混合分辨率（480x832和832x480）训练StableAvatar\nbash train_1B_rec_vec.sh\n\n# 在多台机器上以混合分辨率设置（480×832 和 832×480）训练 StableAvatar\nbash train_1B_rec_vec_64.sh\n```\n关于 `train_1B_square.sh` 和 `train_1B_rec_vec.sh` 的参数说明：`CUDA_VISIBLE_DEVICES` 指定使用的 GPU 设备。在我的配置中，我使用 4 张 NVIDIA A100 80G 显卡来训练 StableAvatar（`CUDA_VISIBLE_DEVICES=3,2,1,0`）。\n`--pretrained_model_name_or_path`、`--pretrained_wav2vec_path` 和 `--output_dir` 分别指预训练的 Wan2.1-1.3B 路径、预训练的 Wav2Vec2.0 路径以及训练好的 StableAvatar 检查点保存路径。\n`--train_data_square_dir`、`--train_data_rec_dir` 和 `--train_data_vec_dir` 分别是 `video_square_path.txt`、`video_rec_path.txt` 和 `video_vec_path.txt` 的文件路径。\n`--validation_reference_path` 和 `--validation_driven_audio_path` 
分别是验证用参考图像和驱动音频的文件路径。\n`--video_sample_n_frames` 是 StableAvatar 在单个批次中处理的帧数。\n`--num_train_epochs` 是训练轮数。值得注意的是，训练轮数的默认值被设置为无限，当您观察到 StableAvatar 达到最佳性能时，可以手动终止训练过程。\n对于 `train_1B_square_64.sh` 和 `train_1B_rec_vec_64.sh` 的参数说明，我们在 `path\u002FStableAvatar\u002Faccelerate_config\u002Faccelerate_config_machine_1B_multiple.yaml` 中设置了 GPU 配置。在我的配置中，训练环境由 8 个节点组成，每个节点配备 8 张 NVIDIA A100 80GB 显卡，用于训练 StableAvatar。\n\nStableAvatar 训练时的整体文件结构如下：\n```\nStableAvatar\u002F\n├── accelerate_config\n├── deepspeed_config\n├── talking_face_data\n├── examples\n├── wan\n├── checkpoints\n│   ├── Kim_Vocal_2.onnx\n│   ├── wav2vec2-base-960h\n│   ├── Wan2.1-Fun-V1.1-1.3B-InP\n│   └── StableAvatar-1.3B\n├── inference.py\n├── inference.sh\n├── train_1B_square.py\n├── train_1B_square.sh\n├── train_1B_vec_rec.py\n├── train_1B_vec_rec.sh\n├── audio_extractor.py\n├── vocal_seperator.py\n├── requirement.txt \n```\n\u003Cb>值得注意的是，由于采用混合分辨率（480×832 和 832×480）的训练流程，训练 StableAvatar 大约需要 50GB 的显存。\u003C\u002Fb>\n\u003Cb>然而，如果您仅使用 512×512 的视频进行训练，显存需求将降低至约 40GB。\u003C\u002Fb>\n此外，所选训练视频的背景应保持静止，这有助于扩散模型准确计算重建损失。\n音频应清晰，避免过多背景噪声。\n\n关于基于 Wan2.1-14B 的 StableAvatar 训练，您可以运行以下命令：\n```\n# 在多台机器上以混合分辨率设置（480×832、832×480 和 512×512）训练 StableAvatar\nhuggingface-cli download Wan-AI\u002FWan2.1-I2V-14B-480P --local-dir .\u002Fcheckpoints\u002FWan2.1-I2V-14B-480P\nhuggingface-cli download Wan-AI\u002FWan2.1-I2V-14B-720P --local-dir .\u002Fcheckpoints\u002FWan2.1-I2V-14B-720P # 可选\nbash train_14B.sh\n```\n我们使用 DeepSpeed Stage-2 来训练基于 Wan2.1-14B 的 StableAvatar。GPU 配置可以在 `path\u002FStableAvatar\u002Faccelerate_config\u002Faccelerate_config_machine_14B_multiple.yaml` 中修改。\nDeepSpeed 优化配置和调度器配置位于 `path\u002FStableAvatar\u002Fdeepspeed_config\u002Fzero_stage2_config.json`。\n值得注意的是，我们观察到基于 Wan2.1-1.3B 的 StableAvatar 已经能够合成无限长度的高质量头像视频。而基于 Wan2.1-14B 的骨干网络在训练过程中会显著增加推理延迟和 GPU 内存消耗，表明其性能与资源消耗之间的效率有限。\n\n您还可以运行以下命令进行 LoRA 微调：\n```\n# 在单台机器上以混合分辨率设置（480×832 和 832×480）训练 StableAvatar-1.3B\nbash train_1B_rec_vec_lora.sh\n# 在多台机器上以混合分辨率设置（480×832 和 832×480）训练 StableAvatar-1.3B\nbash train_1B_rec_vec_lora_64.sh\n# 在多台机器上以混合分辨率设置（480×832、832×480 和 512×512）训练基于 Wan2.1-14B 的 StableAvatar\nbash train_14B_lora.sh\n```\n您可以通过调整 `--rank` 和 `--network_alpha` 来控制 LoRA 微调的质量。\n\n如果您希望训练基于 Wan2.1-1.3B 或 Wan2.1-14B 的 720P 版本 StableAvatar，可以直接在 `train_1B_square.py`、`train_1B_vec_rec.py` 或 `train_14B.py` 中修改数据加载器的高度和宽度（480p→720p）。\n\n### 🧱 模型微调\n对于完整微调 StableAvatar，您可以在 `train_1B_rec_vec.sh` 或 `train_1B_rec_vec_64.sh` 中添加 `--transformer_path=\"path\u002FStableAvatar\u002Fcheckpoints\u002FStableAvatar-1.3B\u002Ftransformer3d-square.pt\"`：\n```\n# 在单台机器上以混合分辨率设置（480×832 和 832×480）微调 StableAvatar\nbash train_1B_rec_vec.sh\n# 在多台机器上以混合分辨率设置（480×832 和 832×480）微调 StableAvatar\nbash train_1B_rec_vec_64.sh\n```\n对于 LoRA 微调 StableAvatar，您可以在 `train_1B_rec_vec_lora.sh` 中添加 `--transformer_path=\"path\u002FStableAvatar\u002Fcheckpoints\u002FStableAvatar-1.3B\u002Ftransformer3d-square.pt\"`：\n```\n# 在单台机器上以混合分辨率设置（480×832 和 832×480）LoRA 微调 StableAvatar-1.3B\nbash train_1B_rec_vec_lora.sh\n```\n您可以通过调整 `--rank` 和 `--network_alpha` 来控制 LoRA 训练\u002F微调的质量。\n\n### 🧱 显存需求与运行时间\n\n对于一段 5 秒的视频（480×832，帧率 25fps），基础模型（`--GPU_memory_mode=\"model_full_load\"`) 大约需要 18GB 显存，并在 4090 显卡上 3 分钟内完成渲染。\n\n\u003Cb>🔥🔥理论上，StableAvatar 能够合成数小时的视频而不会出现明显的质量下降；然而，3D VAE 解码器对 GPU 显存的需求非常大，尤其是在解码 1 万帧以上时。您也可以选择在 CPU 上运行 VAE。🔥🔥\u003C\u002Fb>\n\n## 联系方式\n如果您有任何建议或认为我们的工作有所帮助，请随时与我联系。\n\n邮箱：francisshuyuan@gmail.com\n\n如果您觉得我们的工作有用，请\u003Cb>考虑给这个 GitHub 仓库点个赞 ⭐ 并引用它 
❤️\u003C\u002Fb>：\n```bib\n@article{tu2025stableavatar,\n  title={Stableavatar: Infinite-length audio-driven avatar video generation},\n  author={Tu, Shuyuan and Pan, Yueming and Huang, Yinming and Han, Xintong and Xing, Zhen and Dai, Qi and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},\n  journal={arXiv preprint arXiv:2508.08248},\n  year={2025}\n}\n```","# StableAvatar 快速上手指南\n\nStableAvatar 是一个端到端的视频扩散 Transformer 模型，能够根据参考图像和音频生成**无限长度**、高保真且身份一致的虚拟人视频。无需任何面部交换或修复等后处理工具。\n\n## 1. 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐) 或 Windows\n- **GPU**: NVIDIA 显卡 (支持 CUDA)\n  - 常规显卡：建议显存 16GB 以上（可通过降低分辨率或帧数适配更低显存）\n  - Blackwell 系列 (如 RTX 6000 Pro)：需特定版本依赖\n- **Python**: 3.10+\n\n### 前置依赖\n确保已安装 Git 和 FFmpeg（用于音视频处理）。\n\n## 2. 安装步骤\n\n### 步骤一：克隆项目\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FFrancis-Rings\u002FStableAvatar.git\ncd StableAvatar\n```\n\n### 步骤二：安装核心依赖\n\n**方案 A：常规 NVIDIA 显卡 (CUDA 12.4)**\n```bash\npip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.1.1 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124\npip install -r requirements.txt\n# 可选：安装 flash_attn 加速注意力计算\npip install flash_attn\n```\n\n**方案 B：Blackwell 系列新架构显卡 (CUDA 12.8)**\n```bash\npip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128\npip install -r requirements.txt\npip install flash_attn\n```\n\n> **国内加速提示**：若下载 PyTorch 或 HuggingFace 资源缓慢，可设置镜像环境变量：\n> ```bash\n> export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n> ```\n\n### 步骤三：下载模型权重\n使用 HuggingFace CLI 下载模型至 `checkpoints` 目录：\n\n```bash\npip install \"huggingface_hub[cli]\"\nmkdir checkpoints\nhuggingface-cli download FrancisRing\u002FStableAvatar --local-dir .\u002Fcheckpoints\n```\n\n下载完成后，目录结构应如下：\n```text\nStableAvatar\u002F\n├── checkpoints\n│   ├── Kim_Vocal_2.onnx\n│   ├── wav2vec2-base-960h\n│   ├── Wan2.1-Fun-V1.1-1.3B-InP\n│   └── StableAvatar-1.3B\n├── inference.sh\n└── ... (其他代码文件)\n```\n\n## 3. 基本使用\n\n### 步骤一：音频预处理（可选但推荐）\n为了获得更好的唇形同步效果，建议先从视频中提取音频并分离人声（去除背景音乐干扰）。\n\n1. **提取音频** (.mp4 -> .wav):\n```bash\npython audio_extractor.py --video_path=\"path\u002Ftest\u002Fvideo.mp4\" --saved_audio_path=\"path\u002Ftest\u002Faudio.wav\"\n```\n\n2. 
**分离人声** (.wav -> vocal.wav):\n```bash\npip install audio-separator[gpu]\npython vocal_seperator.py --audio_separator_model_file=\"checkpoints\u002FKim_Vocal_2.onnx\" --audio_file_path=\"path\u002Ftest\u002Faudio.wav\" --saved_vocal_path=\"path\u002Ftest\u002Fvocal.wav\"\n```\n\n### 步骤二：运行推理\n修改 `inference.sh` 脚本中的参数以匹配你的文件路径，然后执行：\n\n```bash\nbash inference.sh\n```\n\n**关键参数说明**（可在 `inference.sh` 中调整）：\n- `--validation_reference_path`: 参考人物图片路径。\n- `--validation_driven_audio_path`: 驱动音频路径（推荐使用分离后的人声）。\n- `--validation_prompts`: 文本提示词，格式建议为 `[首帧描述]-[人物行为描述]-[背景描述 (可选)]`。\n- `--width` \u002F `--height`: 分辨率，支持 `512x512`, `480x832`, `832x480`。\n- `--sample_steps`: 推理步数，推荐范围 `[30-50]`，步数越多质量越高。\n- `--overlap_window_length`: 滑动窗口重叠长度，推荐 `[5-15]`，越大越平滑但速度越慢。\n- `--output_dir`: 生成视频的保存路径。\n\n生成完成后，无限长度的虚拟人视频将保存在指定输出目录中。","某在线教育团队需要为历史课程快速制作由“爱因斯坦”形象讲解的系列短视频，以增强学生的沉浸感。\n\n### 没有 StableAvatar 时\n- **视频长度受限**：现有模型难以生成连贯的长视频，通常只能输出几秒片段，导致讲解内容支离破碎，需频繁拼接。\n- **人物身份丢失**：随着视频时长增加，生成的“爱因斯坦”面部特征逐渐扭曲或变形，无法保持角色一致性，影响教学严肃性。\n- **后期流程繁琐**：必须依赖 FaceFusion 等换脸工具或 GFP-GAN 等修复模型进行逐帧后处理，大幅增加了渲染时间和算力成本。\n- **音画同步生硬**：口型与语音匹配度不高，尤其在长句讲解时容易出现明显的延迟或错位，降低观看体验。\n\n### 使用 StableAvatar 后\n- **无限时长生成**：StableAvatar 作为端到端视频扩散 Transformer，可直接合成任意长度的连续视频，轻松容纳整段课程讲解。\n- **身份高度保真**：仅需一张参考图，即可在整个视频中完美锁定“爱因斯坦”的面部特征，确保角色形象始终如一。\n- **无需后处理**：生成结果直接可用，彻底省去了换脸和人脸修复等额外的后处理步骤，将制作周期从数小时缩短至分钟级。\n- **自然音画同步**：基于音频驱动的高精度生成能力，使得口型、表情与语音节奏天然契合，呈现逼真的授课状态。\n\nStableAvatar 通过端到端的无限长视频生成能力，让高质量数字人教学视频的制作变得像输入文本一样简单高效。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFrancis-Rings_StableAvatar_cf09a742.png","Francis-Rings","Francis","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FFrancis-Rings_7c51a0b2.jpg","Big fan of blue cats. Recently, I am interested in video generation\u002Fediting and video understanding. ","Microsoft Research",null,"FrancisRing_Tu","https:\u002F\u002Fgithub.com\u002FFrancis-Rings",[85,89],{"name":86,"color":87,"percentage":88},"Python","#3572A5",98.8,{"name":90,"color":91,"percentage":92},"Shell","#89e051",1.2,1226,108,"2026-04-14T09:45:00","MIT","Linux","必需 NVIDIA GPU。支持 CUDA 12.4 (PyTorch 2.6.0) 或 CUDA 12.8 (Blackwell 系列\u002FPyTorch 2.7.0)。显存需求未明确说明，但建议根据分辨率和帧数调整，若遇显存不足可降低输出分辨率或动画帧数。","未说明",{"notes":101,"python":102,"dependencies":103},"1. Blackwell 系列显卡（如 RTX 6000 Pro）需安装特定版本的 PyTorch (2.7.0) 和 CUDA 12.8。\n2. 可选安装 flash_attn 以加速注意力计算。\n3. 人声分离功能需额外安装 audio-separator[gpu] 并下载 Kim_Vocal_2.onnx 模型。\n4. 若连接 Hugging Face 困难，可设置环境变量 HF_ENDPOINT 使用镜像站。\n5. 
生成无限长视频时，若显存不足可适当减少动画帧数或降低输出分辨率 (支持 480x832, 832x480, 512x512)。","未说明 (需兼容 PyTorch 2.6.0\u002F2.7.0)",[104,105,106,107,108,109,110,111],"torch==2.6.0 (或 2.7.0)","torchvision==0.21.0 (或 0.22.0)","torchaudio==2.1.1 (或 2.7.0)","flash_attn (可选加速)","audio-separator[gpu]","huggingface_hub","accelerate","deepspeed",[16],[114,115,116],"aigc","avatar-generator","video-generation","2026-03-27T02:49:30.150509","2026-04-15T08:17:14.485077",[120,125,130,135,140,145],{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},34120,"为什么单卡和多卡推理的结果不一致？","这是一个已修复的问题。在多卡推理时，潜在张量（latent tensors）被分块分布到不同 GPU 上。在语音交叉注意力（vocal cross-attention）阶段，每个分块仍能访问完整的未分块音频，导致每块都试图将整段音频的影响融入其帧中，从而造成口型不一致。\n\n解决方案是在进入交叉注意力模块之前，先通过 `all_gather` 聚合所有 GPU 上的潜在张量，统一进行交叉注意力计算，然后再重新分块。该补丁已合并到最新构建中，请更新代码后重试。","https:\u002F\u002Fgithub.com\u002FFrancis-Rings\u002FStableAvatar\u002Fissues\u002F50",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},34121,"遇到 'assert error q.device.type == cuda and q.size(-1) \u003C= 256' 错误怎么办？","该错误通常由 Flash Attention 引起，特别是在加载 Wan2.1-I2V-14B-480P 模型时。由于 head_dim 计算结果为 640（5120\u002F8），超过了 Flash Attention 限制的 256。\n\n建议尝试使用 Wan2.1-Fun-V1.1-14B-InP 作为基础模型，维护者观察到其与 Wan2.1-I2V-14B-480P 的性能差异微乎其微，但该 InP 版本在框架中运行正常且不易报错。","https:\u002F\u002Fgithub.com\u002FFrancis-Rings\u002FStableAvatar\u002Fissues\u002F54",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},34122,"生成的视频画面出现乱码或花屏是怎么回事？","这通常是因为启用了 TeaCache 功能，目前该功能存在 Bug。\n\n解决方法是重启 `app.py` 并在设置中关闭 TeaCache 选项。关闭后生成的视频即可恢复正常。如果使用 `inference.sh` 脚本通常不会遇到此问题。","https:\u002F\u002Fgithub.com\u002FFrancis-Rings\u002FStableAvatar\u002Fissues\u002F31",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},34123,"音频过长导致显存不足或生成失败如何解决？","当音频较长（如 1 分 40 秒）时，去噪后的潜在张量数量巨大，会导致 VAE 解码器显存消耗过高。\n\n推荐解决方案：\n1. 将 VAE 解码器卸载到 CPU（offload to CPU）。\n2. 或者先将去噪后的潜在张量保存到本地磁盘，然后分批从磁盘加载到 VAE 解码器中进行独立解码。\n3. 临时测试时可尝试缩短音频长度（如改为 10 秒）以验证是否为时长问题。","https:\u002F\u002Fgithub.com\u002FFrancis-Rings\u002FStableAvatar\u002Fissues\u002F63",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},34124,"Windows 用户使用命令行生成的视频没有声音怎么办？","在 Windows 环境下，直接使用命令行参数运行 `inference.py` 可能会导致生成的视频缺失音频轨道。虽然具体修复命令未在评论中详述，但用户反馈通过 Gradio 界面生成通常能正常包含音频。建议 Windows 用户优先尝试使用 Gradio 界面进行推理，或检查 FFmpeg 是否正确安装并配置在环境变量中以确保音频合成正常。","https:\u002F\u002Fgithub.com\u002FFrancis-Rings\u002FStableAvatar\u002Fissues\u002F33",{"id":146,"question_zh":147,"answer_zh":148,"source_url":134},34125,"加载模型时出现关于缺失键（missing keys）的警告是否正常？","这是正常现象。由于 StableAvatar 的结构与原始的 Wan2.1 并不完全严格一致，因此在加载过程中会收到关于缺失键的警告。\n\n这些缺失的键会在下一步加载 StableAvatar 专用权重时被自动填充，因此可以安全忽略这些警告，不会影响最终生成效果。",[]]