[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Enemyx-net--VibeVoice-ComfyUI":3,"tool-Enemyx-net--VibeVoice-ComfyUI":62},[4,18,26,35,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,2,"2026-04-10T11:39:34",[14,15,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":32,"last_commit_at":41,"category_tags":42,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[43,13,15,14],"插件",{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 
API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[52,15,13,14],"语言模型",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,61],"视频",{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":76,"owner_location":77,"owner_email":76,"owner_twitter":76,"owner_website":76,"owner_url":78,"languages":79,"stars":84,"forks":85,"last_commit_at":86,"license":87,"difficulty_score":10,"env_os":88,"env_gpu":89,"env_ram":90,"env_deps":91,"category_tags":95,"github_topics":97,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":114,"updated_at":115,"faqs":116,"releases":144},8359,"Enemyx-net\u002FVibeVoice-ComfyUI","VibeVoice-ComfyUI","A comprehensive ComfyUI integration for Microsoft's VibeVoice text-to-speech model, enabling high-quality single and multi-speaker voice synthesis directly within your ComfyUI workflows.","VibeVoice-ComfyUI 是将微软 VibeVoice 文本转语音模型深度集成到 ComfyUI 工作流中的开源节点包。它让用户无需离开熟悉的可视化界面，即可直接生成高质量的自然语音，支持单人播报及多达四人的多角色对话模拟。\n\n该工具主要解决了传统 TTS 
流程中需频繁切换软件、难以灵活控制多角色互动及长文本处理的痛点。通过内置的自动文本分块、自定义停顿标签和节点串联功能，它能轻松应对复杂脚本的合成需求，并支持中断操作以提升调试效率。\n\n无论是 AI 视频创作者、播客制作人，还是希望在工作流中嵌入语音功能的开发者与设计师，都能从中受益。普通用户也可利用其直观的界面快速上手，实现从文本到生动音频的一站式制作。\n\n技术亮点方面，VibeVoice-ComfyUI 不仅支持声音克隆和 LoRA 微调以定制独特音色，还提供了丰富的性能优化选项。包括 4-bit\u002F8-bit 量化技术以大幅降低显存占用、Apple Silicon 原生加速支持，以及多种注意力机制可选，确保在不同硬件配置下均能平衡音质与生成速度。自包含的代码结构也使其部署更加简便稳定。","# VibeVoice ComfyUI Nodes\n\nA comprehensive ComfyUI integration for Microsoft's VibeVoice text-to-speech model, enabling high-quality single and multi-speaker voice synthesis directly within your ComfyUI workflows.\n\n## ✨ Features\n\n### Core Functionality\n- 🎤 **Single Speaker TTS**: Generate natural speech with optional voice cloning\n- 👥 **Multi-Speaker Conversations**: Support for up to 4 distinct speakers\n- 🎯 **Voice Cloning**: Clone voices from audio samples\n- 🎨 **LoRA Support**: Fine-tune voices with custom LoRA adapters (v1.4.0+)\n- 🎚️ **Voice Speed Control**: Adjust speech rate by modifying reference voice speed (v1.5.0+)\n- 📝 **Text File Loading**: Load scripts from text files\n- 📚 **Automatic Text Chunking**: Handles long texts seamlessly with configurable chunk size\n- ⏸️ **Custom Pause Tags**: Insert silences with `[pause]` and `[pause:ms]` tags (wrapper feature)\n- 🔄 **Node Chaining**: Connect multiple VibeVoice nodes for complex workflows\n- ⏹️ **Interruption Support**: Cancel operations before or between generations\n- 🔧 **Flexible Configuration**: Control temperature, sampling, and guidance scale\n\n### Performance & Optimization\n- ⚡ **Attention Mechanisms**: Choose between auto, eager, sdpa, flash_attention_2 or sage\n- 🎛️ **Diffusion Steps**: Adjustable quality vs speed trade-off (default: 20)\n- 💾 **Memory Management**: Toggle automatic VRAM cleanup after generation\n- 🧹 **Free Memory Node**: Manual memory control for complex workflows\n- 🍎 **Apple Silicon Support**: Native GPU acceleration on M1\u002FM2\u002FM3 Macs via MPS\n- 🔢 **8-Bit Quantization**: Perfect audio quality with high 
VRAM reduction\n- 🔢 **4-Bit Quantization**: Maximum VRAM savings with minimal quality loss\n\n### Compatibility & Installation\n- 📦 **Self-Contained**: Embedded VibeVoice code, no external dependencies\n- 🔄 **Universal Compatibility**: Adaptive support for transformers v4.51.3+\n- 🖥️ **Cross-Platform**: Works on Windows, Linux, and macOS\n- 🎮 **Multi-Backend**: Supports CUDA, CPU, and MPS (Apple Silicon)\n\n## 🎥 Video Demo\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=fIBMepIBKhI\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FEnemyx-net_VibeVoice-ComfyUI_readme_459aa9afc473.jpg\" alt=\"VibeVoice ComfyUI Wrapper Demo\" \u002F>\n  \u003C\u002Fa>\n  \u003Cbr>\n  \u003Cstrong>Click to watch the demo video\u003C\u002Fstrong>\n\u003C\u002Fp>\n\n## 📦 Installation\n\n### Automatic Installation (Recommended)\n1. Clone this repository into your ComfyUI custom nodes folder:\n```bash\ncd ComfyUI\u002Fcustom_nodes\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\n```\n\n2. Restart ComfyUI - the nodes will automatically install requirements on first use\n\n## 📥 Model Installation\n\n### Manual Download Required\nStarting from version 1.6.0, models and tokenizer must be manually downloaded and placed in the correct folder. 
The wrapper no longer downloads them automatically.\n\n### Download Links\n\n#### Models\nYou can download VibeVoice models from HuggingFace:\n\n| Model                  | Size   | Download Link |\n|------------------------|--------|---------------|\n| **VibeVoice-1.5B**     | ~5.4GB | [microsoft\u002FVibeVoice-1.5B](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FVibeVoice-1.5B) |\n| **VibeVoice-Large**    | ~18.7GB | [aoi-ot\u002FVibeVoice-Large](https:\u002F\u002Fhuggingface.co\u002Faoi-ot\u002FVibeVoice-Large) |\n| **VibeVoice-Large-Q8** | ~11.6GB | [FabioSarracino\u002FVibeVoice-Large-Q8](https:\u002F\u002Fhuggingface.co\u002FFabioSarracino\u002FVibeVoice-Large-Q8) |\n| **VibeVoice-Large-Q4** | ~6.6GB | [DevParker\u002FVibeVoice7b-low-vram](https:\u002F\u002Fhuggingface.co\u002FDevParker\u002FVibeVoice7b-low-vram) |\n\n#### Tokenizer (Required)\nVibeVoice uses the Qwen2.5-1.5B tokenizer:\n- Download from: [Qwen2.5-1.5B Tokenizer](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-1.5B\u002Ftree\u002Fmain)\n- Required files: `tokenizer_config.json`, `vocab.json`, `merges.txt`, `tokenizer.json`\n\n### Installation Steps\n1. Create the models folder if it doesn't exist:\n   ```\n   ComfyUI\u002Fmodels\u002Fvibevoice\u002F\n   ```\n\n2. Download and organize files in the vibevoice folder:\n   ```\n   ComfyUI\u002Fmodels\u002Fvibevoice\u002F\n   ├── tokenizer\u002F                 # Place Qwen tokenizer files here\n   │   ├── tokenizer_config.json\n   │   ├── vocab.json\n   │   ├── merges.txt\n   │   └── tokenizer.json\n   ├── VibeVoice-1.5B\u002F           # Model folder\n   │   ├── config.json\n   │   ├── model-00001-of-00003.safetensors\n   │   ├── model-00002-of-00003.safetensors\n   │   └── ... (other model files)\n   ├── VibeVoice-Large\u002F\n   │   └── ... (model files)\n   └── my-custom-vibevoice\u002F      # custom names are supported\n       └── ... (model files)\n   ```\n\n3. 
For models downloaded from HuggingFace using git-lfs or the HF CLI, you can also use the cache structure:\n   ```\n   ComfyUI\u002Fmodels\u002Fvibevoice\u002F\n   └── models--microsoft--VibeVoice-1.5B\u002F\n       └── snapshots\u002F\n           └── [hash]\u002F\n               └── ... (model files)\n   ```\n\n4. Refresh your browser - the models will appear in the dropdown menu\n\n### Notes\n- The dropdown will show user-friendly names extracted from folder names\n- Both regular folders and HuggingFace cache structures are supported\n- Models are rescanned on every browser refresh\n- Quantized models are automatically detected from their config files\n- The tokenizer is searched in this priority order:\n  1. `ComfyUI\u002Fmodels\u002Fvibevoice\u002Ftokenizer\u002F` (recommended)\n  2. `ComfyUI\u002Fmodels\u002Fvibevoice\u002Fmodels--Qwen--Qwen2.5-1.5B\u002F` (if present from a previous installation)\n  3. HuggingFace cache (if available)\n\n## 🔧 Available Nodes\n\n### 1. VibeVoice Load Text From File\nLoads text content from files in ComfyUI's input\u002Foutput\u002Ftemp directories.\n- **Supported formats**: .txt\n- **Output**: Text string for TTS nodes\n\n### 2. VibeVoice Single Speaker\nGenerates speech from text using a single voice.\n- **Text Input**: Direct text or connection from Load Text node\n- **Models**: Select from available models in dropdown menu\n- **Voice Cloning**: Optional audio input for voice cloning\n- **Parameters** (in order):\n  - `text`: Input text to convert to speech\n  - `model`: Select from dropdown list of available models found in `ComfyUI\u002Fmodels\u002Fvibevoice\u002F`\n  - `attention_type`: auto, eager, sdpa, flash_attention_2 or sage (default: auto)\n  - `quantize_llm`: Dynamically quantize only the LLM component for non-quantized models. Options: \"full precision\" (default), \"4bit\", or \"8bit\". 4-bit provides major VRAM savings with minimal quality loss. 8-bit provides a good balance between quality and memory usage. 
Requires CUDA GPU. Ignored for pre-quantized models.\n  - `free_memory_after_generate`: Free VRAM after generation (default: True)\n  - `diffusion_steps`: Number of denoising steps (5-100, default: 20)\n  - `seed`: Random seed for reproducibility (default: 42)\n  - `cfg_scale`: Classifier-free guidance (1.0-2.0, default: 1.3)\n  - `use_sampling`: Enable sampling mode (default: False; when False, generation is deterministic)\n- **Optional Parameters**:\n  - `voice_to_clone`: Audio input for voice cloning\n  - `lora`: LoRA configuration from VibeVoice LoRA node\n  - `temperature`: Sampling temperature (0.1-2.0, default: 0.95)\n  - `top_p`: Nucleus sampling parameter (0.1-1.0, default: 0.95)\n  - `max_words_per_chunk`: Maximum words per chunk for long texts (100-500, default: 250)\n  - `voice_speed_factor`: Speech rate adjustment (0.8-1.2, default: 1.0, step: 0.01)\n\n### 3. VibeVoice Multiple Speakers\nGenerates multi-speaker conversations with distinct voices.\n- **Speaker Format**: Use `[N]:` notation where N is 1-4\n- **Voice Assignment**: Optional voice samples for each speaker\n- **Recommended Model**: VibeVoice-Large for better multi-speaker quality\n- **Parameters** (in order):\n  - `text`: Input text with speaker labels\n  - `model`: Select from dropdown list of available models found in `ComfyUI\u002Fmodels\u002Fvibevoice\u002F`\n  - `attention_type`: auto, eager, sdpa, flash_attention_2 or sage (default: auto)\n  - `quantize_llm`: Dynamically quantize only the LLM component for non-quantized models. Options: \"full precision\" (default), \"4bit\", or \"8bit\". 4-bit provides major VRAM savings with minimal quality loss. 8-bit provides a good balance between quality and memory usage. Requires CUDA GPU. 
Ignored for pre-quantized models.\n  - `free_memory_after_generate`: Free VRAM after generation (default: True)\n  - `diffusion_steps`: Number of denoising steps (5-100, default: 20)\n  - `seed`: Random seed for reproducibility (default: 42)\n  - `cfg_scale`: Classifier-free guidance (1.0-2.0, default: 1.3)\n  - `use_sampling`: Enable sampling mode (default: False; when False, generation is deterministic)\n- **Optional Parameters**:\n  - `speaker1_voice` to `speaker4_voice`: Audio inputs for voice cloning\n  - `lora`: LoRA configuration from VibeVoice LoRA node\n  - `temperature`: Sampling temperature (0.1-2.0, default: 0.95)\n  - `top_p`: Nucleus sampling parameter (0.1-1.0, default: 0.95)\n  - `voice_speed_factor`: Speech rate adjustment for all speakers (0.8-1.2, default: 1.0, step: 0.01)\n\n### 4. VibeVoice Free Memory\nManually frees all loaded VibeVoice models from memory.\n- **Input**: `audio` - Connect audio output to trigger memory cleanup\n- **Output**: `audio` - Passes through the input audio unchanged\n- **Use Case**: Insert between nodes to free VRAM\u002FRAM at specific workflow points\n- **Example**: `[VibeVoice Node] → [Free Memory] → [Save Audio]`\n\n### 5. 
VibeVoice LoRA\nConfigure and load custom LoRA adapters for fine-tuned VibeVoice models.\n- **LoRA Selection**: Dropdown menu with available LoRA adapters\n- **LoRA Location**: Place your LoRA folders in `ComfyUI\u002Fmodels\u002Fvibevoice\u002Floras\u002F`\n- **Parameters**:\n  - `lora_name`: Select from available LoRA adapters or \"None\" to disable\n  - `llm_strength`: Strength of the language model LoRA (0.0-2.0, default: 1.0)\n  - `use_llm`: Apply language model LoRA component (default: True)\n  - `use_diffusion_head`: Apply diffusion head replacement (default: True)\n  - `use_acoustic_connector`: Apply acoustic connector LoRA (default: True)\n  - `use_semantic_connector`: Apply semantic connector LoRA (default: True)\n- **Output**: `lora` - LoRA configuration to connect to speaker nodes\n- **Usage**: `[VibeVoice LoRA] → [Single\u002FMultiple Speaker Node]`\n\n## 💬 Multi-Speaker Text Format\n\nFor multi-speaker generation, format your text using the `[N]:` notation:\n\n```\n[1]: Hello, how are you today?\n[2]: I'm doing great, thanks for asking!\n[1]: That's wonderful to hear.\n[3]: Hey everyone, mind if I join the conversation?\n[2]: Not at all, welcome!\n```\n\n**Important Notes:**\n- Use `[1]:`, `[2]:`, `[3]:`, `[4]:` for speaker labels\n- Maximum 4 speakers supported\n- The system automatically detects the number of speakers from your text\n- Each speaker can have an optional voice sample for cloning\n\n## 🧠 Model Information\n\n### VibeVoice-1.5B\n- **Size**: ~5.4GB download\n- **VRAM**: ~6GB\n- **Speed**: Faster inference\n- **Quality**: Good for single speaker\n- **Use Case**: Quick prototyping, single voices\n\n### VibeVoice-Large\n- **Size**: ~18.7GB download\n- **VRAM**: ~20GB\n- **Speed**: Slower inference but optimized\n- **Quality**: Best available quality (full precision)\n- **Use Case**: Highest quality production, multi-speaker conversations\n- **Note**: Latest official release from Microsoft\n\n### VibeVoice-Large-Q8\n- **Size**: ~11.6GB 
download (38% reduction from full model)\n- **VRAM**: ~12GB (40% reduction from full precision)\n- **Speed**: Balanced inference\n- **Quality**: Identical to full precision - perfect audio preservation\n- **Use Case**: Production-quality audio with 12GB VRAM GPUs (RTX 3060, 4070 Ti, etc.)\n- **Quantization**: Selective 8-bit - only LLM quantized, audio components at full precision\n- **Note**: Quantized by Fabio Sarracino\n\n### VibeVoice-Large-Q4\n- **Size**: ~6.6GB download\n- **VRAM**: ~8GB\n- **Speed**: Balanced inference\n- **Quality**: Good quality with minimal loss\n- **Use Case**: Maximum VRAM savings for lower-end GPUs\n- **Note**: Quantized by DevParker\n\nModels must be downloaded manually and placed in `ComfyUI\u002Fmodels\u002Fvibevoice\u002F` (see the Model Installation section above).\n\n## ⚙️ Generation Modes\n\n### Deterministic Mode (Default)\n- `use_sampling = False`\n- Produces consistent, stable output\n- Recommended for production use\n\n### Sampling Mode\n- `use_sampling = True`\n- More variation in output\n- Uses temperature and top_p parameters\n- Good for creative exploration\n\n## 🎯 Voice Cloning\n\nTo clone a voice:\n1. Connect an audio node to the `voice_to_clone` input (single speaker)\n2. Or connect to `speaker1_voice`, `speaker2_voice`, etc. (multi-speaker)\n3. The model will attempt to match the voice characteristics\n\n**Requirements for voice samples:**\n- Clear audio with minimal background noise\n- Minimum 3–10 seconds; at least 30 seconds is recommended for better quality\n- Automatically resampled to 24kHz\n\n## 🎨 LoRA Support\n\n### Overview\nStarting from version 1.4.0, VibeVoice ComfyUI supports custom LoRA (Low-Rank Adaptation) adapters for fine-tuning voice characteristics. This allows you to train and use specialized voice models while maintaining the base VibeVoice capabilities.\n\n### Setting Up LoRA Adapters\n\n1. 
**LoRA Directory Structure**:\n   Place your LoRA adapter folders in: `ComfyUI\u002Fmodels\u002Fvibevoice\u002Floras\u002F`\n   ```\n   ComfyUI\u002F\n   └── models\u002F\n       └── vibevoice\u002F\n           └── loras\u002F\n               ├── my_custom_voice\u002F\n               │   ├── adapter_config.json\n               │   ├── adapter_model.safetensors\n               │   └── diffusion_head\u002F  (optional)\n               ├── character_voice\u002F\n               └── style_adaptation\u002F\n   ```\n\n2. **Required Files**:\n   - `adapter_config.json`: LoRA configuration\n   - `adapter_model.safetensors` or `adapter_model.bin`: Model weights\n   - Optional components:\n     - `diffusion_head\u002F`: Custom diffusion head weights\n     - `acoustic_connector\u002F`: Acoustic connector adaptation\n     - `semantic_connector\u002F`: Semantic connector adaptation\n\n### Using LoRA in ComfyUI\n\n1. **Add VibeVoice LoRA Node**:\n   - Create a \"VibeVoice LoRA\" node in your workflow\n   - Select your LoRA from the dropdown menu\n   - Configure component settings and strength\n\n2. **Connect to Speaker Nodes**:\n   - Connect the LoRA node's output to the `lora` input of speaker nodes\n   - Both Single Speaker and Multiple Speakers nodes support LoRA\n\n3. 
**LoRA Parameters**:\n   - **llm_strength**: Controls the influence of the language model LoRA (0.0-2.0)\n   - **Component toggles**: Enable\u002Fdisable specific LoRA components\n   - Select \"None\" to disable LoRA and use the base model\n\n### Training Your Own LoRA\n\nTo create custom LoRA adapters for VibeVoice, use the official fine-tuning repository:\n- **Repository**: [VibeVoice Fine-tuning](https:\u002F\u002Fgithub.com\u002Fvoicepowered-ai\u002FVibeVoice-finetuning)\n- **Features**:\n  - Parameter-efficient fine-tuning\n  - Support for custom datasets\n  - Adjustable LoRA rank and scaling\n  - Optional diffusion head adaptation\n\n### Best Practices\n\n- **Voice Consistency**: Use the same LoRA across all chunks for long texts\n- **Memory Management**: LoRA adds minimal memory overhead (~100-500MB)\n- **Compatibility**: LoRA adapters are compatible with all VibeVoice model variants\n- **Strength Tuning**: Start with default strength (1.0) and adjust based on results\n\n### Compatibility Note\n\n⚠️ **Transformers Version**: The LoRA implementation was developed and tested with `transformers==4.51.3`. While our wrapper supports `transformers>=4.51.3`, LoRA functionality with newer versions of transformers is not guaranteed. If you experience issues with LoRA loading, consider using `transformers==4.51.3` specifically:\n```bash\npip install transformers==4.51.3\n```\n\n### 🙏 Credits\n\nLoRA implementation by [@jpgallegoar](https:\u002F\u002Fgithub.com\u002Fjpgallegoar) (PR #127)\n\n## 🎚️ Voice Speed Control\n\n### Overview\nThe Voice Speed Control feature allows you to influence the speaking rate of generated speech by adjusting the speed of the reference voice. 
This feature modifies the input voice sample before processing, causing the model to learn and reproduce the altered speech rate.\n\n**Available from version 1.5.0**\n\n### How It Works\nThe system applies time-stretching to the reference voice audio:\n- Values \u003C 1.0 slow down the reference voice, resulting in slower generated speech\n- Values > 1.0 speed up the reference voice, resulting in faster generated speech\n- The model learns from the modified voice characteristics and generates speech at a similar pace\n\n### Usage\n- **Parameter**: `voice_speed_factor`\n- **Range**: 0.8 to 1.2\n- **Default**: 1.0 (normal speed)\n- **Step**: 0.01 (1% increments)\n\n### Recommended Settings\n- **Optimal Range**: 0.95 to 1.05 for natural-sounding results\n- **Slower Speech**: Try 0.95 (5% slower) or 0.97 (3% slower)\n- **Faster Speech**: Try 1.03 (3% faster) or 1.05 (5% faster)\n- **Best Results**: Provide reference audio of at least 20 seconds for more accurate speed matching\n\n### Important Notes\n- The effect works best with longer reference audio samples (20+ seconds recommended)\n- Extreme values (\u003C 0.9 or > 1.1) may produce unnatural-sounding speech\n- In Multi Speaker mode, the speed adjustment applies to all speakers equally\n- Synthetic voices (when no audio is provided) are not affected by this parameter\n\n### 📖 Examples\n```\n# Single Speaker\nvoice_speed_factor: 0.95  # Slightly slower, more deliberate speech\nvoice_speed_factor: 1.05  # Slightly faster, more energetic speech\n\n# Multi Speaker\nvoice_speed_factor: 0.98  # All speakers talk 2% slower\nvoice_speed_factor: 1.02  # All speakers talk 2% faster\n```\n\n## ⏸️ Pause Tags Support\n\n### Overview\nThe VibeVoice wrapper includes a custom pause tag feature that allows you to insert silences between text segments. 
**This is NOT a standard Microsoft VibeVoice feature** - it's an original implementation of our wrapper to provide more control over speech pacing.\n\n**Available from version 1.3.0**\n\n### Usage\nYou can use two types of pause tags in your text:\n- `[pause]` - Inserts a 1-second silence (default)\n- `[pause:ms]` - Inserts a custom duration silence in milliseconds (e.g., `[pause:2000]` for 2 seconds)\n\n### 📖 Examples\n\n#### Single Speaker\n```\nWelcome to our presentation. [pause] Today we'll explore artificial intelligence. [pause:500] Let's begin!\n```\n\n#### Multi-Speaker  \n```\n[1]: Hello everyone [pause] how are you doing today?\n[2]: I'm doing great! [pause:500] Thanks for asking.\n[1]: Wonderful to hear!\n```\n\n### Important Notes\n\n⚠️ **Context Limitation Warning**:\n> **Note: The pause forces the text to be split into chunks. This may worsen the model's ability to understand the context. The model's context is represented ONLY by its own chunk.**\n\nThis means:\n- Text before a pause and text after a pause are processed separately\n- The model cannot see across pause boundaries when generating speech\n- This may affect prosody and intonation consistency\n\n### How It Works\n1. The wrapper parses your text to find pause tags\n2. Text segments between pauses are processed independently\n3. Silence audio is generated for each pause duration\n4. All audio segments (speech and silence) are concatenated\n\n### Best Practices\n- Use pauses at natural breaking points (end of sentences, paragraphs)\n- Avoid pauses in the middle of phrases where context is important\n- Test different pause durations to find what sounds most natural\n\n## 💡 Tips for Best Results\n\n1. **Text Preparation**:\n   - Use proper punctuation for natural pauses\n   - Break long texts into paragraphs\n   - For multi-speaker, ensure clear speaker transitions\n   - Use pause tags sparingly to maintain context continuity\n\n2. 
**Model Selection**:\n   - Use 1.5B for quick single-speaker tasks (fastest, ~8GB VRAM)\n   - Use Large for absolute best quality (~20GB VRAM)\n   - Use Large-Q8 for production quality with 12GB VRAM (perfect audio, 38% smaller)\n   - Use Large-Q4 for maximum VRAM savings (~8GB VRAM)\n\n3. **Seed Management**:\n   - Default seed (42) works well for most cases\n   - Save good seeds for consistent character voices\n   - Try random seeds if default doesn't work well\n\n4. **Performance**:\n   - Models must be downloaded manually before first use (see Model Installation)\n   - Once downloaded, models load from local disk\n   - GPU recommended for faster inference\n\n## 💻 System Requirements\n\n### Hardware\n- **Minimum**: 8GB VRAM for VibeVoice-1.5B\n- **Recommended**: 17GB+ VRAM for VibeVoice-Large\n- **RAM**: 16GB+ system memory\n\n### Software\n- Python 3.8+\n- PyTorch 2.0+\n- CUDA 11.8+ (for GPU acceleration)\n- Transformers 4.51.3+\n- ComfyUI (latest version)\n\n## 🔧 Troubleshooting\n\n### Installation Issues\n- Ensure you're using ComfyUI's Python environment\n- Try manual installation if automatic fails\n- Restart ComfyUI after installation\n\n### Generation Issues\n- If voices sound unstable, try deterministic mode\n- For multi-speaker, ensure text has proper `[N]:` format\n- Check that speaker numbers are sequential (1,2,3 not 1,3,5)\n\n### Memory Issues\n- Large model requires ~20GB VRAM\n- Use 1.5B model for lower VRAM systems\n- Models use bfloat16 precision for efficiency\n\n## 📖 Examples\n\n### Single Speaker\n```\nText: \"Welcome to our presentation. 
Today we'll explore the fascinating world of artificial intelligence.\"\nModel: [Select from available models]\ncfg_scale: 1.3\nuse_sampling: False\n```\n\n### Two Speakers\n```\n[1]: Have you seen the new AI developments?\n[2]: Yes, they're quite impressive!\n[1]: I think voice synthesis has come a long way.\n[2]: Absolutely, it sounds so natural now.\n```\n\n### Four Speaker Conversation\n```\n[1]: Welcome everyone to our meeting.\n[2]: Thanks for having us!\n[3]: Glad to be here.\n[4]: Looking forward to the discussion.\n[1]: Let's begin with the agenda.\n```\n\n## 📊 Performance Benchmarks\n\n| Model              | VRAM Usage | Context Length | Max Audio Duration |\n|--------------------|------------|----------------|-------------------|\n| VibeVoice-1.5B     | ~6GB       | 64K tokens | ~90 minutes |\n| VibeVoice-Large | ~20GB      | 32K tokens | ~45 minutes |\n| VibeVoice-Large-Q8 | ~12GB      | 32K tokens | ~45 minutes |\n| VibeVoice-Large-Q4 | ~8GB       | 32K tokens | ~45 minutes |\n\n## ⚠️ Known Limitations\n\n- Maximum 4 speakers in multi-speaker mode\n- Works best with English and Chinese text\n- Some seeds may produce unstable output\n- Background music generation cannot be directly controlled\n\n## 📄 License\n\nThis ComfyUI wrapper is released under the MIT License. See LICENSE file for details.\n\n**Note**: The VibeVoice model itself is subject to Microsoft's licensing terms:\n- VibeVoice is for research purposes only\n- Check Microsoft's VibeVoice repository for full model license details\n\n## 🔗 Links\n\n- [Original VibeVoice Repository](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FVibeVoice) - Official Microsoft VibeVoice repository (currently unavailable)\n\n## 🙏 Credits\n\n- **VibeVoice Model**: Microsoft Research\n- **ComfyUI Integration**: Fabio Sarracino\n- **Base Model**: Built on Qwen2.5 architecture\n\n## 💬 Support\n\nFor issues or questions:\n1. Check the troubleshooting section\n2. Review ComfyUI logs for error messages\n3. 
Ensure VibeVoice is properly installed\n4. Open an issue with detailed error information\n\n## 🤝 Contributing\n\nContributions welcome! Please:\n1. Test changes thoroughly\n2. Follow existing code style\n3. Update documentation as needed\n4. Submit pull requests with clear descriptions\n\n## 📝 Changelog\n\n### Version 1.8.1\n- Forced installation of the bitsandbytes>=0.48.1 library as version 0.48.0 has a critical bug that prevents the Q8 model from working.\n- Bug Fixing\n\n### Version 1.8.0\n- **New Official 8-bit Quantized Model**: VibeVoice-Large-Q8\n  - Released on HuggingFace: [FabioSarracino\u002FVibeVoice-Large-Q8](https:\u002F\u002Fhuggingface.co\u002FFabioSarracino\u002FVibeVoice-Large-Q8)\n  - Model size: 11.6GB (38% reduction from 18.7GB full precision)\n  - VRAM usage: ~12GB (40% reduction from ~20GB)\n  - **Perfect audio quality**: Identical to full precision model - no quality degradation\n  - **Selective quantization approach**: audio-critical components (diffusion head, VAE, connectors) kept at full precision\n  - Optimized for 12GB VRAM GPUs (RTX 3060, 4070 Ti, etc.)\n  - Solves the common 8-bit \"noise problem\" by carefully selecting which components to quantize\n- **Added 8-bit Dynamic LLM Quantization**\n  - New \"8bit\" option in `quantize_llm` parameter for both Single and Multiple Speaker nodes\n  - Options now: \"full precision\" (default), \"4bit\", \"8bit\"\n  - Dynamically quantizes only the LLM component for non-quantized models\n  - Skips all audio-critical components (diffusion_head, acoustic\u002Fsemantic connectors, tokenizers)\n  - Provides good balance between quality and VRAM savings\n  - Requires CUDA GPU and bitsandbytes library\n  - Automatically ignored for pre-quantized models\n\n### Version 1.7.0\n- Added dynamic LLM-only 4-bit quantization for non-quantized models\n  - New `quantize_llm` parameter in both Single and Multiple Speaker nodes\n  - Options: \"full precision\" (default) or \"4bit\"\n  - Quantizes only the 
language model component while keeping diffusion head at full precision\n  - Significantly faster generation with major VRAM savings\n  - Minimal quality loss compared to full precision\n  - Requires CUDA GPU for quantization\n  - Automatically ignored for pre-quantized models\n  - Uses NF4 (4-bit NormalFloat) quantization type optimized for neural networks\n\n### Version 1.6.3\n- Fixed tokenizer initialization error\n  - Resolved `TypeError: expected str, bytes or os.PathLike object, not NoneType` when loading processor\n  - Added robust fallback mechanism for tokenizer file path resolution\n  - Improved handling of vocab.json and merges.txt file loading\n  - Enhanced error handling for edge cases in tokenizer initialization\n\n### Version 1.6.2\n- Fixed tokenizer loading issue where HuggingFace cache could interfere with local files\n- Tokenizer now loads directly from specified path, avoiding cache conflicts\n- Added explicit file path loading for better reliability\n- Improved logging to show which tokenizer files are being used\n\n### Version 1.6.1\n- Improved integration by removing unnecessary HuggingFace settings\n\n### Version 1.6.0\n- **Major Change**: Removed automatic model downloading from HuggingFace\n  - Models must now be manually downloaded and placed in `ComfyUI\u002Fmodels\u002Fvibevoice\u002F`\n  - Dynamic model dropdown that scans available models on each browser refresh\n  - Support for custom folder names and HuggingFace cache structure\n  - Automatic detection of quantized models from config files\n  - Better user control over model management\n  - Eliminates authentication issues with private HuggingFace repos\n- **Improved Logging System**:\n  - Optimized logging to reduce console clutter\n  - Cleaner output for better user experience\n\n### Version 1.5.0\n- Added Voice Speed Control feature for adjusting speech rate\n  - New `voice_speed_factor` parameter in both Single and Multi Speaker nodes\n  - Time-stretching applied to reference 
audio to influence output speech rate\n  - Range: 0.8 to 1.2 with 0.01 step increments\n  - Recommended range: 0.95 to 1.05 for natural results\n  - Best results with 20+ seconds of reference audio\n\n### Version 1.4.3\n- Improved LoRA system with better logging and compatibility checks\n  - Added model compatibility detection to prevent mismatched LoRA loading\n  - Enhanced debug logging for LoRA component loading process\n  - Automatic detection and clear error messages for incompatible model-LoRA combinations\n  - Prevents loading errors when using quantized models with standard LoRAs\n  - Minor optimizations to LoRA weight loading process\n\n### Version 1.4.2\n- Bug Fixing\n\n### Version 1.4.1\n- Fixed HuggingFace authentication error when loading locally cached models\n  - Resolved 401 authorization errors for already downloaded models\n  - Node now correctly uses local model snapshots without requiring HuggingFace API authentication\n  - Prevents unnecessary API calls when models exist in `ComfyUI\u002Fmodels\u002Fvibevoice\u002F`\n\n### Version 1.4.0\n- Added LoRA (Low-Rank Adaptation) support for fine-tuned models\n  - New \"VibeVoice LoRA\" node for configuring custom voice adaptations\n  - Support for language model, diffusion head, and connector adaptations\n  - Dropdown menu for easy LoRA selection from `ComfyUI\u002Fmodels\u002Fvibevoice\u002Floras\u002F`\n  - Adjustable LoRA strength and component toggles\n  - Compatible with both Single and Multiple Speaker nodes\n  - Minimal memory overhead (~100-500MB per LoRA)\n  - Credits: Implementation by [@jpgallegoar](https:\u002F\u002Fgithub.com\u002Fjpgallegoar)\n\n### Version 1.3.0\n- Added custom pause tag support for speech pacing control\n  - New `[pause]` tag for 1-second silence (default)\n  - New `[pause:ms]` tag for custom duration in milliseconds (e.g., `[pause:2000]` for 2 seconds)\n  - Works with both Single Speaker and Multiple Speakers nodes\n  - Automatically splits text at pause points while 
maintaining voice consistency\n  - Note: This is a wrapper feature, not part of Microsoft's VibeVoice\n\n### Version 1.2.5\n- Bug Fixing\n\n### Version 1.2.4\n- Added automatic text chunking for long texts in Single Speaker node\n  - Single Speaker node now automatically splits texts longer than 250 words to prevent audio acceleration issues\n  - New optional parameter `max_words_per_chunk` (range: 100-500 words, default: 250)\n  - Maintains consistent voice characteristics across all chunks using the same seed\n  - Seamlessly concatenates audio chunks for smooth, natural output\n\n### Version 1.2.3\n- Added SageAttention support for inference speedup\n  - New attention option \"sage\" using quantized attention (INT8\u002FFP8) for faster generation\n  - Requirements: NVIDIA GPU with CUDA and sageattention library installation\n\n### Version 1.2.2\n- Added 4-bit quantized model support\n  - New model in menu: `VibeVoice-Large-Quant-4Bit` using ~7GB VRAM instead of ~17GB\n  - Requirements: NVIDIA GPU with CUDA and bitsandbytes library installed\n\n### Version 1.2.1\n- Bug Fixing\n\n### Version 1.2.0\n- MPS Support for Apple Silicon:\n  - Added GPU acceleration support for Mac with Apple Silicon (M1\u002FM2\u002FM3)\n  - Automatically detects and uses MPS backend when available, providing significant performance improvements over CPU\n\n### Version 1.1.1\n- Universal Transformers Compatibility:\n  - Implemented adaptive system that automatically adjusts to different transformers versions\n  - Guaranteed compatibility from v4.51.3 onwards\n  - Auto-detects and adapts to API changes between versions\n\n### Version 1.1.0\n- Updated the URL for downloading the VibeVoice-Large model\n- Removed VibeVoice-Large-Preview deprecated model\n\n### Version 1.0.9\n- Embedded VibeVoice code directly into the wrapper\n  - Added vvembed folder containing the complete VibeVoice code (MIT licensed)\n  - No longer requires external VibeVoice installation\n  - Ensures continued 
functionality for all users\n\n### Version 1.0.8\n- BFloat16 Compatibility Fix\n  - Fixed tensor type compatibility issues with audio processing nodes\n  - Input audio tensors are now converted from BFloat16 to Float32 for numpy compatibility\n  - Output audio tensors are explicitly converted to Float32 to ensure compatibility with downstream nodes\n  - Resolves \"Got unsupported ScalarType BFloat16\" errors when using voice cloning or saving audio\n\n### Version 1.0.7\n- Added interruption handler to detect user's cancel request\n- Bug fixing\n\n### Version 1.0.6\n- Fixed a bug that prevented VibeVoice nodes from receiving audio directly from another VibeVoice node\n\n### Version 1.0.5\n- Added support for Microsoft's official VibeVoice-Large model (stable release)\n\n### Version 1.0.4\n- Improved tokenizer dependency handling\n\n### Version 1.0.3\n- Added `attention_type` parameter to both Single Speaker and Multi Speaker nodes for performance optimization\n  - auto (default): Automatic selection of best implementation\n  - eager: Standard implementation without optimizations\n  - sdpa: PyTorch's optimized Scaled Dot Product Attention\n  - flash_attention_2: Flash Attention 2 for maximum performance (requires compatible GPU)\n- Added `diffusion_steps` parameter to control generation quality vs speed trade-off\n  - Default: 20 (VibeVoice default)\n  - Higher values: Better quality, longer generation time\n  - Lower values: Faster generation, potentially lower quality\n\n### Version 1.0.2\n- Added `free_memory_after_generate` toggle to both Single Speaker and Multi Speaker nodes\n- New dedicated \"Free Memory Node\" for manual memory management in workflows\n- Improved VRAM\u002FRAM usage optimization\n- Enhanced stability for long generation sessions\n- Users can now choose between automatic or manual memory management\n\n### Version 1.0.1\n- Fixed issue with line breaks in speaker text (both single and multi-speaker nodes)\n- Line breaks within individual speaker 
text are now automatically removed before generation\n- Improved text formatting handling for all generation modes\n\n### Version 1.0.0\n- Initial release\n- Single speaker node with voice cloning\n- Multi-speaker node with automatic speaker detection\n- Text file loading from ComfyUI directories\n- Deterministic and sampling generation modes\n- Support for VibeVoice 1.5B and Large models","# VibeVoice ComfyUI 节点\n\n微软 VibeVoice 文本转语音模型的全面 ComfyUI 集成，可在您的 ComfyUI 工作流中直接实现高质量的单人及多人语音合成。\n\n## ✨ 功能\n\n### 核心功能\n- 🎤 **单人 TTS**：生成自然语音，可选语音克隆\n- 👥 **多人对话**：支持最多 4 位不同说话者\n- 🎯 **语音克隆**：从音频样本中克隆声音\n- 🎨 **LoRA 支持**：使用自定义 LoRA 适配器对声音进行微调（v1.4.0+）\n- 🎚️ **语速控制**：通过调整参考语音速度来改变语速（v1.5.0+）\n- 📝 **文本文件加载**：从文本文件加载脚本\n- 📚 **自动文本分块**：无缝处理长文本，支持自定义分块大小\n- ⏸️ **自定义暂停标签**：使用 `[pause]` 和 `[pause:ms]` 标签插入静音（封装功能）\n- 🔄 **节点串联**：连接多个 VibeVoice 节点以构建复杂工作流\n- ⏹️ **中断支持**：在生成前或生成过程中取消操作\n- 🔧 **灵活配置**：可控制温度、采样和引导尺度\n\n### 性能与优化\n- ⚡ **注意力机制**：可选择 auto、eager、sdpa、flash_attention_2 或 sage\n- 🎛️ **扩散步数**：可调节质量与速度之间的平衡（默认：20 步）\n- 💾 **内存管理**：可切换生成后自动清理显存\n- 🧹 **释放内存节点**：手动控制内存，适用于复杂工作流\n- 🍎 **Apple Silicon 支持**：通过 MPS 在 M1\u002FM2\u002FM3 Mac 上实现原生 GPU 加速\n- 🔢 **8 位量化**：在大幅降低显存占用的同时保持完美音质\n- 🔢 **4 位量化**：在几乎不损失音质的情况下实现最大显存节省\n\n### 兼容性与安装\n- 📦 **自包含**：内置 VibeVoice 代码，无外部依赖\n- 🔄 **通用兼容性**：自适应支持 transformers v4.51.3+\n- 🖥️ **跨平台**：适用于 Windows、Linux 和 macOS\n- 🎮 **多后端支持**：支持 CUDA、CPU 和 MPS（Apple Silicon）\n\n## 🎥 视频演示\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=fIBMepIBKhI\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FEnemyx-net_VibeVoice-ComfyUI_readme_459aa9afc473.jpg\" alt=\"VibeVoice ComfyUI 封装演示\" \u002F>\n  \u003C\u002Fa>\n  \u003Cbr>\n  \u003Cstrong>点击观看演示视频\u003C\u002Fstrong>\n\u003C\u002Fp>\n\n## 📦 安装\n\n### 自动安装（推荐）\n1. 将此仓库克隆到您的 ComfyUI 自定义节点文件夹中：\n```bash\ncd ComfyUI\u002Fcustom_nodes\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\n```\n\n2. 
重启 ComfyUI - 节点将在首次使用时自动安装所需依赖项\n\n## 📥 模型安装\n\n### 需要手动下载\n从版本 1.6.0 开始，模型和分词器必须手动下载并放置到正确文件夹中。封装不再自动下载它们。\n\n### 下载链接\n\n#### 模型\n您可以从 HuggingFace 下载 VibeVoice 模型：\n\n| 模型                  | 大小   | 下载链接 |\n|------------------------|--------|---------------|\n| **VibeVoice-1.5B**     | ~5.4GB | [microsoft\u002FVibeVoice-1.5B](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FVibeVoice-1.5B) |\n| **VibeVoice-Large**    | ~18.7GB | [aoi-ot\u002FVibeVoice-Large](https:\u002F\u002Fhuggingface.co\u002Faoi-ot\u002FVibeVoice-Large) |\n| **VibeVoice-Large-Q8** | ~11.6GB | [FabioSarracino\u002FVibeVoice-Large-Q8](https:\u002F\u002Fhuggingface.co\u002FFabioSarracino\u002FVibeVoice-Large-Q8) |\n| **VibeVoice-Large-Q4** | ~6.6GB | [DevParker\u002FVibeVoice7b-low-vram](https:\u002F\u002Fhuggingface.co\u002FDevParker\u002FVibeVoice7b-low-vram) |\n\n#### 分词器（必需）\nVibeVoice 使用 Qwen2.5-1.5B 分词器：\n- 下载地址：[Qwen2.5-1.5B 分词器](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-1.5B\u002Ftree\u002Fmain)\n- 必需文件：`tokenizer_config.json`、`vocab.json`、`merges.txt`、`tokenizer.json`\n\n### 安装步骤\n1. 如果 `models\u002Fvibevoice\u002F` 文件夹不存在，请创建它：\n   ```\n   ComfyUI\u002Fmodels\u002Fvibevoice\u002F\n   ```\n\n2. 下载并整理文件到 `vibevoice` 文件夹中：\n   ```\n   ComfyUI\u002Fmodels\u002Fvibevoice\u002F\n   ├── tokenizer\u002F                 # 将 Qwen 分词器文件放入此处\n   │   ├── tokenizer_config.json\n   │   ├── vocab.json\n   │   ├── merges.txt\n   │   └── tokenizer.json\n   ├── VibeVoice-1.5B\u002F           # 模型文件夹\n   │   ├── config.json\n   │   ├── model-00001-of-00003.safetensors\n   │   ├── model-00002-of-00003.safetensors\n   │   └── ... (其他模型文件)\n   ├── VibeVoice-Large\u002F\n   │   └── ... (模型文件)\n   └── my-custom-vibevoice\u002F      # 支持自定义名称\n       └── ... (模型文件)\n   ```\n\n3. 
对于使用 git-lfs 或 HF CLI 从 HuggingFace 下载的模型，也可以使用缓存结构：\n   ```\n   ComfyUI\u002Fmodels\u002Fvibevoice\u002F\n   └── models--microsoft--VibeVoice-1.5B\u002F\n       └── snapshots\u002F\n           └── [hash]\u002F\n               └── ... (模型文件)\n   ```\n\n4. 刷新浏览器 - 模型将出现在下拉菜单中\n\n### 注意事项\n- 下拉菜单将显示从文件夹名称中提取的友好名称\n- 同时支持普通文件夹和 HuggingFace 缓存结构\n- 每次刷新浏览器时都会重新扫描模型\n- 量化模型会根据其配置文件自动检测\n- 分词器的搜索优先级如下：\n  1. `ComfyUI\u002Fmodels\u002Fvibevoice\u002Ftokenizer\u002F`（推荐）\n  2. `ComfyUI\u002Fmodels\u002Fvibevoice\u002Fmodels--Qwen--Qwen2.5-1.5B\u002F`（如果之前安装过）\n  3. HuggingFace 缓存（如有）\n\n## 🔧 可用节点\n\n### 1. VibeVoice 从文件加载文本\n从 ComfyUI 的输入\u002F输出\u002F临时目录中加载文本内容。\n- **支持格式**：.txt\n- **输出**：用于 TTS 节点的文本字符串\n\n### 2. VibeVoice 单人声\n使用单一声音将文本转换为语音。\n- **文本输入**：直接输入文本或从“加载文本”节点连接\n- **模型**：从下拉菜单中选择可用模型\n- **声音克隆**：可选音频输入，用于声音克隆\n- **参数**（按顺序）：\n  - `text`：要转换为语音的输入文本\n  - `model`：从 `ComfyUI\u002Fmodels\u002Fvibevoice\u002F` 中找到的可用模型列表中选择\n  - `attention_type`：auto、eager、sdpa、flash_attention_2 或 sage（默认：auto）\n  - `quantize_llm`：对未量化的模型仅动态量化 LLM 组件。选项：“全精度”（默认）、“4bit”或“8bit”。4bit 可显著节省显存，且质量损失极小。8bit 在质量和内存占用之间提供了良好平衡。需要 CUDA GPU。对于已量化的模型则忽略此参数。\n  - `free_memory_after_generate`：生成后释放显存（默认：True）\n  - `diffusion_steps`：去噪步骤数（5–100，默认：20）\n  - `seed`：随机种子，用于结果可重复性（默认：42）\n  - `cfg_scale`：无分类器指导（1.0–2.0，默认：1.3）\n  - `use_sampling`：启用或禁用确定性生成（默认：False）\n- **可选参数**：\n  - `voice_to_clone`：用于声音克隆的音频输入\n  - `lora`：来自 VibeVoice LoRA 节点的 LoRA 配置\n  - `temperature`：采样温度（0.1–2.0，默认：0.95）\n  - `top_p`：核采样参数（0.1–1.0，默认：0.95）\n  - `max_words_per_chunk`：长文本每块的最大词数（100–500，默认：250）\n  - `voice_speed_factor`：语速调整（0.8–1.2，默认：1.0，步长：0.01）\n\n### 3. 
VibeVoice 多人声\n生成具有不同声音的多角色对话。\n- **说话人格式**：使用 `[N]:` 标记，其中 N 为 1–4\n- **声音分配**：可为每个说话人提供可选的声音样本\n- **推荐模型**：VibeVoice-Large，以获得更好的多人声效果\n- **参数**（按顺序）：\n  - `text`：包含说话人标签的输入文本\n  - `model`：从 `ComfyUI\u002Fmodels\u002Fvibevoice\u002F` 中找到的可用模型列表中选择\n  - `attention_type`：auto、eager、sdpa、flash_attention_2 或 sage（默认：auto）\n  - `quantize_llm`：对未量化的模型仅动态量化 LLM 组件。选项：“全精度”（默认）、“4bit”或“8bit”。4bit 可显著节省显存，且质量损失极小。8bit 在质量和内存占用之间提供了良好平衡。需要 CUDA GPU。对于已量化的模型则忽略此参数。\n  - `free_memory_after_generate`：生成后释放显存（默认：True）\n  - `diffusion_steps`：去噪步骤数（5–100，默认：20）\n  - `seed`：随机种子，用于结果可重复性（默认：42）\n  - `cfg_scale`：无分类器指导（1.0–2.0，默认：1.3）\n  - `use_sampling`：启用或禁用确定性生成（默认：False）\n- **可选参数**：\n  - `speaker1_voice` 至 `speaker4_voice`：用于声音克隆的音频输入\n  - `lora`：来自 VibeVoice LoRA 节点的 LoRA 配置\n  - `temperature`：采样温度（0.1–2.0，默认：0.95）\n  - `top_p`：核采样参数（0.1–1.0，默认：0.95）\n  - `voice_speed_factor`：所有说话人的语速调整（0.8–1.2，默认：1.0，步长：0.01）\n\n### 4. VibeVoice 释放内存\n手动从内存中释放所有已加载的 VibeVoice 模型。\n- **输入**：`audio` — 连接音频输出以触发内存清理\n- **输出**：`audio` — 不改变地传递输入音频\n- **使用场景**：插入到节点之间，以便在工作流的特定点释放显存\u002F内存\n- **示例**：[VibeVoice 节点] → [释放内存] → [保存音频]\n\n### 5. 
VibeVoice LoRA\n配置并加载自定义 LoRA 适配器，用于微调 VibeVoice 模型。\n- **LoRA 选择**：包含可用 LoRA 适配器的下拉菜单\n- **LoRA 位置**：将您的 LoRA 文件夹放置在 `ComfyUI\u002Fmodels\u002Fvibevoice\u002Floras\u002F`\n- **参数**：\n  - `lora_name`：从可用 LoRA 适配器中选择，或选择“None”以禁用\n  - `llm_strength`：语言模型 LoRA 的强度（0.0–2.0，默认：1.0）\n  - `use_llm`：应用语言模型 LoRA 组件（默认：True）\n  - `use_diffusion_head`：应用扩散头替换（默认：True）\n  - `use_acoustic_connector`：应用声学连接器 LoRA（默认：True）\n  - `use_semantic_connector`：应用语义连接器 LoRA（默认：True）\n- **输出**：`lora` — LoRA 配置，用于连接到说话人节点\n- **用法**：[VibeVoice LoRA] → [单人声\u002F多人声节点]\n\n## 💬 多人声文本格式\n\n对于多人声生成，请使用 `[N]:` 格式来组织文本：\n\n```\n[1]: 你好，今天过得怎么样？\n[2]: 我很好，谢谢你的关心！\n[1]: 听到这个真好。\n[3]: 大家好，我可以加入你们的对话吗？\n[2]: 当然可以，欢迎！\n```\n\n**重要提示**：\n- 使用 `[1]:`、`[2]:`、`[3]:`、`[4]:` 来标记说话人\n- 最多支持 4 个说话人\n- 系统会自动从您的文本中检测说话人的数量\n- 每个说话人可以选择性地提供声音样本进行克隆\n\n## 🧠 模型信息\n\n### VibeVoice-1.5B\n- **大小**：约 5.4GB 下载\n- **显存**：约 6GB\n- **速度**：推理更快\n- **质量**：适合单人声\n- **使用场景**：快速原型设计、单人语音应用\n\n### VibeVoice-Large\n- **大小**：约 18.7GB 下载\n- **显存**：约 20GB\n- **速度**：推理较慢但经过优化\n- **质量**：现有最佳质量（全精度）\n- **使用场景**：最高质量制作、多人声对话\n- **备注**：微软最新官方发布版本\n\n### VibeVoice-Large-Q8\n- **大小**：约 11.6GB 下载（比完整模型减少 38%）\n- **显存**：约 12GB（比全精度减少 40%）\n- **速度**：推理速度均衡\n- **质量**：与全精度完全相同——完美保留音频质量\n- **使用场景**：适用于配备 12GB 显存的 GPU（如 RTX 3060、4070 Ti 等），实现制作级音频\n- **量化**：仅对 LLM 进行 8bit 量化，音频组件保持全精度\n- **备注**：由 Fabio Sarracino 量化\n\n### VibeVoice-Large-Q4\n- **大小**：约 6.6GB 下载\n- **显存**：约 8GB\n- **速度**：推理速度均衡\n- **质量**：质量良好，损失极小\n- **使用场景**：为低端 GPU 提供最大显存节省\n- **备注**：由 DevParker 量化\n\n自 1.6.0 版本起，模型需手动下载并放置在 `ComfyUI\u002Fmodels\u002Fvibevoice\u002F` 中（详见“模型安装”一节）。\n\n## ⚙️ 生成模式\n\n### 确定性模式（默认）\n- `use_sampling = False`\n- 产生一致、稳定的输出\n- 推荐用于生产环境\n\n### 采样模式\n- `use_sampling = True`\n- 输出更具变化性\n- 使用 temperature 和 top_p 参数\n- 适合创意探索\n\n## 🎯 语音克隆\n\n要克隆一种声音：\n1. 将音频节点连接到 `voice_to_clone` 输入端（单人说话）\n2. 或者连接到 `speaker1_voice`、`speaker2_voice` 等（多人说话）\n3. 
模型会尝试匹配该声音的特征\n\n**语音样本要求：**\n- 清晰的音频，背景噪音尽量少\n- 最短3–10秒。建议至少30秒以获得更好的质量\n- 自动重采样至24kHz\n\n## 🎨 LoRA 支持\n\n### 概述\n从1.4.0版本开始，VibeVoice ComfyUI支持自定义的LoRA（低秩适应）适配器，用于微调语音特征。这使您能够在保持基础VibeVoice功能的同时，训练和使用专门的语音模型。\n\n### 设置LoRA适配器\n\n1. **LoRA目录结构**：\n   将您的LoRA适配器文件夹放置在：`ComfyUI\u002Fmodels\u002Fvibevoice\u002Floras\u002F`\n   ```\n   ComfyUI\u002F\n   └── models\u002F\n       └── vibevoice\u002F\n           └── loras\u002F\n               ├── my_custom_voice\u002F\n               │   ├── adapter_config.json\n               │   ├── adapter_model.safetensors\n               │   └── diffusion_head\u002F  (可选)\n               ├── character_voice\u002F\n               └── style_adaptation\u002F\n   ```\n\n2. **所需文件**：\n   - `adapter_config.json`：LoRA配置文件\n   - `adapter_model.safetensors` 或 `adapter_model.bin`：模型权重\n   - 可选组件：\n     - `diffusion_head\u002F`：自定义扩散头权重\n     - `acoustic_connector\u002F`：声学连接器适配\n     - `semantic_connector\u002F`：语义连接器适配\n\n### 在ComfyUI中使用LoRA\n\n1. **添加VibeVoice LoRA节点**：\n   - 在工作流中创建一个“VibeVoice LoRA”节点\n   - 从下拉菜单中选择您的LoRA\n   - 配置组件设置和强度\n\n2. **连接到说话人节点**：\n   - 将LoRA节点的输出连接到说话人节点的 `lora` 输入端\n   - 单人说话和多人说话节点都支持LoRA\n\n3. 
**LoRA参数**：\n   - **llm_strength**：控制语言模型LoRA的影响程度（0.0-2.0）\n   - **组件开关**：启用或禁用特定的LoRA组件\n   - 选择“无”以禁用LoRA并使用基础模型\n\n### 训练您自己的LoRA\n\n要为VibeVoice创建自定义LoRA适配器，请使用官方微调仓库：\n- **仓库**：[VibeVoice微调](https:\u002F\u002Fgithub.com\u002Fvoicepowered-ai\u002FVibeVoice-finetuning)\n- **特性**：\n  - 参数高效的微调\n  - 支持自定义数据集\n  - 可调节的LoRA等级和缩放\n  - 可选的扩散头适配\n\n### 最佳实践\n\n- **语音一致性**：对于长文本，在所有分块中使用相同的LoRA\n- **内存管理**：LoRA只会增加少量内存开销（约100-500MB）\n- **兼容性**：LoRA适配器与所有VibeVoice模型变体兼容\n- **强度调整**：从默认强度（1.0）开始，根据结果进行调整\n\n### 兼容性说明\n\n⚠️ **Transformers版本**：LoRA实现是在 `transformers==4.51.3` 版本上开发和测试的。虽然我们的封装支持 `transformers>=4.51.3`，但不能保证在更新版本的transformers中LoRA功能正常工作。如果您遇到LoRA加载问题，建议专门使用 `transformers==4.51.3`：\n```bash\npip install transformers==4.51.3\n```\n\n### 🙏 致谢\n\nLoRA实现由[@jpgallegoar](https:\u002F\u002Fgithub.com\u002Fjpgallegoar)完成（PR #127）\n\n## 🎚️ 语音速度控制\n\n### 概述\n语音速度控制功能允许您通过调整参考语音的速度来影响生成语音的语速。此功能会在处理前对输入的参考语音样本进行时间拉伸，从而使模型学习并重现改变后的语速。\n\n**自1.5.0版本起可用**\n\n### 工作原理\n系统会对参考语音音频应用时间拉伸：\n- 值小于1.0会减慢参考语音，生成较慢的语音\n- 值大于1.0会加快参考语音，生成较快的语音\n- 模型会从修改后的语音特征中学习，并以相似的速度生成语音\n\n### 使用方法\n- **参数**：`voice_speed_factor`\n- **范围**：0.8到1.2\n- **默认值**：1.0（正常速度）\n- **步长**：0.01（1%增量）\n\n### 推荐设置\n- **最佳范围**：0.95至1.05，以获得自然的声音效果\n- **较慢的语音**：尝试0.95（慢5%）或0.97（慢3%）\n- **较快的语音**：尝试1.03（快3%）或1.05（快5%）\n- **最佳效果**：提供至少20秒的参考音频，以便更准确地匹配语速\n\n### 注意事项\n- 效果在较长的参考音频样本上表现更好（建议20秒以上）\n- 极端值（小于0.9或大于1.1）可能会产生不自然的语音\n- 在多说话人模式下，速度调整会平等地应用于所有说话人\n- 合成语音（未提供音频时）不受此参数影响\n\n### 📖 示例\n```\n# 单人说话\nvoice_speed_factor: 0.95  # 稍微慢一点，语气更沉稳\nvoice_speed_factor: 1.05  # 稍微快一点，语气更活泼\n\n# 多人说话\nvoice_speed_factor: 0.98  # 所有说话人语速慢2%\nvoice_speed_factor: 1.02  # 所有说话人语速快2%\n```\n\n## ⏸️ 暂停标签支持\n\n### 概述\nVibeVoice封装包含一个自定义的暂停标签功能，允许您在文本段落之间插入静音。**这不是微软VibeVoice的标准功能**——而是我们封装的原创实现，旨在提供更多对语音节奏的控制。\n\n**自1.3.0版本起可用**\n\n### 使用方法\n您可以在文本中使用两种类型的暂停标签：\n- `[pause]`：插入1秒的静音（默认）\n- `[pause:ms]`：插入自定义持续时间的静音，单位为毫秒（例如，`[pause:2000]`表示2秒）\n\n### 📖 示例\n\n#### 单人说话\n```\n欢迎来到我们的演讲。[pause] 今天我们将探讨人工智能。[pause:500] 
让我们开始吧！\n```\n\n#### 多人说话\n```\n[1]: 大家好 [pause] 今天过得怎么样？\n[2]: 我很好！[pause:500] 谢谢你的关心。\n[1]: 很高兴听到！\n```\n\n### 注意事项\n\n⚠️ **上下文限制警告**：\n> **注意：暂停会强制将文本分割成多个分块。这可能会降低模型理解上下文的能力。模型的上下文仅限于其当前分块。**\n\n这意味着：\n- 暂停前后的文本会被分别处理\n- 模型在生成语音时无法跨越暂停边界查看上下文\n- 这可能会影响韵律和语调的一致性\n\n### 工作原理\n1. 封装会解析您的文本以查找暂停标签\n2. 暂停之间的文本段落会被独立处理\n3. 会为每个暂停时长生成静音音频\n4. 所有音频片段（语音和静音）会被拼接在一起\n\n### 最佳实践\n- 在自然的断句处使用暂停（如句子或段落结束）\n- 避免在需要保持上下文连贯的短语中间使用暂停\n- 测试不同的暂停时长，找到最自然的效果\n\n## 💡 取得最佳效果的提示\n\n1. **文本准备**：\n   - 使用适当的标点符号以产生自然停顿\n   - 将长文本拆分为多个段落\n   - 对于多说话人场景，确保清晰的说话人切换标记\n   - 适度使用停顿标签，以保持上下文连贯性\n\n2. **模型选择**：\n   - 对于快速的单说话人任务，使用1.5B参数量的模型（速度最快，约需6GB显存）\n   - 若追求极致音质，可选用Large版本（约需20GB显存）\n   - 如需在12GB显存下获得优质音频，推荐使用Large-Q8版本（音质完美，模型体积缩小38%）\n   - 若要最大限度节省显存，可选择Large-Quant-4Bit版本（约需7GB显存）\n\n3. **种子管理**：\n   - 默认种子（42）通常适用于大多数情况\n   - 建议保存表现良好的种子，以便为同一角色保持一致的音色\n   - 若默认种子效果不佳，可尝试随机生成种子\n\n4. **性能优化**：\n   - 自 1.6.0 版本起，模型需提前手动下载（约 5.4–18.7GB，详见“模型安装”一节）\n   - 已放入 `ComfyUI\u002Fmodels\u002Fvibevoice\u002F` 的模型会被直接加载，无需联网\n   - 推荐使用GPU以加快推理速度\n\n## 💻 系统要求\n\n### 硬件\n- **最低配置**：6GB显存，适用于VibeVoice-1.5B\n- **推荐配置**：20GB及以上显存，适用于VibeVoice-Large（全精度）\n- **内存**：16GB及以上系统内存\n\n### 软件\n- Python 3.8+\n- PyTorch 2.0+\n- CUDA 11.8+（用于GPU加速）\n- Transformers 4.51.3+\n- ComfyUI（最新版本）\n\n## 🔧 故障排除\n\n### 安装问题\n- 确保使用ComfyUI提供的Python环境\n- 自动安装失败时，可尝试手动安装\n- 安装完成后请重启ComfyUI\n\n### 生成问题\n- 若语音不稳定，可尝试启用确定性模式\n- 多说话人场景中，请确保文本格式正确，如“[N]:”\n- 检查说话人编号是否连续（例如1,2,3，而非1,3,5）\n\n### 显存问题\n- Large模型（全精度）需要约20GB显存\n- 在显存较低的设备上，建议使用1.5B模型或量化版（Q8\u002FQ4）\n- 模型采用bfloat16精度以提升效率\n\n## 📖 示例\n\n### 单说话人\n```\n文本: “欢迎来到我们的演示。今天我们将探索令人着迷的人工智能世界。”\n模型: [从可用模型中选择]\ncfg_scale: 1.3\nuse_sampling: False\n```\n\n### 两说话人\n```\n[1]: 你看到最新的AI进展了吗？\n[2]: 是的，它们非常令人印象深刻！\n[1]: 我觉得语音合成已经取得了长足的进步。\n[2]: 绝对如此，现在听起来非常自然。\n```\n\n### 四说话人对话\n```\n[1]: 各位，欢迎大家参加我们的会议。\n[2]: 感谢邀请！\n[3]: 很高兴能来。\n[4]: 期待今天的讨论。\n[1]: 那么我们开始议程吧。\n```\n\n## 📊 性能基准\n\n| 模型              | 显存占用 | 上下文长度 | 最大音频时长 |\n|--------------------|----------|------------|-------------|\n| VibeVoice-1.5B     | ~6GB     | 64K tokens | ~90分钟    |\n| 
VibeVoice-Large    | ~20GB    | 32K tokens | ~45分钟    |\n| VibeVoice-Large-Q8 | ~12GB    | 32K tokens | ~45分钟    |\n| VibeVoice-Large-Q4 | ~8GB     | 32K tokens | ~45分钟    |\n\n## ⚠️ 已知限制\n\n- 多说话人模式最多支持4个说话人\n- 最佳效果适用于英文和中文文本\n- 部分种子可能导致输出不稳定\n- 无法直接控制背景音乐的生成\n\n## 📄 许可证\n\n本ComfyUI插件基于MIT许可证发布。详细信息请参阅LICENSE文件。\n\n**注意**：VibeVoice模型本身受微软许可条款约束：\n- VibeVoice仅用于研究目的\n- 请查阅微软VibeVoice仓库以获取完整的模型许可详情\n\n## 🔗 链接\n\n- [原始VibeVoice仓库](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FVibeVoice) - 官方微软VibeVoice仓库（目前暂不可用）\n\n## 🙏 致谢\n\n- **VibeVoice模型**：微软研究院\n- **ComfyUI集成**：Fabio Sarracino\n- **基础模型**：基于Qwen2.5架构构建\n\n## 💬 支持\n\n如遇问题或疑问：\n1. 请查看故障排除部分\n2. 检查ComfyUI日志以获取错误信息\n3. 确保VibeVoice已正确安装\n4. 提交包含详细错误信息的问题\n\n## 🤝 贡献\n\n欢迎贡献代码！请：\n1. 全面测试修改内容\n2. 遵循现有代码风格\n3. 根据需要更新文档\n4. 提交带有清晰描述的拉取请求\n\n## 📝 更改记录\n\n### 版本1.8.1\n- 强制安装bitsandbytes>=0.48.1库，因为0.48.0版本存在严重bug，导致Q8模型无法正常工作。\n- Bug修复\n\n### 版本1.8.0\n- **全新官方8位量化模型**：VibeVoice-Large-Q8\n  - 发布于HuggingFace：[FabioSarracino\u002FVibeVoice-Large-Q8](https:\u002F\u002Fhuggingface.co\u002FFabioSarracino\u002FVibeVoice-Large-Q8)\n  - 模型大小：11.6GB（相比18.7GB的全精度版本缩小38%）\n  - 显存占用：~12GB（相比~20GB减少40%）\n  - **音质完美**：与全精度模型无差异，未出现任何质量损失\n  - **选择性量化策略**：保留扩散头、VAE及连接器等音频关键组件的全精度\n  - 专为12GB显存的GPU（如RTX 3060、4070 Ti等）设计\n  - 通过精心选择量化组件，有效解决了常见8位量化带来的“噪声”问题\n- **新增8位动态LLM量化**\n  - Single和Multiple Speaker节点新增“8bit”选项\n  - 选项包括：“full precision”（默认）、“4bit”、“8bit”\n  - 对非量化模型仅动态量化LLM部分\n  - 扩散头、声学\u002F语义连接器以及分词器等音频关键组件均保持全精度\n  - 在音质与显存节省之间取得良好平衡\n  - 需要CUDA GPU和bitsandbytes库支持\n  - 对于已量化模型则自动忽略\n\n### 版本1.7.0\n- 新增针对非量化模型的动态LLM-only 4位量化功能\n  - Single和Multiple Speaker节点新增“quantize_llm”参数\n  - 选项为“full precision”（默认）或“4bit”\n  - 仅量化语言模型部分，而扩散头仍保持全精度\n  - 生成速度显著提升，同时大幅节省显存\n  - 相比全精度，音质损失极小\n  - 需要CUDA GPU进行量化\n  - 对于已量化模型则自动忽略\n  - 使用NF4（4-bit NormalFloat）量化类型，专为神经网络优化\n\n### 版本1.6.3\n- 修复了分词器初始化错误\n  - 解决了加载处理器时出现的`TypeError: expected str, bytes or os.PathLike object, not NoneType`问题\n  - 添加了健壮的分词器文件路径解析回退机制\n  - 
改进了vocab.json和merges.txt文件的加载处理\n  - 加强了分词器初始化过程中异常情况的处理能力\n\n### 版本1.6.2\n- 修复了分词器加载问题，该问题源于HuggingFace缓存可能干扰本地文件\n- 分词器现直接从指定路径加载，避免缓存冲突\n- 增加了明确的文件路径加载方式，以提高可靠性\n- 改进了日志记录，以便显示实际使用的分词器文件\n\n### 版本1.6.1\n- 通过移除不必要的HuggingFace设置，改进了集成工作\n\n### 版本 1.6.0\n- **重大变更**：移除了从 HuggingFace 自动下载模型的功能\n  - 现在必须手动下载模型并放置到 `ComfyUI\u002Fmodels\u002Fvibevoice\u002F` 目录下\n  - 动态模型下拉菜单，每次刷新浏览器时都会扫描可用的模型\n  - 支持自定义文件夹名称和 HuggingFace 缓存结构\n  - 能够自动从配置文件中检测量化模型\n  - 提高了用户对模型管理的控制能力\n  - 消除了与私有 HuggingFace 仓库相关的认证问题\n- **改进的日志系统**：\n  - 优化了日志记录，减少了控制台的混乱输出\n  - 输出更加清晰，提升了用户体验\n\n### 版本 1.5.0\n- 新增语音速度控制功能，用于调整语速\n  - 在单人说话者节点和多人说话者节点中新增 `voice_speed_factor` 参数\n  - 对参考音频进行时间拉伸以影响输出语速\n  - 取值范围为 0.8 到 1.2，步长为 0.01\n  - 推荐取值范围为 0.95 到 1.05，以获得自然效果\n  - 使用 20 秒以上的参考音频效果最佳\n\n### 版本 1.4.3\n- 改进了 LoRA 系统，增强了日志记录和兼容性检查\n  - 添加了模型兼容性检测，防止加载不匹配的 LoRA\n  - 增强了 LoRA 组件加载过程的调试日志\n  - 自动检测不兼容的模型-LoRA 组合，并给出明确的错误信息\n  - 防止在使用量化模型时加载标准 LoRA 出现错误\n  - 对 LoRA 权重加载过程进行了小幅优化\n\n### 版本 1.4.2\n- 修复了若干 bug\n\n### 版本 1.4.1\n- 修复了加载本地缓存模型时出现的 HuggingFace 认证错误\n  - 解决了已下载模型的 401 授权错误问题\n  - 节点现在能够正确使用本地模型快照，无需 HuggingFace API 认证\n  - 防止在 `ComfyUI\u002Fmodels\u002Fvibevoice\u002F` 中存在模型时进行不必要的 API 调用\n\n### 版本 1.4.0\n- 新增 LoRA（低秩适应）支持，用于微调模型\n  - 新增“VibeVoice LoRA”节点，用于配置自定义语音适配\n  - 支持语言模型、扩散头和连接器的适配\n  - 下拉菜单可方便地从 `ComfyUI\u002Fmodels\u002Fvibevoice\u002Floras\u002F` 中选择 LoRA\n  - 可调节 LoRA 强度及各组件开关\n  - 兼容单人说话者节点和多人说话者节点\n  - 内存开销极小（每份 LoRA 约 100-500MB）\n  - 致谢：由 [@jpgallegoar](https:\u002F\u002Fgithub.com\u002Fjpgallegoar) 实现\n\n### 版本 1.3.0\n- 新增自定义暂停标签支持，用于控制语音节奏\n  - 新增 `[pause]` 标签，表示 1 秒钟的静音（默认）\n  - 新增 `[pause:ms]` 标签，用于指定自定义持续时间（单位为毫秒），例如 `[pause:2000]` 表示 2 秒钟\n  - 适用于单人说话者节点和多人说话者节点\n  - 自动在暂停点分割文本，同时保持语音一致性\n  - 注意：此功能为封装特性，并非 Microsoft VibeVoice 的一部分\n\n### 版本 1.2.5\n- 修复了若干 bug\n\n### 版本 1.2.4\n- 在单人说话者节点中新增了针对长文本的自动分块功能\n  - 单人说话者节点现在会自动将超过 250 字的文本拆分为多个块，以避免音频加速问题\n  - 新增可选参数 `max_words_per_chunk`（取值范围为 100-500 字，默认为 250）\n  - 使用相同的种子确保所有分块的语音特征一致\n  - 无缝拼接音频分块，生成流畅自然的输出\n\n### 版本 1.2.3\n- 
新增 SageAttention 支持，用于加速推理\n  - 新增“sage”注意力选项，采用量化注意力（INT8\u002FFP8）以加快生成速度\n  - 要求：配备 CUDA 的 NVIDIA 显卡，并安装 sageattention 库\n\n### 版本 1.2.2\n- 新增 4 位量化模型支持\n  - 菜单中新增 `VibeVoice-Large-Quant-4Bit` 模型，仅需约 7GB 显存，而此前需要约 17GB\n  - 要求：配备 CUDA 的 NVIDIA 显卡，并安装 bitsandbytes 库\n\n### 版本 1.2.1\n- 修复了若干 bug\n\n### 版本 1.2.0\n- MPS 支持 Apple Silicon：\n  - 新增对搭载 Apple Silicon（M1\u002FM2\u002FM3）的 Mac 设备的 GPU 加速支持\n  - 自动检测并使用 MPS 后端（如可用），相比 CPU 可显著提升性能\n\n### 版本 1.1.1\n- 通用 Transformer 兼容性：\n  - 实现了自适应系统，可自动调整以适应不同版本的 Transformer\n  - 保证从 v4.51.3 版本起的兼容性\n  - 自动检测并适应不同版本之间的 API 变化\n\n### 版本 1.1.0\n- 更新了下载 VibeVoice-Large 模型的 URL\n- 移除了已弃用的 VibeVoice-Large-Preview 模型\n\n### 版本 1.0.9\n- 将 VibeVoice 代码直接嵌入到封装器中\n  - 新增 vvembed 文件夹，包含完整的 VibeVoice 代码（MIT 许可证）\n  - 不再需要外部安装 VibeVoice\n  - 确保所有用户都能继续正常使用\n\n### 版本 1.0.8\n- 修复了 BFloat16 兼容性问题\n  - 解决了与音频处理节点相关的张量类型兼容性问题\n  - 输入音频张量现在会被转换为 Float32，以实现与 numpy 的兼容性\n  - 输出音频张量则被显式转换为 Float32，以确保与下游节点的兼容性\n  - 解决了在使用语音克隆或保存音频时出现的“Got unsupported ScalarType BFloat16”错误\n\n### 版本 1.0.7\n- 新增中断处理器，用于检测用户的取消请求\n- 修复了若干 bug\n\n### 版本 1.0.6\n- 修复了一个导致 VibeVoice 节点无法直接接收来自其他 VibeVoice 节点音频的 bug\n\n### 版本 1.0.5\n- 新增对 Microsoft 官方 VibeVoice-Large 模型（稳定版）的支持\n\n### 版本 1.0.4\n- 改进了分词器依赖项的处理\n\n### 版本 1.0.3\n- 在单人说话者节点和多人说话者节点中新增了 `attention_type` 参数，用于优化性能\n  - auto（默认）：自动选择最佳实现\n  - eager：无优化的标准实现\n  - sdpa：PyTorch 优化的缩放点积注意力\n  - flash_attention_2：Flash Attention 2，用于获得最高性能（需要兼容的 GPU）\n- 新增 `diffusion_steps` 参数，用于控制生成质量与速度的权衡\n  - 默认值为 20（VibeVoice 默认值）\n  - 值越高：质量越好，但生成时间越长\n  - 值越低：生成速度更快，但可能质量稍差\n\n### 版本 1.0.2\n- 在单人说话者节点和多人说话者节点中新增了 `free_memory_after_generate` 开关\n  - 新增专用“释放内存节点”，用于工作流中的手动内存管理\n  - 改进了 VRAM\u002FRAM 使用优化\n  - 提升了长时间生成任务的稳定性\n  - 用户现在可以选择自动或手动内存管理方式\n\n### 版本 1.0.1\n- 修复了说话者文本中的换行符问题（无论是单人还是多人说话者节点）\n  - 现在会在生成前自动移除单个说话者文本中的换行符\n  - 改善了所有生成模式下的文本格式处理\n\n### 版本 1.0.0\n- 初始发布\n  - 单人说话者节点，具备语音克隆功能\n  - 多人说话者节点，具备自动说话者检测功能\n  - 支持从 ComfyUI 目录加载文本文件\n  - 提供确定性和采样两种生成模式\n  - 支持 VibeVoice 1.5B 和 Large 模型","# VibeVoice-ComfyUI 
快速上手指南\n\nVibeVoice-ComfyUI 是微软 VibeVoice 文本转语音（TTS）模型在 ComfyUI 中的集成节点，支持高质量单人及多人语音合成、声音克隆及 LoRA 微调。\n\n## 1. 环境准备\n\n### 系统要求\n- **操作系统**：Windows, Linux, macOS (包括 Apple Silicon M1\u002FM2\u002FM3)\n- **显卡 (GPU)**：\n  - NVIDIA GPU (推荐 CUDA 后端)\n  - Apple Silicon (支持 MPS 加速)\n  - 或仅使用 CPU (速度较慢)\n- **显存 (VRAM) 建议**：\n  - **VibeVoice-1.5B**: 最低 6GB\n  - **VibeVoice-Large (全精度)**: 约 20GB\n  - **VibeVoice-Large-Q8 (8-bit)**: 约 12GB (推荐 RTX 3060\u002F4070 Ti 等)\n  - **VibeVoice-Large-Q4 (4-bit)**: 约 8GB (低显存方案)\n\n### 前置依赖\n- 已安装 **ComfyUI** 及其管理器 (ComfyUI Manager)。\n- Python 环境需兼容 `transformers >= 4.51.3` (插件首次运行时会自动安装所需依赖)。\n\n---\n\n## 2. 安装步骤\n\n### 第一步：安装插件节点\n打开终端，进入 ComfyUI 的自定义节点目录并克隆仓库：\n\n```bash\ncd ComfyUI\u002Fcustom_nodes\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\n```\n\n重启 ComfyUI。插件会在首次使用时自动安装必要的 Python 依赖包。\n\n### 第二步：下载模型与分词器 (必需)\n**注意**：从 v1.6.0 开始，模型和分词器不再自动下载，需手动放置到指定目录。\n\n1. **创建目录结构**：\n   在 `ComfyUI\u002Fmodels\u002F` 下创建 `vibevoice` 文件夹：\n   ```\n   ComfyUI\u002Fmodels\u002Fvibevoice\u002F\n   ├── tokenizer\u002F          # 存放分词器文件\n   └── [模型名称文件夹]\u002F    # 存放模型文件\n   ```\n\n2. **下载分词器 (Tokenizer)**：\n   VibeVoice 依赖 `Qwen2.5-1.5B` 的分词器。\n   - 来源：[HuggingFace - Qwen\u002FQwen2.5-1.5B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-1.5B\u002Ftree\u002Fmain)\n   - 需下载文件：`tokenizer_config.json`, `vocab.json`, `merges.txt`, `tokenizer.json`\n   - 将上述文件放入 `ComfyUI\u002Fmodels\u002Fvibevoice\u002Ftokenizer\u002F` 目录。\n\n3. 
**下载主模型**：\n   根据显存选择以下任一模型下载，并将文件放入 `ComfyUI\u002Fmodels\u002Fvibevoice\u002F` 下的独立文件夹中（文件夹名可自定义，如 `VibeVoice-1.5B`）：\n\n   | 模型版本 | 大小 | 适用场景 | 下载链接 |\n   | :--- | :--- | :--- | :--- |\n   | **VibeVoice-1.5B** | ~5.4GB | 快速测试、单人语音 | [microsoft\u002FVibeVoice-1.5B](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FVibeVoice-1.5B) |\n   | **VibeVoice-Large-Q8** | ~11.6GB | **推荐**：高质量且省显存 | [FabioSarracino\u002FVibeVoice-Large-Q8](https:\u002F\u002Fhuggingface.co\u002FFabioSarracino\u002FVibeVoice-Large-Q8) |\n   | **VibeVoice-Large-Q4** | ~6.6GB | 极低显存方案 | [DevParker\u002FVibeVoice7b-low-vram](https:\u002F\u002Fhuggingface.co\u002FDevParker\u002FVibeVoice7b-low-vram) |\n   | **VibeVoice-Large** | ~18.7GB | 最高质量 (需大显存) | [aoi-ot\u002FVibeVoice-Large](https:\u002F\u002Fhuggingface.co\u002Faoi-ot\u002FVibeVoice-Large) |\n\n   *国内用户若访问 HuggingFace 困难，可使用镜像站 (如 `hf-mirror.com`) 或 Git LFS 加速工具下载。*\n\n4. **刷新界面**：\n   点击 ComfyUI 界面上的 \"Refresh\" 按钮，模型将出现在节点的下拉菜单中。\n\n---\n\n## 3. 基本使用\n\n### 场景一：单人语音合成 (Single Speaker)\n\n这是最基础的用法，将文本转换为语音，可选声音克隆。\n\n1. **添加节点**：\n   - 双击搜索并添加 `VibeVoice Single Speaker` 节点。\n   - 添加 `Save Audio` 节点用于保存结果。\n\n2. **配置参数**：\n   - **text**: 输入要合成的文本（例如：\"你好，欢迎使用 VibeVoice。\"）。\n   - **model**: 选择已下载的模型（如 `VibeVoice-Large-Q8`）。\n   - **voice_to_clone** (可选): 连接一个 `Load Audio` 节点，上传参考音频以实现声音克隆。\n   - **diffusion_steps**: 去噪步数，默认 20 (越高音质越好但越慢)。\n   - **quantize_llm**: 若使用非量化模型且显存紧张，可设为 `8bit` 或 `4bit`。\n\n3. **连接与运行**：\n   ```\n   [VibeVoice Single Speaker] (AUDIO 输出) --> [Save Audio] (AUDIO 输入)\n   ```\n   点击 \"Queue Prompt\" 生成音频。\n\n### 场景二：多人对话合成 (Multi-Speaker)\n\n支持最多 4 人对话，需按特定格式编写脚本。\n\n1. **添加节点**：\n   - 添加 `VibeVoice Multiple Speakers` 节点。\n\n2. **编写脚本**：\n   在 `text` 输入框中使用 `[N]:` 标记说话人（N 为 1-4）：\n   ```text\n   [1]: 你好，今天天气不错！\n   [2]: 是啊，适合出去走走。\n   [1]: 那我们一起去公园吧？\n   ```\n\n3. **声音克隆 (可选)**：\n   - 将不同的参考音频分别连接到 `speaker1_voice`, `speaker2_voice` 等输入端，为每个角色赋予不同音色。\n\n4. 
**运行**：\n   连接 `Save Audio` 节点后执行，系统将自动识别说话人并生成连贯对话。\n\n### 进阶技巧\n- **长文本处理**：节点支持自动分块，无需手动切割长文章。\n- **插入停顿**：在文本中使用 `[pause]` 或 `[pause:1000]` (毫秒) 插入静音。\n- **显存优化**：在工作流中插入 `VibeVoice Free Memory` 节点，可在生成间隙手动释放显存，防止 OOM。","一位独立游戏开发者正在为一款多角色互动的视觉小说制作动态配音，需要快速生成不同性格角色的对话音频并整合到工作流中。\n\n### 没有 VibeVoice-ComfyUI 时\n- **工作流割裂严重**：必须在外部软件单独运行 TTS 模型生成音频，再手动导入 ComfyUI 与画面合成，反复切换窗口导致效率极低。\n- **多角色对话难以协调**：无法在一个流程中自然处理多个角色的轮替对话，需分别生成每个角色的音频片段再后期拼接，耗时且容易出错。\n- **长文本处理繁琐**：面对大段剧本，缺乏自动分块功能，手动切割文本并调整停顿不仅枯燥，还常因忘记插入静音标签导致语速过快或呼吸感缺失。\n- **显存管理困难**：在生成高清画面的同时运行大型语音模型，常因显存不足导致崩溃，缺乏灵活的量化选项和手动清理机制来平衡资源。\n\n### 使用 VibeVoice-ComfyUI 后\n- **全流程无缝集成**：直接在 ComfyUI 节点图中调用微软 VibeVoice 模型，实现从文本输入到音频输出的全自动化，无需离开界面即可完成配音合成。\n- **原生支持多人会话**：利用多说话人功能，单个节点链即可生成包含 4 个不同角色的自然对话，自动处理角色切换与语气差异，极大简化了剧情音频制作。\n- **智能文本与停顿控制**：通过加载脚本文件自动分块处理长文本，并使用 `[pause]` 标签精准控制对话节奏，轻松营造出真实的交流呼吸感。\n- **高效显存优化**：借助 4-bit\u002F8-bit 量化技术和专用显存清理节点，即使在生成复杂画面时也能稳定运行大型模型，Apple Silicon 用户更能享受原生加速带来的流畅体验。\n\nVibeVoice-ComfyUI 将高质量的多角色语音合成彻底融入可视化工作流，让创作者能专注于内容本身而非技术琐碎。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FEnemyx-net_VibeVoice-ComfyUI_459aa9af.jpg","Enemyx-net","Fabio Sarracino","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FEnemyx-net_f1d5b2ad.png",null,"Italy","https:\u002F\u002Fgithub.com\u002FEnemyx-net",[80],{"name":81,"color":82,"percentage":83},"Python","#3572A5",100,1458,226,"2026-04-16T10:45:18","MIT","Windows, Linux, macOS","可选但推荐。支持 NVIDIA GPU (CUDA)、Apple Silicon (MPS) 或 CPU。显存需求取决于模型：VibeVoice-1.5B 约需 6GB；VibeVoice-Large 约需 20GB；量化版 VibeVoice-Large-Q8 约需 12GB；量化版 VibeVoice-Large-Q4 约需 8GB。动态量化功能（4bit\u002F8bit）需要 CUDA GPU。","未说明",{"notes":92,"python":90,"dependencies":93},"1. 自 v1.6.0 起，模型和分词器必须手动下载并放置到指定目录，不再自动下载。2. 需要单独下载 Qwen2.5-1.5B 的分词器文件。3. 支持多种注意力机制（flash_attention_2, sdpa 等）以优化性能。4. 提供 4-bit 和 8-bit 量化选项以降低显存占用，其中 8-bit 量化在保持音质的同时显著减少显存需求。5. 
原生支持 Apple Silicon (M1\u002FM2\u002FM3) 的 MPS 加速。",[94],"transformers>=4.51.3",[15,96],"音频",[98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113],"ai-voice","ai-voice-clone","ai-voice-clonining","comfyui-custom-node","comfyui-custom-nodes-text-to-speech","comfyui-nodes","t2s","text-to-speech","tts","vibevoice","vibevoice-microsoft","voice-cloning","voice-generation","voice-generator","ai-audio","ai-tts","2026-03-27T02:49:30.150509","2026-04-17T10:19:24.323181",[117,122,127,131,136,140],{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},37409,"遇到 'VibeVoice embedded module import failed' 错误，提示请确保 vvembed 文件夹存在且 transformers>=4.51.3 已安装，如何解决？","此错误通常由环境配置不一致引起。虽然报错提示版本问题，但即使安装了更高版本的 transformers（如 4.57.1）且 vvembed 文件夹存在，仍可能报错。这通常发生在非标准 ComfyUI 安装或依赖冲突时。\n解决方案：\n1. 尝试使用官方提供的 ComfyUI Windows Portable 版本进行全新安装，不要手动修补依赖。\n2. 确保在干净的环境中安装节点，避免与其他自定义节点的库版本冲突。\n3. 如果问题依旧，检查 Python 环境是否被其他全局安装的库干扰，建议在 ComfyUI 自带的嵌入式 Python 环境中运行。","https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\u002Fissues\u002F114",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},37410,"生成语音时报错 'Failed to load VibeVoice processor. Error: expected str, bytes or os.PathLike object, not NoneType' 是什么原因？","该错误通常意味着模型文件加载失败或路径配置为空。最常见的原因是下载的模型文件损坏或不完整。\n解决方案：\n1. 删除现有的 VibeVoice 模型文件（通常位于 models\u002Fvibevoice\u002F 目录下）。\n2. 重新从头下载完整的模型文件（如 VibeVoice-Large）。\n3. 确保模型文件放置在正确的目录中，并且文件名与代码预期一致。\n4. 重启 ComfyUI 后重试。","https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\u002Fissues\u002F156",{"id":128,"question_zh":129,"answer_zh":130,"source_url":126},37411,"如何在显存较小（如 16GB）的显卡上运行 VibeVoice 以避免显存不足？","对于显存有限的用户（如 16GB VRAM），启用量化（quantization）是关键的优化手段。\n解决方案：\n1. 确保安装了支持量化的依赖库，特别是 bitsandbytes。可以使用命令 `pip install bitsandbytes==0.48.1` 进行安装（注意需匹配对应的 torch 版本）。\n2. 在工作流或节点设置中启用量化选项（如果节点支持）。\n3. 
用户反馈表明，正确配置量化后，16GB 显存可以顺利运行大型模型。",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},37412,"为什么在默认的多说话人（Multiple-Speaker）工作流中会报 'numpy.dtype size changed' 二进制不兼容错误？","这是一个典型的 numpy 版本与编译后的 C 扩展不匹配的问题，常发生在全新安装 ComfyUI 或更新某些库之后。\n解决方案：\n1. 尝试重新安装 numpy 以确保版本兼容：`pip install --force-reinstall numpy`。\n2. 如果使用的是 ComfyUI Portable 版本，尽量避免手动升级核心科学计算库（如 numpy, scipy），除非明确知道兼容版本。\n3. 在某些情况下，卸载并重新安装整个 VibeVoice 自定义节点及其依赖项可以解决此二进制兼容性问题。","https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\u002Fissues\u002F49",{"id":137,"question_zh":138,"answer_zh":139,"source_url":121},37413,"安装了正确版本的 transformers 和模型文件，但仍然报错 'import failed'，是否与环境有关？","是的，这与系统环境密切相关。ComfyUI 没有单一的“标准”版本，每个系统的硬件（CUDA 版本）、操作系统和已安装的库版本都不同。\n解决方案：\n1. 不要试图在复杂的现有 Python 环境中修补所有依赖，这往往会导致更多问题。\n2. 推荐直接使用 ComfyUI Windows Portable 版本，它包含了一个隔离的、经过测试的 Python 环境。\n3. 在该便携版中直接通过 Manager 安装节点，通常能自动解决大部分依赖路径和版本问题。",{"id":141,"question_zh":142,"answer_zh":143,"source_url":126},37414,"遇到 'expected Tensor as element 0 in argument 0, but got NoneType' 错误该如何排查？","此错误表示节点期望接收一个张量（Tensor）数据，但实际接收到的是空值（None）。这通常是由于上游节点执行失败或未正确连接导致的。\n解决方案：\n1. 检查工作流中连接到该节点的上游节点是否正常运行，是否有报错。\n2. 确认模型文件是否已成功加载（参考模型损坏的解决方法，重新下载模型）。\n3. 检查输入参数是否正确填写，确保没有留空必要的输入项。\n4. 
查看控制台日志中在该错误之前是否有其他警告或错误信息，这通常是根本原因所在。",[145,150,155,160,165,170,175,180,185,190,195,200,205,210,215,220,225,230,235,240],{"id":146,"version":147,"summary_zh":148,"released_at":149},297996,"v1.8.1","## 🐛 重要修复\n\n- **Bitsandbytes 更新**：强制安装 >=0.48.1 版本，该版本修复了 0.48.0 版本中的严重 bug，该 bug 导致 Q8 模型无法正常工作。\n- **通用 Bug 修复**：多项稳定性改进。\n\n## ⚠️ 重要提示\n\n如果您正在使用 Q8 模型或动态量化，并遇到 `VibeVoice 生成失败：参数 0 中的元素 0 应为 Tensor 类型，但实际为 NoneType` 的问题：\n- 请更新 bitsandbytes：`pip install bitsandbytes>=0.48.1 --upgrade`\n- 0.48.0 版本存在一个已知 bug，会导致 8 位量化失效。\n\n## 💾 安装方法\n\n可通过 ComfyUI 管理器安装，或手动克隆仓库：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\n```","2025-10-02T18:51:41",{"id":151,"version":152,"summary_zh":153,"released_at":154},297997,"v1.8.0","## 🎉 量化8位模型发布\n\n- **新模型**: VibeVoice-Large-Q8 - 在显存减少40%的情况下仍保持完美音质\n- **动态8位量化**: 新增即时量化为8位的选项\n\n## 📦 VibeVoice-Large-Q8\n\n### 下载\n- **HuggingFace**: [FabioSarracino\u002FVibeVoice-Large-Q8](https:\u002F\u002Fhuggingface.co\u002FFabioSarracino\u002FVibeVoice-Large-Q8)\n- **大小**: 11.6GB（比全精度版本小38%）\n- **显存需求**: 约12GB（相比约20GB减少了40%）\n\n### 核心创新\n- **完美音质**: 与全精度版本完全一致，无任何质量损失\n- **选择性量化**: 对音频关键组件仍保持全精度\n  - 扩散头 ✓ 全精度\n  - VAE ✓ 全精度  \n  - 连接器 ✓ 全精度\n  - LLM ✓ 8位量化\n\n## 🔧 动态8位量化\n\n### `quantize_llm` 参数增强\n现支持三种选项：\n- `full precision`（默认）：原始质量\n- `8bit`（新增）：平衡质量与显存占用\n- `4bit`：最大程度节省显存\n\n### 智能组件选择\n8位模式会智能跳过以下部分：\n- 扩散头\n- 声学\u002F语义连接器\n- 分词器\n- 所有音频处理组件\n\n## 💾 显存对比\n\n| 模型\u002F模式 | 显存占用 | 质量 |\n|------------|----------|------|\n| VibeVoice-Large | ~20GB | 完美 |\n| VibeVoice-Large-Q8 | ~12GB | 完美 |\n| 动态8位 | ~13-14GB | 优秀 |\n| 动态4位 | ~10-11GB | 很好 |\n| VibeVoice-Large-Q4 | ~8GB | 良好 |\n\n## 🎯 适用场景\n\n- **RTX 3060（12GB）**: 可以运行VibeVoice-Large-Q8并保持完美音质\n- **RTX 4070 Ti（12GB）**: 可以完整运行模型，无需妥协\n- **RTX 3050（8GB）**: 可使用VibeVoice-Large-Q4或动态4位模式\n- **生产环境**: Q8提供最佳的质量与显存比例\n- **开发环境**: 使用动态量化可实现快速迭代\n\n## ⚙️ 系统要求\n\n进行动态量化时需满足以下条件：\n- NVIDIA CUDA GPU\n- bitsandbytes 库\n- 在 CPU\u002FMPS 上会回退到全精度模式\n\n## 💾 安装方法\n\n可通过 
ComfyUI 管理器安装，或手动克隆仓库：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\n```","2025-10-01T02:44:45",{"id":156,"version":157,"summary_zh":158,"released_at":159},297998,"v1.7.0","## 🚀 动态 LLM 量化\n\n- **新功能**：对未量化的模型进行实时 4 位量化\n- **选择性量化**：仅量化语言模型部分，扩散模型保持全精度\n- **大幅节省显存**：在质量影响极小的情况下显著降低内存占用\n- **轻松切换**：通过简单的下拉菜单即可在全精度和 4 位之间切换\n\n## ✨ 新增内容\n\n### LLM 参数量化\n- **新增参数**：在单人声和多人声节点中均新增 `quantize_llm`\n- **选项**：\n  - `full precision`（默认）——保持原始模型质量\n  - `4bit`——动态量化以节省显存\n\n### 工作原理\n- 仅量化语言模型组件（模型中最大的部分）\n- 扩散头部分保持全精度以保证质量\n- 使用针对神经网络优化的 NF4（4 位 NormalFloat）\n- 动态应用——无需下载单独的量化模型\n\n## ⚡ 性能优势\n\n- ✅ **生成速度更快**：减少内存带宽需求，提升处理速度\n- ✅ **显存占用更低**：可在较小显存的 GPU 上运行更大模型\n- ✅ **批量处理能力更强**：更多内容可同时加载到显存中，支持并行生成\n- ✅ **质量无明显下降**：与全精度相比，质量损失极小\n\n## 🎯 使用场景\n\n非常适合以下情况：\n- 在 16GB 显存的 GPU 上运行 VibeVoice-Large\n- 批量生成多个音频片段\n- 开发阶段快速迭代\n- 显存有限的生产环境\n\n## ⚙️ 系统要求\n\n- **GPU**：支持 NVIDIA CUDA 的 GPU\n- **库**：bitsandbytes（如缺失会自动安装）\n- **注意**：在 CPU 或 MPS 上会回退到全精度\n\n## 💡 智能检测\n\n- 自动禁用预量化模型\n- 仅适用于标准的全精度模型\n- 日志会清晰记录量化是否已启用\n\n## 💾 安装方法\n\n可通过 ComfyUI 管理器安装，或手动执行：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\n```\n\n## 📋 使用建议\n\n- 建议先使用全精度模式建立质量基准\n- 在生产环境或显存不足时切换至 4 位模式\n- 非常适合在消费级 GPU 上运行 VibeVoice-Large\n- 对于已经量化的模型不会产生任何影响，保持原样\n\n## 🔧 技术细节\n\n- 量化类型：NF4（4 位 NormalFloat）\n- 量化为会话级操作（不会保存到磁盘）","2025-09-30T18:07:59",{"id":161,"version":162,"summary_zh":163,"released_at":164},297999,"v1.6.3","## 🐛 修复错误\n\n- 修复了 `TypeError: expected str, bytes or os.PathLike object, not NoneType` 错误\n- 添加了针对分词器文件路径解析的健壮回退机制\n- 改进了 `vocab.json` 和 `merges.txt` 文件的加载处理\n- 增强了分词器初始化中边缘情况的错误处理\n\n## 🔧 修复内容\n\n### 分词器加载\n- 更好地处理缺失或不完整的分词器文件\n- 对分词器组件的路径解析更加稳健\n- 改进了针对各种安装场景的回退机制\n\n## 💾 安装\n\n可通过 ComfyUI 管理器安装，或手动安装：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\n```","2025-09-30T01:26:50",{"id":166,"version":167,"summary_zh":168,"released_at":169},298000,"v1.6.2","## 🐛 分词器修复\n\n- 修复了分词器加载与 
Hugging Face 缓存之间的冲突\n- 分词器现在可直接从指定路径加载\n- 添加了显式文件路径加载，以提高可靠性\n- 改进了日志记录，以便显示正在使用的分词器文件\n\n## 🔧 修复内容\n\n- 解决了 HF 缓存可能干扰本地分词器文件的问题\n- 分词器加载更加可预测和可靠\n- 更清晰地反馈正在使用的文件\n\n## 💾 安装\n\n可通过 ComfyUI 管理器安装，或手动安装：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\n```","2025-09-29T16:37:22",{"id":171,"version":172,"summary_zh":173,"released_at":174},298001,"v1.6.1","## 🧹 代码清理\n\n- 移除了不必要的 Hugging Face 配置设置\n- 简化了集成代码\n- 代码库更加整洁，便于维护\n\n## 💾 安装\n\n可通过 ComfyUI 管理器安装，或手动克隆：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\n```\n\n## 📋 注意事项\n\n本次为小幅更新，主要提升了代码质量。对用户功能无影响。","2025-09-27T21:12:49",{"id":176,"version":177,"summary_zh":178,"released_at":179},298002,"v1.6.0","## ⚠️ 重大变更 - 模型管理\n\n- **需手动下载**：模型和分词器现在必须手动下载\n- **动态模型检测**：浏览器刷新时，下拉菜单会扫描可用的模型\n- **灵活的文件夹结构**：支持自定义名称和 HuggingFace 缓存格式\n- **改进的日志记录**：控制台输出更整洁、更清晰\n\n## 🔄 主要变更\n\n### 手动管理模型和分词器\n**v1.6.0 之前：**\n- 自动从 HuggingFace 下载模型和分词器\n- 部分模型需要身份验证\n- 模型列表固定不变\n\n**v1.6.0 之后：**\n- 需要手动下载并放置文件\n- 无需身份验证\n- 动态检测模型\n- 完全掌控模型文件\n\n## 📦 安装要求\n\n### 1. 分词器（必选）\n下载 Qwen2.5-1.5B 的分词器文件：\n- **来源**：[Qwen2.5-1.5B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-1.5B\u002Ftree\u002Fmain)\n- **所需文件**：`tokenizer_config.json`、`vocab.json`、`merges.txt`、`tokenizer.json`\n- **存放路径**：`ComfyUI\u002Fmodels\u002Fvibevoice\u002Ftokenizer\u002F`\n\n### 2. 模型\n下载您偏好的模型：\n\n| 模型 | 大小 | 来源 |\n|-------|------|--------|\n| **VibeVoice-1.5B** | ~5GB | [microsoft\u002FVibeVoice-1.5B](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FVibeVoice-1.5B) |\n| **VibeVoice-Large** | ~17GB | [aoi-ot\u002FVibeVoice-Large](https:\u002F\u002Fhuggingface.co\u002Faoi-ot\u002FVibeVoice-Large) |\n| **VibeVoice-Large-Quant-4Bit** | ~7GB | [DevParker\u002FVibeVoice7b-low-vram](https:\u002F\u002Fhuggingface.co\u002FDevParker\u002FVibeVoice7b-low-vram) |\n\n### 3. 
文件夹结构\n```\nComfyUI\u002Fmodels\u002Fvibevoice\u002F\n├── tokenizer\u002F                 # 必需 - Qwen 分词器\n│   ├── tokenizer_config.json\n│   ├── vocab.json\n│   ├── merges.txt\n│   └── tokenizer.json\n├── VibeVoice-1.5B\u002F           # 模型文件夹\n│   └── ... (模型文件)\n├── VibeVoice-Large\u002F\n│   └── ... (模型文件)\n└── custom-model-name\u002F        # 支持自定义名称\n    └── ... (模型文件)\n```\n\n## ✨ 优势\n\n- ✅ **无认证问题**：可使用私有或受限制的仓库\n- ✅ **离线运行**：下载后无需互联网连接\n- ✅ **自定义模型**：轻松集成微调后的变体\n- ✅ **存储可控**：仅保留所需的模型\n- ✅ **日志更整洁**：改进了控制台输出\n\n## 🎯 功能\n\n### 动态模型检测\n- 每次浏览器刷新时扫描\n- 自动检测模型类型（标准\u002F量化）\n- 支持自定义文件夹名称\n- 兼容 HF 缓存结构\n\n### 分词器搜索优先级\n1. `ComfyUI\u002Fmodels\u002Fvibevoice\u002Ftokenizer\u002F`（推荐）\n2. 现有的 Qwen 缓存文件夹\n3. HuggingFace 缓存（如有）\n\n## 📝 迁移指南\n\n**对于现有用户：**\n1. 您现有的模型将继续正常工作\n2. 如果之前使用自动下载功能，分词器很可能已经缓存\n3. 刷新浏览器即可在动态下拉菜单中看到可用的模型\n4. 可选：为更好地组织文件，您可以将分词器移动到 `ComfyUI\u002Fmodels\u002Fvibevoice\u002Ftokenizer\u002F` 目录下\n\n**对于新安装用户：**\n1. 下载分词器（r","2025-09-27T19:55:48",{"id":181,"version":182,"summary_zh":183,"released_at":184},298003,"v1.5.0","## 🎚️ 语音速度控制\n\n- **语速调节**：新增功能，可控制生成音频的朗读节奏\n- **时间伸缩技术**：通过修改参考语音来影响输出速度\n- **精细调控**：速度可在0.8倍至1.2倍之间以1%为单位进行调整\n- **通用支持**：同时适用于单人声和多人声节点\n\n## ✨ 新增内容\n\n### 语音速度因子参数\n- **参数**：`voice_speed_factor`\n- **范围**：0.8至1.2（比正常速度慢20%至快20%）\n- **默认值**：1.0（正常速度）\n- **步长**：0.01，便于精确调整\n\n### 工作原理\n该功能会在处理前对参考音频应用时间伸缩：\n- **\u003C 1.0**：放慢参考音频 → 生成较慢的语音\n- **= 1.0**：正常速度（默认）\n- **> 1.0**：加快参考音频 → 生成较快的语音\n\n## 🎯 推荐用法\n\n### 最佳设置\n- **自然范围**：0.95至1.05（±5%调整）\n- **较慢语音**：0.95（轻松舒缓）或0.97（稍慢）\n- **较快语音**：1.03（充满活力）或1.05（快速节奏）\n\n### 使用建议\n- ✅ 建议使用20秒以上的参考音频以获得最佳效果\n- ✅ 进行小幅调整（±5%），以保持自然音效\n- ✅ 尝试不同数值，找到最适合您内容的语速\n\n## 💡 应用场景\n\n- 📚 **有声书**：根据不同类型调整语速\n- 🎙️ **播客**：匹配主播的说话风格\n- 📺 **视频旁白**：与视频节奏同步\n- 🎓 **教育内容**：放慢语速以方便学习\n- 📢 **公告**：加快语速以增强动态感\n\n## ⚠️ 重要提示\n\n- 仅适用于语音克隆功能（需提供参考音频）\n- 合成语音（无输入音频）不受此功能影响\n- 极端值可能导致不自然的效果\n- 在多人声模式下，该功能会同时应用于所有说话者\n- 参考音频越长（20秒以上），效果越好\n\n## 🔧 技术细节\n\n- 调整语速时保持音高不变\n- 无缝集成现有功能\n- 兼容所有模型变体及LoRA插件\n\n## 💾 
安装方法\n\n可通过 ComfyUI 管理器安装，或手动克隆仓库：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\n```","2025-09-26T15:59:20",{"id":186,"version":187,"summary_zh":188,"released_at":189},298004,"v1.4.3","## 🔧 LoRA 系统增强\n\n- **兼容性检测**：自动检查模型与 LoRA 的兼容性\n- **更完善的日志记录**：增强的调试信息，便于故障排查\n- **清晰的错误提示**：针对不兼容组合提供详尽的反馈信息\n\n## ✨ 改进内容\n\n### 兼容性检查\n- **自动检测**：在加载前验证 LoRA 的兼容性\n- **模型匹配**：确保 LoRA 与基础模型架构一致\n- **量化感知**：检测并处理量化模型的限制\n- **防止崩溃**：在出现错误之前阻止不兼容的加载\n\n### 增强的调试功能\n- **详细日志**：可查看 LoRA 加载的每一步流程\n- **组件状态**：明确指示哪些 LoRA 组件成功加载\n- **错误诊断**：具体说明加载失败的原因\n\n## 🛡️ 稳定性提升\n\n- ✅ **不再无声失败**：当 LoRA 无法加载时，会给出明确提示\n- ✅ **量化模型安全**：避免在 4 位模型上使用标准 LoRA 时发生错误\n- ✅ **更好的错误恢复**：对不兼容组合进行优雅处理\n- ✅ **优化加载过程**：权重加载性能略有提升\n\n## 💾 安装方法\n\n可通过 ComfyUI 管理器安装，或手动克隆：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\n```","2025-09-26T12:36:48",{"id":191,"version":192,"summary_zh":193,"released_at":194},298005,"v1.4.2","## 🐛 修复错误\n\n- 修复了用户报告的轻微问题\n\n## 💾 安装\n\n可通过 ComfyUI 管理器安装，或手动安装：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\n```","2025-09-25T21:48:14",{"id":196,"version":197,"summary_zh":198,"released_at":199},298006,"v1.4.1","## 🐛 Bug Fix\r\n\r\n- **Fixed HuggingFace Authentication**: Resolved 401 authorization errors when loading locally cached models\r\n- **Local Model Loading**: Node now correctly uses local model snapshots without API authentication\r\n- **Reduced API Calls**: Prevents unnecessary HuggingFace API requests for existing models\r\n\r\n## 🔧 What's Fixed\r\n\r\n### Authentication Issue\r\n- Models already downloaded to `ComfyUI\u002Fmodels\u002Fvibevoice\u002F` no longer require HuggingFace login\r\n- Eliminates \"401 Unauthorized\" errors for cached models\r\n- Properly detects and uses local model files without internet verification\r\n\r\n## ✅ Impact\r\n\r\n- **Offline Usage**: Work with downloaded models without internet connection\r\n- **No Login 
Required**: Use cached models without HuggingFace account\r\n- **Faster Loading**: Skip API validation for local models\r\n- **Better Reliability**: No dependency on HuggingFace API availability\r\n\r\n## 💾 Installation\r\n\r\nInstall via ComfyUI Manager or manually:\r\n```bash\r\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\r\n```\r\n\r\n## 📋 Notes\r\n\r\nThis fix ensures smooth operation for users who:\r\n- Work offline with pre-downloaded models\r\n- Experience HuggingFace API rate limits\r\n- Prefer local-only workflows without external dependencies","2025-09-25T18:19:51",{"id":201,"version":202,"summary_zh":203,"released_at":204},298007,"v1.4.0","## 🎯 LoRA (Low-Rank Adaptation) Support\r\n\r\n- **Custom Voice Adaptations**: New LoRA support for fine-tuned voice models\r\n- **Dedicated Node**: Added \"VibeVoice LoRA\" node for easy configuration\r\n- **Flexible Control**: Adjustable strength and component toggles\r\n- **Low Memory Impact**: Only ~100-500MB overhead per LoRA\r\n\r\n## ✨ What's New\r\n\r\n### VibeVoice LoRA Node\r\n- **Easy Selection**: Dropdown menu for LoRAs stored in `ComfyUI\u002Fmodels\u002Fvibevoice\u002Floras\u002F`\r\n- **Component Control**: Toggle individual LoRA components\r\n  - Language Model adaptation\r\n  - Diffusion Head adaptation\r\n  - Connector adaptation\r\n- **Strength Adjustment**: Fine-tune LoRA influence (0.0-1.0)\r\n- **Chain Compatible**: Works with both Single and Multiple Speaker nodes\r\n\r\n## 🎛️ How to Use\r\n\r\n1. **Install LoRAs**: Place all LoRA files in a new directory inside:\r\n`ComfyUI\u002Fmodels\u002Fvibevoice\u002Floras\u002F` for example: `ComfyUI\u002Fmodels\u002Fvibevoice\u002Floras\u002Fmylora`\r\n\r\n2. **Add LoRA Node**: Insert \"VibeVoice LoRA\" node in your workflow\r\n\r\n3. **Connect**: Link LoRA output to speaker nodes\r\n\r\n4. 
**Configure**:\r\n   - Select LoRA from dropdown\r\n   - Adjust strength (default: 1.0)\r\n   - Toggle components as needed\r\n\r\n## 💡 Use Cases\r\n\r\n- 🎭 **Character Voices**: Fine-tuned voices for specific characters\r\n- 🌍 **Language\u002FAccent**: Specialized language or accent models\r\n- 🎯 **Domain Specific**: Industry or context-specific adaptations\r\n- 🎨 **Style Transfer**: Apply specific speaking styles or emotions\r\n- 👤 **Personal Voices**: Custom voice cloning improvements\r\n\r\n## 🚀 Benefits\r\n\r\n- ✅ **Minimal Overhead**: Only 100-500MB per LoRA vs full model fine-tuning\r\n- ✅ **Preservation**: Keep base model intact while adding adaptations\r\n\r\n## 🔧 Technical Details\r\n\r\n- Supports VibeVoice LoRA\r\n- Compatible with all VibeVoice model variants\r\n- Works alongside existing features (pause tags, text chunking, etc.)\r\n- Efficient memory usage through low-rank decomposition\r\n\r\n## 💾 Installation\r\n\r\nInstall via ComfyUI Manager or manually:\r\n```bash\r\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\r\n```\r\n\r\n## 🙏 Credits\r\n\r\nLoRA implementation contributed by [@jpgallegoar](https:\u002F\u002Fgithub.com\u002Fjpgallegoar) - Thank you for this amazing addition!\r\n\r\n## 📋 Notes\r\n\r\nVibeVoice finetuning:\r\nhttps:\u002F\u002Fgithub.com\u002Fvoicepowered-ai\u002FVibeVoice-finetuning","2025-09-25T17:05:24",{"id":206,"version":207,"summary_zh":208,"released_at":209},298008,"v1.3.0","## 🎙️ Custom Pause Tags\r\n\r\n- **Speech Pacing Control**: Added support for inserting pauses in generated speech\r\n- **Default Pause**: Use `[pause]` for a 1-second silence\r\n- **Custom Duration**: Use `[pause:ms]` for specific durations (e.g., `[pause:2000]` for 2 seconds)\r\n- **Universal Support**: Works with both Single Speaker and Multiple Speakers nodes\r\n\r\n## ✨ What's New\r\n\r\n### Pause Tag Syntax\r\n- **`[pause]`**: Inserts a 1-second silence\r\n- **`[pause:500]`**: Inserts 500ms (0.5 second) 
silence\r\n- **`[pause:2000]`**: Inserts 2-second silence\r\n- **`[pause:3500]`**: Inserts 3.5-second silence\r\n\r\n### How It Works\r\n- The wrapper parses your text to find pause tags\r\n- Text segments between pauses are processed independently\r\n- Silence audio is generated for each pause duration\r\n- All audio segments (speech and silence) are concatenated\r\n\r\n## 💡 Use Cases\r\n\r\nPerfect for:\r\n- 📖 **Dramatic Reading**: Add pauses for emphasis\r\n- 🎭 **Dialogue**: Natural conversation pacing\r\n- 📚 **Lists & Instructions**: Clear separation between items\r\n- 🎙️ **Podcasts\u002FNarration**: Professional pacing control\r\n- 📝 **Poetry**: Preserve verse structure and rhythm\r\n\r\n## 📝 Example Usage\r\n`Welcome to our presentation. [pause] Today we'll discuss three main topics. [pause] First, the introduction. [pause:500] Second, the main content. [pause:500] And finally, our conclusions.`\r\n\r\n## 🔧 Technical Details\r\n\r\n- **Note**: This is a wrapper feature, not part of Microsoft's VibeVoice\r\n- **Note**: The pause forces the text to be split into chunks. This may worsen the model's ability to understand the context. 
The model's context is represented ONLY by its own chunk.\r\n\r\n## 💾 Installation\r\n\r\nInstall via ComfyUI Manager or manually:\r\n```bash\r\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\r\n```","2025-09-11T02:55:09",{"id":211,"version":212,"summary_zh":213,"released_at":214},298009,"v1.2.5","## 🐛 Bug Fixes\r\n\r\n- Various stability improvements\r\n- Fixed minor issues reported by users\r\n- Fixed \"got an unexpected keyword argument enable_gqa\" bug with sage attention\r\n- General code optimizations\r\n\r\n## 💾 Installation\r\n\r\nInstall via ComfyUI Manager or manually:\r\n```bash\r\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\r\n```","2025-09-11T01:05:01",{"id":216,"version":217,"summary_zh":218,"released_at":219},298010,"v1.2.4","## 📝 Automatic Text Chunking\r\n\r\n- **Long Text Support**: Single Speaker node now automatically handles texts of any length\r\n- **Audio Quality Fix**: Prevents audio acceleration issues that occurred with long texts\r\n- **Configurable Chunking**: New `max_words_per_chunk` parameter for fine-tuning\r\n- **Seamless Output**: Automatically concatenates chunks for smooth, natural speech\r\n\r\n## ✨ What's New\r\n\r\n### Smart Text Processing\r\n- **Automatic Splitting**: Texts longer than 250 words are intelligently divided\r\n- **Consistent Voice**: Uses the same seed across all chunks for uniform voice characteristics\r\n- **Natural Breaks**: Splits at sentence boundaries when possible\r\n- **Transparent Process**: Works automatically without user intervention\r\n\r\n### New Parameter\r\n- **`max_words_per_chunk`**: Customize chunk size\r\n  - Range: 100-500 words\r\n  - Default: 250 words\r\n  - Lower values: More stable but more processing time\r\n  - Higher values: Faster but may risk acceleration on very long texts\r\n\r\n## 🎯 Benefits\r\n\r\n- ✅ **No More Audio Acceleration**: Fixes the common issue of sped-up audio on long texts\r\n- ✅ **Unlimited Text 
Length**: Process entire documents, books, or scripts\r\n- ✅ **Consistent Quality**: Maintains voice characteristics throughout\r\n- ✅ **Smooth Playback**: Seamless audio concatenation with no audible breaks\r\n\r\n## 💡 Use Cases\r\n\r\nPerfect for:\r\n- 📖 Audiobook generation\r\n- 📚 Long-form content narration\r\n- 📝 Document reading\r\n- 🎭 Extended monologues\r\n\r\n## 🔧 Usage\r\n\r\nThe chunking happens automatically. To adjust:\r\n1. Find the `max_words_per_chunk` parameter in Single Speaker node\r\n2. Adjust based on your needs:\r\n   - **Stability Priority**: Use 150-200 words\r\n   - **Balanced** (default): 250 words\r\n   - **Speed Priority**: Use 300-500 words\r\n\r\n## 💾 Installation\r\n\r\nInstall via ComfyUI Manager or manually:\r\n```\r\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\r\n```\r\n\r\n## 📋 Notes\r\n\r\nMulti-Speaker node unaffected (already processes by speaker segments)\r\nChunking is transparent - you'll receive a single audio output\r\nProcessing time scales linearly with text length\r\nVoice consistency is maintained using deterministic seeding","2025-09-09T11:48:25",{"id":221,"version":222,"summary_zh":223,"released_at":224},298011,"v1.2.3","## ⚡ SageAttention Integration\r\n\r\n- **New Attention Type**: Added \"sage\" option to attention_type parameter\r\n- **Quantized Attention**: Uses INT8\u002FFP8 precision for faster inference\r\n- **Performance Boost**: Speedup in generation with minimal quality impact\r\n\r\n## 🚀 Performance Improvements\r\n\r\n### Speed Comparison\r\n- **SageAttention**: Up to 2-3x faster than standard attention\r\n- **Memory Efficient**: Reduced memory bandwidth usage\r\n- **Quality Maintained**: Negligible quality difference in output\r\n\r\n## ✨ How It Works\r\n\r\nSageAttention uses quantized operations (INT8\u002FFP8) instead of full precision:\r\n- ✅ **Faster Matrix Operations**: Quantized math is significantly faster\r\n- ✅ **Lower Memory Bandwidth**: Smaller data types 
reduce memory transfer\r\n- ✅ **Intelligent Quantization**: Preserves important information while reducing precision\r\n- ✅ **Hardware Acceleration**: Leverages modern GPU tensor cores\r\n\r\n## ⚙️ Requirements\r\n\r\n### For SageAttention:\r\n- **GPU**: NVIDIA GPU with CUDA support\r\n- **Library**: Install sageattention\r\n- **CUDA**: Compatible CUDA version installed\r\n\r\n## 🎛️ Usage\r\n\r\nSelect \"sage\" from the attention_type dropdown:\r\n`auto`: Let transformers choose (default)\r\n`eager`: Standard implementation\r\n`sdpa`: Scaled Dot Product Attention\r\n`flash_attention_2`: Flash Attention 2\r\n`sage`: SageAttention (NEW) - quantized for speed\r\n\r\n## 💾 Installation\r\n\r\nInstall via ComfyUI Manager or manually:\r\n`git clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI`\r\n\r\n## 📋 Notes\r\n\r\nSageAttention is only available for NVIDIA GPUs with CUDA\r\nFalls back to standard attention if requirements aren't met\r\nIdeal for production environments prioritizing speed\r\nCompatible with all model variants including 4-bit quantized models","2025-09-09T09:58:39",{"id":226,"version":227,"summary_zh":228,"released_at":229},298012,"v1.2.2","## 🚀 4-Bit Quantized Model\r\n\r\n- **New Model Option**: Added `VibeVoice-Large-Quant-4Bit` to the model selection menu (Thanks to DevParker for quantizing the model)\r\n- **Memory Optimization**: Uses only ~7GB VRAM instead of ~17GB for the full model\r\n- **Performance**: Maintains excellent quality while significantly reducing memory requirements\r\n\r\n## 💾 Memory Savings\r\n\r\n### VRAM Usage Comparison\r\n- **VibeVoice-Large**: ~17GB VRAM\r\n- **VibeVoice-Large-Quant-4Bit**: ~7GB VRAM\r\n- **Savings**: 60% reduction in memory usage\r\n\r\n## ✨ Benefits\r\n\r\n- ✅ **Accessible to More GPUs**: Run on cards with 8GB+ VRAM\r\n- ✅ **Faster Loading**: Reduced model size means quicker initialization\r\n- ✅ **Batch Processing**: Fit larger batches in the same VRAM budget\r\n- ✅ **Quality 
Preserved**: Minimal quality loss with 4-bit quantization\r\n\r\n## ⚙️ Requirements\r\n\r\n### For 4-Bit Model:\r\n- **GPU**: NVIDIA GPU with CUDA support\r\n- **Library**: `bitsandbytes` must be installed\r\n```bash\r\npip install bitsandbytes\r\n```\r\n- **VRAM**: Minimum 8GB recommended\r\n\r\n## 💾 Installation\r\n\r\nInstall via ComfyUI Manager or manually:\r\n```bash\r\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\r\n```\r\n\r\n## 📋 Usage\r\n\r\n- Ensure bitsandbytes is installed\r\n- Select VibeVoice-Large-Quant-4Bit from the model dropdown\r\n- Enjoy the same great quality with lower VRAM usage\r\n\r\n## 🔧 Notes\r\n\r\n- The 4-bit model is only available for NVIDIA GPUs with CUDA\r\n- First-time use will download the quantized model (~7GB)","2025-09-07T05:02:45",{"id":231,"version":232,"summary_zh":233,"released_at":234},298013,"v1.2.1","## 🐛 Bug Fixes\r\n\r\n- Various stability improvements\r\n- Fixed minor issues reported by users\r\n- General code optimizations\r\n\r\n## 💾 Installation\r\n\r\nInstall via ComfyUI Manager or manually:\r\n```bash\r\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\r\n```\r\n\r\n## 📋 Notes\r\n\r\nThis maintenance release improves overall stability and reliability. 
No new features or breaking changes.","2025-09-06T16:45:54",{"id":236,"version":237,"summary_zh":238,"released_at":239},298014,"v1.2.0","## 🎉 Apple Silicon Support\r\n\r\n- **MPS Backend**: Added GPU acceleration for Mac with Apple Silicon (M1\u002FM2\u002FM3)\r\n- **Automatic Detection**: Seamlessly detects and utilizes MPS when available\r\n- **Performance Boost**: Significant speed improvements over CPU-only processing\r\n\r\n## 🍎 What's New\r\n\r\n### Metal Performance Shaders (MPS)\r\n- **Native GPU Acceleration**: Leverages Apple's Metal framework for neural processing\r\n- **Automatic Backend Selection**: No manual configuration needed\r\n- **Full Compatibility**: All VibeVoice features work with MPS acceleration\r\n- **Fallback Support**: Gracefully falls back to CPU if MPS is unavailable\r\n\r\n## ⚡ Performance Improvements\r\n\r\nOn Apple Silicon Macs:\r\n- ✅ **5-10x faster** generation compared to CPU-only mode\r\n- ✅ **Lower memory pressure** with efficient GPU memory management\r\n- ✅ **Better power efficiency** using dedicated neural engines\r\n- ✅ **Smoother workflows** with reduced generation times\r\n\r\n## 🖥️ Supported Devices\r\n\r\n- **M1**: MacBook Air, MacBook Pro, Mac mini, iMac\r\n- **M1 Pro\u002FMax\u002FUltra**: MacBook Pro, Mac Studio\r\n- **M2**: MacBook Air, MacBook Pro, Mac mini\r\n- **M2 Pro\u002FMax\u002FUltra**: MacBook Pro, Mac Studio\r\n- **M3**: MacBook Pro, iMac\r\n- **M3 Pro\u002FMax**: MacBook Pro\r\n\r\n## 💾 Installation\r\n\r\nInstall via ComfyUI Manager or manually:\r\n```bash\r\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\r\n```\r\n\r\n## 📋 Notes\r\n\r\nMPS support is automatically enabled on compatible devices\r\nNo additional configuration or dependencies required\r\nIntel Macs continue to use CPU processing as before\r\nWindows\u002FLinux users are unaffected by this update\r\n\r\n## 🔧 Technical Details\r\nThe implementation uses PyTorch's MPS backend, providing native Metal 
acceleration while maintaining full compatibility with existing workflows and features.","2025-09-06T03:05:01",{"id":241,"version":242,"summary_zh":243,"released_at":244},298015,"v1.1.1","## 🔧 Compatibility Enhancement\r\n\r\n- **Universal Transformers Support**: Implemented adaptive system for automatic compatibility with different transformers versions\r\n- **Version Range**: Guaranteed compatibility from v4.51.3 onwards\r\n- **Auto-Detection**: Automatically adapts to API changes between versions\r\n\r\n## ✨ What's New\r\n\r\n### Adaptive Compatibility System\r\n- **Automatic Version Detection**: Detects installed transformers version at runtime\r\n- **Dynamic API Adaptation**: Adjusts function calls based on detected version\r\n- **Future-Proof**: Handles both current and upcoming transformers releases\r\n- **No Manual Configuration**: Works out-of-the-box with any supported version\r\n\r\n## 🎯 Benefits\r\n\r\n- ✅ **No More Version Conflicts**: Works with any transformers version ≥4.51.3\r\n- ✅ **Seamless Updates**: Update transformers without breaking VibeVoice\r\n- ✅ **Better Integration**: Compatible with other nodes requiring different transformers versions\r\n- ✅ **Zero Configuration**: No need to manually specify or lock versions\r\n\r\n## 💡 Technical Details\r\n\r\nThe adaptive system handles:\r\n- API signature changes between versions\r\n- New parameter requirements\r\n- Breaking changes in transformers updates\r\n\r\n## 💾 Installation\r\n\r\nInstall via ComfyUI Manager or manually:\r\n```bash\r\ngit clone https:\u002F\u002Fgithub.com\u002FEnemyx-net\u002FVibeVoice-ComfyUI\r\n```\r\n\r\n## 📋 Notes\r\n\r\nThis update ensures maximum compatibility across different ComfyUI environments\r\nNo action required from users - the system automatically adapts\r\nResolves common installation issues related to transformers version mismatches","2025-09-05T16:32:11"]