[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-OpenBMB--VoxCPM":3,"tool-OpenBMB--VoxCPM":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":78,"owner_location":78,"owner_email":79,"owner_twitter":75,"owner_website":80,"owner_url":81,"languages":82,"stars":87,"forks":88,"last_commit_at":89,"license":90,"difficulty_score":10,"env_os":91,"env_gpu":92,"env_ram":91,"env_deps":93,"category_tags":101,"github_topics":102,"view_count":114,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":115,"updated_at":116,"faqs":117,"releases":156},2298,"OpenBMB\u002FVoxCPM","VoxCPM","VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning","VoxCPM 是一款创新的开源文本转语音（TTS）系统，致力于重新定义语音合成的真实感。它主要解决了传统技术因依赖离散令牌（tokenization）而导致的语音生硬、情感缺失及克隆失真等痛点，能够生成极具表现力且自然流畅的语音。\n\n该工具特别适合 AI 研究人员、开发者以及需要高质量语音内容的创作者使用。无论是开发实时交互应用、进行多语言语音研究，还是为视频内容定制拟真配音，VoxCPM 都能提供强大的支持。普通用户也可通过其在线演示轻松体验前沿的语音克隆技术。\n\nVoxCPM 的核心亮点在于其“无令牌”（Tokenizer-Free）架构。它摒弃了将声音转换为离散代码的传统步骤，采用端到端的扩散自回归模型，直接在连续空间中生成语音。基于 MiniCPM-4 骨干网络，它能隐式解耦语义与声学特征，从而实现两大旗舰能力：一是“语境感知”，能根据文本内容自动推断并生成恰当的语调、节奏和情感；二是“高保真零样本克隆”，仅需极短的参考音频，即可精准复刻说话人的音色、口音甚至细微的情绪波动。此外，其在消费级显卡上实现了低至 0.17 的实时因子，足以胜任实时流式合成任务。","## 🎙️ VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning\n\n\n[![Project 
Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject%20Page-GitHub-blue)](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002F) [![Technical Report](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTechnical%20Report-Arxiv-red)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24650)[![Live Playground](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLive%20PlayGround-Demo-orange)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenBMB\u002FVoxCPM-Demo) [![Samples](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAudio%20Samples-Page-green)](https:\u002F\u002Fopenbmb.github.io\u002FVoxCPM-demopage)\n\n#### VoxCPM1.5 Model Weights\n\n [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-OpenBMB-yellow)](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM1.5) [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-OpenBMB-purple)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FVoxCPM1.5)  \n\n\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_readme_1da4dc4794a5.png\" alt=\"VoxCPM Logo\" width=\"40%\">\n  \n  \u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F17704\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_readme_4a68feb902da.png\" alt=\"OpenBMB%2FVoxCPM | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\n\n\u003Cdiv align=\"center\">\n\n👋 Contact us on [WeChat](assets\u002Fwechat.png)\n\n\u003C\u002Fdiv>\n\n## News \n\n* [2026.03.30] **VoxCPM2 is coming soon** 🤗\n* [2025.12.05] 🎉 🎉 🎉  We Open Source the VoxCPM1.5 [weights](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM1.5)! The model now supports both full-parameter fine-tuning and efficient LoRA fine-tuning, empowering you to create your own tailored version. 
See [Release Notes](docs\u002Frelease_note.md) for details.\n* [2025.09.30] 🔥 🔥 🔥  We Release VoxCPM [Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24650)!\n* [2025.09.16] 🔥 🔥 🔥  We Open Source the VoxCPM-0.5B [weights](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM-0.5B)!\n* [2025.09.16] 🎉 🎉 🎉  We Provide the [Gradio PlayGround](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenBMB\u002FVoxCPM-Demo) for VoxCPM-0.5B, try it now! \n\n## Overview\n\nVoxCPM is a novel tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and enables two flagship capabilities: context-aware speech generation and true-to-life zero-shot voice cloning.\n\nUnlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the [MiniCPM-4](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-0.5B) backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing both expressiveness and generation stability.\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_readme_525bce2c3920.png\" alt=\"VoxCPM Model Architecture\" width=\"90%\">\n\u003C\u002Fdiv>\n\n\n###  🚀 Key Features\n- **Context-Aware, Expressive Speech Generation** - VoxCPM comprehends text to infer and generate appropriate prosody, delivering speech with remarkable expressiveness and natural flow. 
It spontaneously adapts speaking style based on content, producing highly fitting vocal expression trained on a massive 1.8 million-hour bilingual corpus.\n- **True-to-Life Voice Cloning** - With only a short reference audio clip, VoxCPM performs accurate zero-shot voice cloning, capturing not only the speaker's timbre but also fine-grained characteristics such as accent, emotional tone, rhythm, and pacing to create a faithful and natural replica.\n- **High-Efficiency Synthesis** - VoxCPM supports streaming synthesis with a Real-Time Factor (RTF) as low as 0.17 on a consumer-grade NVIDIA RTX 4090 GPU, making real-time applications possible.\n\n### 📦 Model Versions\nSee [Release Notes](docs\u002Frelease_note.md) for details.\n- **VoxCPM1.5** (Latest): \n  - Model Params: 800M\n  - Sampling rate of AudioVAE: 44100\n  - Token rate in LM Backbone: 6.25Hz (patch-size=4)\n  - RTF on a single NVIDIA RTX 4090 GPU: ~0.15\n\n- **VoxCPM-0.5B** (Original):\n  - Model Params: 640M\n  - Sampling rate of AudioVAE: 16000\n  - Token rate in LM Backbone: 12.5Hz (patch-size=2)\n  - RTF on a single NVIDIA RTX 4090 GPU: 0.17\n\n\n\n##  Quick Start\n\n### 🔧 Install from PyPI\n``` sh\npip install voxcpm\n```\n### 1.  Model Download (Optional)\nBy default, when you first run the script, the model will be downloaded automatically, but you can also download the model in advance.\n- Download VoxCPM1.5\n    ```\n    from huggingface_hub import snapshot_download\n    snapshot_download(\"openbmb\u002FVoxCPM1.5\")\n    ```\n\n- Or Download VoxCPM-0.5B\n    ```\n    from huggingface_hub import snapshot_download\n    snapshot_download(\"openbmb\u002FVoxCPM-0.5B\")\n    ```\n- Download ZipEnhancer and SenseVoice-Small. We use ZipEnhancer to enhance speech prompts and SenseVoice-Small for speech prompt ASR in the web demo. 
\n    ```\n    from modelscope import snapshot_download\n    snapshot_download('iic\u002Fspeech_zipenhancer_ans_multiloss_16k_base')\n    snapshot_download('iic\u002FSenseVoiceSmall')\n    ```\n\n### 2. Basic Usage\n```python\nimport soundfile as sf\nimport numpy as np\nfrom voxcpm import VoxCPM\n\nmodel = VoxCPM.from_pretrained(\"openbmb\u002FVoxCPM1.5\")\n\n# Non-streaming\nwav = model.generate(\n    text=\"VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.\",\n    prompt_wav_path=None,      # optional: path to a prompt speech for voice cloning\n    prompt_text=None,          # optional: reference text\n    cfg_value=2.0,             # LM guidance on LocDiT, higher for better adherence to the prompt, but maybe worse\n    inference_timesteps=10,   # LocDiT inference timesteps, higher for better result, lower for fast speed\n    normalize=False,           # enable external TN tool, but will disable native raw text support\n    denoise=False,             # enable external Denoise tool, but it may cause some distortion and restrict the sampling rate to 16kHz\n    retry_badcase=True,        # enable retrying mode for some bad cases (unstoppable)\n    retry_badcase_max_times=3,  # maximum retrying times\n    retry_badcase_ratio_threshold=6.0, # maximum length restriction for bad case detection (simple but effective), it could be adjusted for slow pace speech\n)\n\nsf.write(\"output.wav\", wav, model.tts_model.sample_rate)\nprint(\"saved: output.wav\")\n\n# Streaming\nchunks = []\nfor chunk in model.generate_streaming(\n    text = \"Streaming text to speech is easy with VoxCPM!\",\n    # supports same args as above\n):\n    chunks.append(chunk)\nwav = np.concatenate(chunks)\n\nsf.write(\"output_streaming.wav\", wav, model.tts_model.sample_rate)\nprint(\"saved: output_streaming.wav\")\n```\n\n### 3. 
CLI Usage\n\nAfter installation, the entry point is `voxcpm` (or use `python -m voxcpm.cli`).\n\n```bash\n# 1) Direct synthesis (single text)\nvoxcpm --text \"VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.\" --output out.wav\n\n# 2) Voice cloning (reference audio + transcript)\nvoxcpm --text \"VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.\" \\\n  --prompt-audio path\u002Fto\u002Fvoice.wav \\\n  --prompt-text \"reference transcript\" \\\n  --output out.wav \\\n  # --denoise\n\n# (Optional) Voice cloning (reference audio + transcript file)\nvoxcpm --text \"VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.\" \\\n  --prompt-audio path\u002Fto\u002Fvoice.wav \\\n  --prompt-file \"\u002Fpath\u002Fto\u002Ftext-file\" \\\n  --output out.wav \\\n  # --denoise\n\n# 3) Batch processing (one text per line)\nvoxcpm --input examples\u002Finput.txt --output-dir outs\n# (optional) Batch + cloning\nvoxcpm --input examples\u002Finput.txt --output-dir outs \\\n  --prompt-audio path\u002Fto\u002Fvoice.wav \\\n  --prompt-text \"reference transcript\" \\\n  # --denoise\n\n# 4) Inference parameters (quality\u002Fspeed)\nvoxcpm --text \"...\" --output out.wav \\\n  --cfg-value 2.0 --inference-timesteps 10 --normalize\n\n# 5) Model loading\n# Prefer local path\nvoxcpm --text \"...\" --output out.wav --model-path \u002Fpath\u002Fto\u002FVoxCPM_model_dir\n# Or from Hugging Face (auto download\u002Fcache)\nvoxcpm --text \"...\" --output out.wav \\\n  --hf-model-id openbmb\u002FVoxCPM1.5 --cache-dir ~\u002F.cache\u002Fhuggingface --local-files-only\n\n# 6) Denoiser control\nvoxcpm --text \"...\" --output out.wav \\\n  --no-denoiser --zipenhancer-path iic\u002Fspeech_zipenhancer_ans_multiloss_16k_base\n\n# 7) Help\nvoxcpm --help\npython -m voxcpm.cli --help\n```\n\n### 4. 
Start web demo\n\nYou can start the UI interface by running `python app.py`, which allows you to perform Voice Cloning and Voice Creation.\n\n### 5. Fine-tuning\n\nVoxCPM1.5 supports both full fine-tuning (SFT) and LoRA fine-tuning, allowing you to train personalized voice models on your own data. See the [Fine-tuning Guide](docs\u002Ffinetune.md) for detailed instructions.\n\n**Quick Start:**\n```bash\n# Full fine-tuning\npython scripts\u002Ftrain_voxcpm_finetune.py \\\n    --config_path conf\u002Fvoxcpm_v1.5\u002Fvoxcpm_finetune_all.yaml\n\n# LoRA fine-tuning\npython scripts\u002Ftrain_voxcpm_finetune.py \\\n    --config_path conf\u002Fvoxcpm_v1.5\u002Fvoxcpm_finetune_lora.yaml\n```\n\n## 📚 Documentation\n\n- **[Usage Guide](docs\u002Fusage_guide.md)** - Detailed guide on how to use VoxCPM effectively, including text input modes, voice cloning tips, and parameter tuning\n- **[Fine-tuning Guide](docs\u002Ffinetune.md)** - Complete guide for fine-tuning VoxCPM models with SFT and LoRA\n- **[Release Notes](docs\u002Frelease_note.md)** - Version history and updates\n- **[Performance Benchmarks](docs\u002Fperformance.md)** - Detailed performance comparisons on public benchmarks\n\n---\n\n## 📚 More Information\n\n###  🌟 Community Projects\nWe're excited to see the VoxCPM community growing! 
Here are some amazing projects and features built by our community:\n- **[ComfyUI-VoxCPM](https:\u002F\u002Fgithub.com\u002Fwildminder\u002FComfyUI-VoxCPM)** A VoxCPM extension for ComfyUI.\n- **[ComfyUI-VoxCPMTTS](https:\u002F\u002Fgithub.com\u002F1038lab\u002FComfyUI-VoxCPMTTS)** A VoxCPM extension for ComfyUI.\n- **[WebUI-VoxCPM](https:\u002F\u002Fgithub.com\u002Frsxdalv\u002Ftts_webui_extension.vox_cpm)** A template extension for TTS WebUI.\n- **[PR: Streaming API Support (by AbrahamSanders)](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fpull\u002F26)** \n- **[VoxCPM-NanoVLLM](https:\u002F\u002Fgithub.com\u002Fa710128\u002Fnanovllm-voxcpm)** NanoVLLM integration for VoxCPM for faster, high-throughput inference on GPU.\n- **[VoxCPM-ONNX](https:\u002F\u002Fgithub.com\u002Fbluryar\u002FVoxCPM-ONNX)** ONNX export for VoxCPM supports faster CPU inference.\n- **[VoxCPMANE](https:\u002F\u002Fgithub.com\u002F0seba\u002FVoxCPMANE)** VoxCPM TTS with Apple Neural Engine backend server.\n- **[PR: LoRA finetune web UI (by Ayin1412)](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fpull\u002F100)**\n- **[voxcpm_rs](https:\u002F\u002Fgithub.com\u002Fmadushan1000\u002Fvoxcpm_rs)** A re-implementation of VoxCPM-0.5B in Rust.\n\n*Note: The projects are not officially maintained by OpenBMB.*\n\n\n\n*Have you built something cool with VoxCPM? We'd love to feature it here! Please open an issue or pull request to add your project.*\n\n### 📊 Performance Highlights\n\nVoxCPM achieves competitive results on public zero-shot TTS benchmarks. See [Performance Benchmarks](docs\u002Fperformance.md) for detailed comparison tables.\n\n\n\n## ⚠️ Risks and limitations\n- General Model Behavior: While VoxCPM has been trained on a large-scale dataset, it may still produce outputs that are unexpected, biased, or contain artifacts.\n- Potential for Misuse of Voice Cloning: VoxCPM's powerful zero-shot voice cloning capability can generate highly realistic synthetic speech. 
This technology could be misused for creating convincing deepfakes for purposes of impersonation, fraud, or spreading disinformation. Users of this model must not use it to create content that infringes upon the rights of individuals. It is strictly forbidden to use VoxCPM for any illegal or unethical purposes. We strongly recommend that any publicly shared content generated with this model be clearly marked as AI-generated.\n- Current Technical Limitations: Although generally stable, the model may occasionally exhibit instability, especially with very long or expressive inputs. Furthermore, the current version offers limited direct control over specific speech attributes like emotion or speaking style.\n- Bilingual Model: VoxCPM is trained primarily on Chinese and English data. Performance on other languages is not guaranteed and may result in unpredictable or low-quality audio.\n- This model is released for research and development purposes only. We do not recommend its use in production or commercial applications without rigorous testing and safety evaluations. 
Please use VoxCPM responsibly.\n\n---\n\n## 📝 TO-DO List\nPlease stay tuned for updates!\n- [x] Release the VoxCPM technical report.\n- [x] Support higher sampling rate (44.1kHz in VoxCPM-1.5).\n- [x] Support SFT and LoRA fine-tuning.\n- [ ] Multilingual Support (besides ZH\u002FEN).\n- [ ] Controllable Speech Generation by Human Instruction.\n\n\n\n## 📄 License\nThe VoxCPM model weights and code are open-sourced under the [Apache-2.0](LICENSE) license.\n\n## 🙏 Acknowledgments\n\nWe extend our sincere gratitude to the following works and resources for their inspiration and contributions:\n\n- [DiTAR](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03930) for the diffusion autoregressive backbone used in speech generation\n- [MiniCPM-4](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM) for serving as the language model foundation\n- [CosyVoice](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FCosyVoice) for the implementation of Flow Matching-based LocDiT\n- [DAC](https:\u002F\u002Fgithub.com\u002Fdescriptinc\u002Fdescript-audio-codec) for providing the Audio VAE backbone\n\n## Institutions\n\nThis project is developed by the following institutions:\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_readme_f6e2504f7447.png\" width=\"28px\"> [ModelBest](https:\u002F\u002Fmodelbest.cn\u002F)\n\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_readme_3afab49d3c6c.png\" width=\"28px\"> [THUHCSI](https:\u002F\u002Fgithub.com\u002Fthuhcsi)\n\n\n## ⭐ Star History\n [![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_readme_c53faa263493.png)](https:\u002F\u002Fstar-history.com\u002F#OpenBMB\u002FVoxCPM&Date)\n\n\n## 📚 Citation\n\nIf you find our model helpful, please consider citing our projects 📝 and starring us ⭐️!\n\n```bib\n@article{voxcpm2025,\n  title        = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice 
Cloning},\n  author       = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and Gui, Jiancheng and Li, Kehan and Wu, Zhiyong  and Liu, Zhiyuan},\n  journal      = {arXiv preprint arXiv:2509.24650},\n  year         = {2025},\n}\n```\n","## 🎙️ VoxCPM：无分词器TTS，用于上下文感知语音生成与逼真声音克隆\n\n\n[![项目页面](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject%20Page-GitHub-blue)](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002F) [![技术报告](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTechnical%20Report-Arxiv-red)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24650)[![在线试用](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLive%20PlayGround-Demo-orange)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenBMB\u002FVoxCPM-Demo) [![样本音频](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAudio%20Samples-Page-green)](https:\u002F\u002Fopenbmb.github.io\u002FVoxCPM-demopage)\n\n#### VoxCPM1.5 模型权重\n\n [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-OpenBMB-yellow)](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM1.5) [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-OpenBMB-purple)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FVoxCPM1.5)  \n\n\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_readme_1da4dc4794a5.png\" alt=\"VoxCPM Logo\" width=\"40%\">\n  \n  \u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F17704\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_readme_4a68feb902da.png\" alt=\"OpenBMB%2FVoxCPM | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\n\n\u003Cdiv align=\"center\">\n\n👋 欢迎通过 [微信](assets\u002Fwechat.png) 联系我们\n\n\u003C\u002Fdiv>\n\n## 最新消息 \n\n* [2026.03.30] 
**VoxCPM2 即将发布** 🤗\n* [2025.12.05] 🎉 🎉 🎉  我们开源了 VoxCPM1.5 的 [权重](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM1.5)! 该模型现在支持全参数微调和高效的 LoRA 微调，使您能够创建属于自己的定制版本。详情请参阅 [发布说明](docs\u002Frelease_note.md)。\n* [2025.09.30] 🔥 🔥 🔥  我们发布了 VoxCPM 的 [技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24650)!\n* [2025.09.16] 🔥 🔥 🔥  我们开源了 VoxCPM-0.5B 的 [权重](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM-0.5B)!\n* [2025.09.16] 🎉 🎉 🎉  我们为 VoxCPM-0.5B 提供了 [Gradio 演示平台](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenBMB\u002FVoxCPM-Demo)，快来试试吧！ \n\n## 概述\n\nVoxCPM 是一种新颖的无分词器文本转语音（TTS）系统，重新定义了语音合成的真实感。通过在连续空间中建模语音，它克服了离散分词的局限性，并实现了两大旗舰功能：上下文感知的语音生成和逼真的零样本声音克隆。\n\n与将语音转换为离散标记的主流方法不同，VoxCPM 使用端到端扩散自回归架构，直接从文本生成连续的语音表示。它基于 [MiniCPM-4](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-0.5B) 骨干网络构建，通过层次化语言建模和 FSQ 约束实现隐式的语义-声学解耦，从而极大地提升了表达能力和生成稳定性。\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_readme_525bce2c3920.png\" alt=\"VoxCPM 模型架构\" width=\"90%\">\n\u003C\u002Fdiv>\n\n\n###  🚀 核心特性\n- **上下文感知、富有表现力的语音生成** - VoxCPM 能够理解文本并推断出合适的韵律，生成极具表现力和自然流畅的语音。它会根据内容自发地调整说话风格，基于180万小时的双语大规模语料库训练，产生高度贴合的语音表达。\n- **逼真的声音克隆** - 仅需一段简短的参考音频，VoxCPM 就能进行准确的零样本声音克隆，不仅捕捉到说话者的音色，还能细致入微地还原口音、情感基调、节奏和语速等特征，从而创造出忠实而自然的复制品。\n- **高效合成** - VoxCPM 支持流式合成，在消费级 NVIDIA RTX 4090 GPU 上实时因子（RTF）低至0.17，使其适用于实时应用场景。\n\n### 📦 模型版本\n详情请参阅 [发布说明](docs\u002Frelease_note.md)\n- **VoxCPM1.5**（最新）：\n  - 模型参数：8亿\n  - AudioVAE 采样率：44100\n  - 语言模型骨干中的标记频率：6.25Hz（补丁大小=4）\n  - 在单个 NVIDIA RTX 4090 GPU 上的 RTF：约 0.15\n\n- **VoxCPM-0.5B**（原始）：\n  - 模型参数：6.4亿\n  - AudioVAE 采样率：16000\n  - 语言模型骨干中的标记频率：12.5Hz（补丁大小=2）\n  - 在单个 NVIDIA RTX 4090 GPU 上的 RTF：0.17\n\n\n\n## 快速入门\n\n### 🔧 从 PyPI 安装\n``` sh\npip install voxcpm\n```\n### 1. 
模型下载（可选）\n默认情况下，首次运行脚本时，模型会自动下载，但您也可以提前下载模型。\n- 下载 VoxCPM1.5\n    ```\n    from huggingface_hub import snapshot_download\n    snapshot_download(\"openbmb\u002FVoxCPM1.5\")\n    ```\n\n- 或下载 VoxCPM-0.5B\n    ```\n    from huggingface_hub import snapshot_download\n    snapshot_download(\"openbmb\u002FVoxCPM-0.5B\")\n    ```\n- 下载 ZipEnhancer 和 SenseVoice-Small。我们在网页演示中使用 ZipEnhancer 来增强语音提示，以及 SenseVoice-Small 进行语音提示的 ASR。\n    ```\n    from modelscope import snapshot_download\n    snapshot_download('iic\u002Fspeech_zipenhancer_ans_multiloss_16k_base')\n    snapshot_download('iic\u002FSenseVoiceSmall')\n    ```\n\n### 2. 基本使用\n```python\nimport soundfile as sf\nimport numpy as np\nfrom voxcpm import VoxCPM\n\nmodel = VoxCPM.from_pretrained(\"openbmb\u002FVoxCPM1.5\")\n\n# 非流式\nwav = model.generate(\n    text=\"VoxCPM 是 ModelBest 推出的一款创新端到端 TTS 模型，旨在生成极具表现力的语音。\",\n    prompt_wav_path=None,      # 可选：用于声音克隆的提示语音路径\n    prompt_text=None,          # 可选：参考文本\n    cfg_value=2.0,             # LM 对 LocDiT 的引导强度，值越高越贴近提示，但也可能影响质量\n    inference_timesteps=10,   # LocDiT 推理步数，数值越高效果越好，但速度较慢\n    normalize=False,           # 启用外部文本规范化工具，但会禁用原生纯文本支持\n    denoise=False,             # 启用外部降噪工具，但可能会导致失真并将采样率限制为16kHz\n    retry_badcase=True,        # 启用对某些不良情况的重试模式（不可停止）\n    retry_badcase_max_times=3,  # 最大重试次数\n    retry_badcase_ratio_threshold=6.0, # 不良情况检测的最大长度限制（简单有效），对于语速较慢的语音可以适当调整\n)\n\nsf.write(\"output.wav\", wav, model.tts_model.sample_rate)\nprint(\"已保存：output.wav\")\n\n# 流式\nchunks = []\nfor chunk in model.generate_streaming(\n    text = \"使用 VoxCPM 进行流式文本转语音非常容易！\",\n    # 支持与上述相同的参数\n):\n    chunks.append(chunk)\nwav = np.concatenate(chunks)\n\nsf.write(\"output_streaming.wav\", wav, model.tts_model.sample_rate)\nprint(\"已保存：output_streaming.wav\")\n```\n\n### 3. 
CLI 使用\n\n安装完成后，入口命令为 `voxcpm`（或使用 `python -m voxcpm.cli`）。\n\n```bash\n# 1) 直接合成（单条文本）\nvoxcpm --text \"VoxCPM 是 ModelBest 推出的一款创新端到端 TTS 模型，旨在生成极具表现力的语音。\" --output out.wav\n\n# 2) 音色克隆（参考音频 + 文本转录）\nvoxcpm --text \"VoxCPM 是 ModelBest 推出的一款创新性端到端 TTS 模型，旨在生成极具表现力的语音。\" \\\n  --prompt-audio path\u002Fto\u002Fvoice.wav \\\n  --prompt-text \"参考文本转录\" \\\n  --output out.wav \\\n  # --denoise\n\n# （可选）音色克隆（参考音频 + 文本文件）\nvoxcpm --text \"VoxCPM 是 ModelBest 推出的一款创新性端到端 TTS 模型，旨在生成极具表现力的语音。\" \\\n  --prompt-audio path\u002Fto\u002Fvoice.wav \\\n  --prompt-file \"\u002Fpath\u002Fto\u002Ftext-file\" \\\n  --output out.wav \\\n  # --denoise\n\n# 3) 批量处理（每行一个文本）\nvoxcpm --input examples\u002Finput.txt --output-dir outs\n# （可选）批量 + 克隆\nvoxcpm --input examples\u002Finput.txt --output-dir outs \\\n  --prompt-audio path\u002Fto\u002Fvoice.wav \\\n  --prompt-text \"参考文本转录\" \\\n  # --denoise\n\n# 4) 推理参数（质量\u002F速度）\nvoxcpm --text \"...\" --output out.wav \\\n  --cfg-value 2.0 --inference-timesteps 10 --normalize\n\n# 5) 模型加载\n# 建议使用本地路径\nvoxcpm --text \"...\" --output out.wav --model-path \u002Fpath\u002Fto\u002FVoxCPM_model_dir\n# 或从 Hugging Face 加载（自动下载\u002F缓存）\nvoxcpm --text \"...\" --output out.wav \\\n  --hf-model-id openbmb\u002FVoxCPM1.5 --cache-dir ~\u002F.cache\u002Fhuggingface --local-files-only\n\n# 6) 去噪器控制\nvoxcpm --text \"...\" --output out.wav \\\n  --no-denoiser --zipenhancer-path iic\u002Fspeech_zipenhancer_ans_multiloss_16k_base\n\n# 7) 帮助信息\nvoxcpm --help\npython -m voxcpm.cli --help\n```\n\n### 4. 启动 Web 演示\n您可以通过运行 `python app.py` 来启动 UI 界面，从而进行音色克隆和语音合成。\n\n### 5. 
微调\nVoxCPM1.5 支持全量微调（SFT）和 LoRA 微调两种方式，允许您基于自己的数据训练个性化的语音模型。详细操作请参阅 [微调指南](docs\u002Ffinetune.md)。\n\n**快速入门：**\n```bash\n# 全量微调\npython scripts\u002Ftrain_voxcpm_finetune.py \\\n    --config_path conf\u002Fvoxcpm_v1.5\u002Fvoxcpm_finetune_all.yaml\n\n# LoRA 微调\npython scripts\u002Ftrain_voxcpm_finetune.py \\\n    --config_path conf\u002Fvoxcpm_v1.5\u002Fvoxcpm_finetune_lora.yaml\n```\n\n## 📚 文档\n- **[使用指南](docs\u002Fusage_guide.md)** - 详细介绍如何有效使用 VoxCPM，包括文本输入模式、音色克隆技巧及参数调优。\n- **[微调指南](docs\u002Ffinetune.md)** - 完整的 SFT 和 LoRA 微调教程。\n- **[发布说明](docs\u002Frelease_note.md)** - 版本历史与更新内容。\n- **[性能基准](docs\u002Fperformance.md)** - 在公开基准上的详细性能对比。\n\n---\n\n## 📚 更多信息\n\n###  🌟 社区项目\n我们很高兴看到 VoxCPM 社区不断壮大！以下是一些由社区开发者打造的优秀项目和功能：\n- **[ComfyUI-VoxCPM](https:\u002F\u002Fgithub.com\u002Fwildminder\u002FComfyUI-VoxCPM)** ComfyUI 的 VoxCPM 插件。\n- **[ComfyUI-VoxCPMTTS](https:\u002F\u002Fgithub.com\u002F1038lab\u002FComfyUI-VoxCPMTTS)** ComfyUI 的 VoxCPM 扩展。\n- **[WebUI-VoxCPM](https:\u002F\u002Fgithub.com\u002Frsxdalv\u002Ftts_webui_extension.vox_cpm)** TTS WebUI 的模板扩展。\n- **[PR：支持流式 API（由 AbrahamSanders 提供）](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fpull\u002F26)**\n- **[VoxCPM-NanoVLLM](https:\u002F\u002Fgithub.com\u002Fa710128\u002Fnanovllm-voxcpm)** 将 NanoVLLM 与 VoxCPM 结合，实现 GPU 上更快、更高吞吐量的推理。\n- **[VoxCPM-ONNX](https:\u002F\u002Fgithub.com\u002Fbluryar\u002FVoxCPM-ONNX)** 导出 ONNX 格式的 VoxCPM，支持更快速的 CPU 推理。\n- **[VoxCPMANE](https:\u002F\u002Fgithub.com\u002F0seba\u002FVoxCPMANE)** 使用 Apple Neural Engine 后端服务器的 VoxCPM TTS。\n- **[PR：LoRA 微调 Web UI（由 Ayin1412 提供）](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fpull\u002F100)**\n- **[voxcpm_rs](https:\u002F\u002Fgithub.com\u002Fmadushan1000\u002Fvoxcpm_rs)** 用 Rust 重新实现的 VoxCPM-0.5B。\n\n*注：这些项目并非由 OpenBMB 官方维护。*\n\n*您是否用 VoxCPM 构建了什么酷炫的东西？我们非常乐意在此展示您的作品！请提交 issue 或 pull request 以添加您的项目。*\n\n### 📊 性能亮点\nVoxCPM 在公开的零样本 TTS 基准测试中表现出色。详细对比表格请参阅 [性能基准](docs\u002Fperformance.md)。\n\n## ⚠️ 风险与限制\n- 模型通用行为：尽管 
VoxCPM 经过大规模数据训练，仍可能产生意外、有偏见或包含伪影的输出。\n- 音色克隆的潜在滥用：VoxCPM 强大的零样本音色克隆能力可以生成高度逼真的合成语音。这种技术可能被滥用于制作令人信服的深度伪造视频或音频，以达到冒充、欺诈或散布虚假信息的目的。用户不得利用该模型生成侵犯他人权益的内容。严禁将 VoxCPM 用于任何非法或不道德的目的。我们强烈建议，所有使用该模型生成并公开分享的内容都应明确标注为 AI 生成。\n- 当前技术局限性：尽管模型总体稳定，但在处理超长或极具表现力的输入时，偶尔可能出现不稳定现象。此外，当前版本对情感、语调等特定语音属性的直接控制能力有限。\n- 双语模型：VoxCPM 主要基于中文和英文数据训练。对于其他语言的支持并不保证，可能导致不可预测或低质量的音频输出。\n- 本模型仅用于研究和开发目的。未经严格测试和安全评估，不建议将其用于生产或商业用途。请负责任地使用 VoxCPM。\n\n---\n\n## 📝 待办事项\n敬请关注后续更新！\n- [x] 发布 VoxCPM 技术报告。\n- [x] 支持更高采样率（VoxCPM-1.5 已支持 44.1kHz）。\n- [x] 支持 SFT 和 LoRA 微调。\n- [ ] 多语言支持（除中文和英文外）。\n- [ ] 实现通过人类指令控制语音生成。\n\n## 📄 许可证\nVoxCPM 的模型权重和代码均采用 [Apache-2.0](LICENSE) 开源许可证。\n\n## 🙏 致谢\n我们衷心感谢以下工作和资源提供的灵感与贡献：\n\n- [DiTAR](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03930) 用于语音生成的扩散自回归骨干网络。\n- [MiniCPM-4](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM) 作为语言模型的基础。\n- [CosyVoice](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FCosyVoice) 实现的基于 Flow Matching 的 LocDiT。\n- [DAC](https:\u002F\u002Fgithub.com\u002Fdescriptinc\u002Fdescript-audio-codec) 提供的音频 VAE 骨干网络。\n\n## 机构\n本项目由以下机构共同开发：\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_readme_f6e2504f7447.png\" width=\"28px\"> [ModelBest](https:\u002F\u002Fmodelbest.cn\u002F)\n\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_readme_3afab49d3c6c.png\" width=\"28px\"> [THUHCSI](https:\u002F\u002Fgithub.com\u002Fthuhcsi)\n\n\n## ⭐ 星标历史\n [![星标历史图](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_readme_c53faa263493.png)](https:\u002F\u002Fstar-history.com\u002F#OpenBMB\u002FVoxCPM&Date)\n\n## 📚 引用\n\n如果您觉得我们的模型有帮助，请考虑引用我们的项目 📝 并给我们的项目加星 ⭐️！\n\n```bib\n@article{voxcpm2025,\n  title        = {VoxCPM：无需分词器的TTS，用于上下文感知语音生成和逼真的人声克隆},\n  author       = {周义轩、曾国阳、刘欣、李翔、于仁杰、王子洋、叶润川、孙伟岳、桂建诚、李可涵、吴志勇、刘志远},\n  journal      = {arXiv预印本 arXiv:2509.24650},\n  year         = {2025},\n}\n```","# VoxCPM 快速上手指南\n\nVoxCPM 是一款无需分词器（Tokenizer-Free）的端到端文本转语音（TTS）模型，支持上下文感知的 
富有表现力的语音生成和高保真零样本声音克隆。\n\n## 环境准备\n\n*   **操作系统**: Linux \u002F macOS \u002F Windows\n*   **Python**: 3.10 及以上版本\n*   **GPU**: 推荐 NVIDIA GPU (如 RTX 4090)，显存建议 16GB 以上以获得最佳推理速度。\n*   **依赖库**: `torch`, `soundfile`, `numpy` 等（安装时会自动处理）。\n\n## 安装步骤\n\n### 1. 安装核心库\n通过 PyPI 直接安装：\n```bash\npip install voxcpm\n```\n\n### 2. 下载模型与辅助组件（推荐国内加速）\n虽然首次运行会自动下载，但建议预先下载以提升稳定性。国内用户推荐使用 **ModelScope** 或配置 Hugging Face 镜像。\n\n**方案 A：使用 ModelScope 下载（推荐国内用户）**\n```python\nfrom modelscope import snapshot_download\n\n# 下载主模型 (VoxCPM1.5)\nsnapshot_download('OpenBMB\u002FVoxCPM1.5')\n\n# 下载语音增强和识别组件 (Web Demo 及去噪功能必需)\nsnapshot_download('iic\u002Fspeech_zipenhancer_ans_multiloss_16k_base')\nsnapshot_download('iic\u002FSenseVoiceSmall')\n```\n\n**方案 B：使用 Hugging Face (需配置镜像)**\n若网络环境允许，可设置镜像后下载：\n```bash\nexport HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n```\n```python\nfrom huggingface_hub import snapshot_download\nsnapshot_download(\"openbmb\u002FVoxCPM1.5\")\n```\n\n## 基本使用\n\n### 方法一：Python 代码调用\n\n以下示例展示如何加载模型并生成语音（非流式）：\n\n```python\nimport soundfile as sf\nimport numpy as np\nfrom voxcpm import VoxCPM\n\n# 加载模型 (自动读取本地缓存或从云端下载)\nmodel = VoxCPM.from_pretrained(\"openbmb\u002FVoxCPM1.5\")\n\n# 生成语音\nwav = model.generate(\n    text=\"VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.\",\n    prompt_wav_path=None,      # 可选：填入参考音频路径以实现声音克隆\n    prompt_text=None,          # 可选：参考音频对应的文本\n    cfg_value=2.0,             # 引导系数，越高越贴近提示音，但可能影响自然度\n    inference_timesteps=10,    # 推理步数，越高效果越好，越低速度越快\n    normalize=False,           # 是否启用外部文本归一化\n    denoise=False,             # 是否启用外部去噪 (会限制采样率为 16kHz)\n    retry_badcase=True,        # 开启坏案例重试机制\n    retry_badcase_max_times=3, # 最大重试次数\n    retry_badcase_ratio_threshold=6.0, # 坏案例检测阈值\n)\n\n# 保存结果\nsf.write(\"output.wav\", wav, model.tts_model.sample_rate)\nprint(\"saved: output.wav\")\n```\n\n**流式生成示例：**\n```python\nchunks = []\nfor chunk in model.generate_streaming(\n    
text=\"Streaming text to speech is easy with VoxCPM!\",\n):\n    chunks.append(chunk)\nwav = np.concatenate(chunks)\nsf.write(\"output_streaming.wav\", wav, model.tts_model.sample_rate)\n```\n\n### 方法二：命令行工具 (CLI)\n\n安装完成后，可直接使用 `voxcpm` 命令。\n\n**1. 基础文本转语音**\n```bash\nvoxcpm --text \"VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.\" --output out.wav\n```\n\n**2. 声音克隆（参考音频 + 参考文本）**\n```bash\nvoxcpm --text \"目标合成文本\" \\\n  --prompt-audio path\u002Fto\u002Fvoice.wav \\\n  --prompt-text \"参考音频对应的文本\" \\\n  --output out.wav\n```\n\n**3. 指定本地模型路径**\n```bash\nvoxcpm --text \"...\" --output out.wav --model-path \u002Fpath\u002Fto\u002FVoxCPM_model_dir\n```\n\n**4. 查看帮助**\n```bash\nvoxcpm --help\n```\n\n### 方法三：启动 Web 界面\n运行以下命令启动图形化演示界面，支持在线声音克隆和参数调整：\n```bash\npython app.py\n```","一家专注于有声书制作的初创团队，正试图将大量经典文学作品快速转化为具有情感张力的多人广播剧。\n\n### 没有 VoxCPM 时\n- **语调机械生硬**：传统 TTS 工具无法理解上下文语境，朗读悲伤或紧张情节时依然保持平淡的播音腔，缺乏必要的情感起伏。\n- **克隆成本高昂**：为每个角色定制声音需要录制数小时的高质量素材进行模型训练，且难以捕捉说话人独特的口音和呼吸节奏。\n- **后期处理繁琐**：生成的音频往往需要人工逐句调整停顿和重音，甚至重新录制，导致制作周期长达数周。\n- **资源消耗巨大**：为了追求稍好的音质，必须依赖昂贵的云端算力集群，无法在本地开发机上实时预览效果。\n\n### 使用 VoxCPM 后\n- **情感自然流露**：VoxCPM 能深度理解文本语义，自动根据剧情推断出恰当的语调，让角色在惊恐时声音颤抖、在温馨时语速柔和。\n- **即时高保真克隆**：仅需提供一段几秒钟的角色参考音频，VoxCPM 即可实现零样本克隆，精准还原说话人的音色、方言口音及细微的情绪特征。\n- **工作流大幅提速**：端到端的连续空间生成技术消除了分词误差，输出音频流畅自然，团队可将原本数周的后期工作压缩至几小时内完成。\n- **本地实时合成**：凭借高效的架构，VoxCPM 在单张消费级 RTX 4090 显卡上即可实现超低延迟的流式合成，创作者能边写剧本边实时试听效果。\n\nVoxCPM 通过无分词的连续建模技术，真正实现了“懂内容、像真人”的语音生成，让高质量有声内容的规模化生产变得触手可及。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_VoxCPM_1da4dc47.png","OpenBMB","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FOpenBMB_02e4bd39.png","OpenBMB (Open Lab for Big Model Base) aims to build foundation models and systems towards 
AGI.",null,"openbmb@gmail.com","https:\u002F\u002Fwww.openbmb.cn","https:\u002F\u002Fgithub.com\u002FOpenBMB",[83],{"name":84,"color":85,"percentage":86},"Python","#3572A5",100,6239,750,"2026-04-03T11:07:57","Apache-2.0","未说明","需要 NVIDIA GPU（文中提及 RTX 4090），显存需求未明确说明，CUDA 版本未说明",{"notes":94,"python":91,"dependencies":95},"支持通过 PyPI 直接安装 (pip install voxcpm)。模型首次运行时会自动下载，也可手动从 Hugging Face 或 ModelScope 下载。提供 VoxCPM1.5 (800M 参数) 和 VoxCPM-0.5B (640M 参数) 两个版本。在单张 NVIDIA RTX 4090 上实时率 (RTF) 可达 0.15-0.17，支持流式合成。支持全量微调和 LoRA 微调。若使用去噪功能，采样率将被限制为 16kHz。",[96,97,98,99,100],"voxcpm","soundfile","numpy","huggingface_hub","modelscope",[55,13],[103,104,105,106,107,108,109,110,111,112,113],"audio","deeplearning","minicpm","python","pytorch","speech","speech-synthesis","text-to-speech","tts","tts-model","voice-cloning",7,"2026-03-27T02:49:30.150509","2026-04-06T08:09:05.647981",[118,123,127,132,137,141,146,151],{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},10544,"微调 LoRA 时，是否支持一个音色对应多个语种？","是的，在微调 LoRA 的情况下，应该是可以支持一个音色对应多个语种的。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fissues\u002F183",{"id":124,"question_zh":125,"answer_zh":126,"source_url":122},10545,"拥有大量数据（如 800 分钟音频、12 个声线）时，为了提升稳定性建议全参数微调还是 LoRA？推荐参数是多少？","对于 800 分钟的数据量，可以尝试全量微调以提升稳定性。关于训练步数，2000 步可能偏少，建议尝试增加到 20000 步（2w step），并观察 epoch 数。虽然官方暂未进行大规模全量微调测试，但可根据数据量多训一些步数。",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},10546,"合成音频时长不稳定（有时很短有时很长）是什么原因？","这通常是因为生成的音频尾部包含了很长的静音片段，导致总时长变长。这种情况在微调后的模型中也可能出现。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fissues\u002F168",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},10547,"为什么每次推理前都有 5-10 秒的预热时间？如何消除以服务化部署？","这个预热环节是为了在支持 torch compile 的设备上提前完成编译（compile）。该过程仅在模型加载（load）时发生一次。模型加载完成后，后续调用接口进行推理时不会再有这一步骤，因此封装服务时无需担心每次请求都预热。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fissues\u002F67",{"id":138,"question_zh":139,"answer_zh":140,"source_url":136},10548,"Mac (M4) 
设备上流式播放语音时有明显的停滞卡顿感，如何解决？","这是因为 Mac 上的推理速度无法完全达到实时要求，导致生成速度跟不上播放速度。在 M4 芯片上，VoxCPM 的帧率约为 9.5 帧\u002F秒，而要实现流畅的实时播放，推理速度需要超过 12.5 帧\u002F秒。这是硬件性能瓶颈导致的。",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},10549,"合成音频时出现意外的“唱歌”行为或概率性异常，是什么原因？","这通常与模型本身的稳定性以及超参数 cfg_value 的设置有关。建议尝试升级到 VoxCPM 1.5 版本，该版本稳定性有所提升。如果问题依旧，可以调整 cfg_value 参数进行测试。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fissues\u002F136",{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},10550,"使用 torch.compile 时报错或无法成功编译，需要什么环境配置？","成功编译通常需要特定的版本组合。在 RTX 4090 上验证可用的配置参考如下：\n- OS: Ubuntu 22.04\n- Python: 3.10.12\n- PyTorch: 2.5.1 (或 2.7.1+cu126)\n- CUDA: 12.6\n- Triton: 3.1.0 (或 3.3.1)\n- einops: 0.8.1\n确保安装了正确版本的 triton 和 einops 是关键。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fissues\u002F19",{"id":152,"question_zh":153,"answer_zh":154,"source_url":155},10551,"VoxCPM 是否支持高并发推理或 vLLM 加速？","原生版本在高并发方面可能存在限制，但社区已开源了基于 nanovllm 的适配版本，可用于提升并发性能。\n项目地址：https:\u002F\u002Fgithub.com\u002Fa710128\u002Fnanovllm-voxcpm\n使用方法：克隆该项目后运行 `pip install -e .` 安装。\n注意：受限于 nanovllm，目前仅支持 CUDA 设备。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fissues\u002F48",[157,162,167,172,177,182,186,190,194,198],{"id":158,"version":159,"summary_zh":160,"released_at":161},71096,"1.5.0","**Full Changelog**: https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fcompare\u002F1.0.5...1.5.0\r\n\r\nWhat's New and What's Next? 
See our [Release Note](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fblob\u002Fmain\u002Fdocs\u002Frelease_note.md)","2025-12-05T14:48:03",{"id":163,"version":164,"summary_zh":165,"released_at":166},71097,"1.0.5","* Supports MPS devices.\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fcompare\u002F1.0.4...1.0.5","2025-10-09T05:22:50",{"id":168,"version":169,"summary_zh":170,"released_at":171},71098,"1.0.4","## What's Changed\r\n* add prompt-file option to set prompt text by @MayDomine in https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fpull\u002F17\r\n* Add a streaming API for VoxCPM by @AbrahamSanders in https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fpull\u002F26\r\n* **Update the minimum Python version to 3.10** to support Gradio 5.\r\n\r\n## New Contributors\r\n* @MayDomine made their first contribution in https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fpull\u002F17\r\n* @AbrahamSanders made their first contribution in https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fpull\u002F26\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fcompare\u002F1.0.3...1.0.4","2025-09-23T06:12:15",{"id":173,"version":174,"summary_zh":175,"released_at":176},71099,"1.0.3","**Full Changelog**: https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fcompare\u002F1.0.2...1.0.3","2025-09-18T12:06:51",{"id":178,"version":179,"summary_zh":180,"released_at":181},71100,"1.0.2","**Full Changelog**: 
https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fcompare\u002F1.0.1...1.0.2","2025-09-17T13:18:47",{"id":183,"version":184,"summary_zh":78,"released_at":185},71101,"1.0.1","2025-09-16T08:54:42",{"id":187,"version":188,"summary_zh":78,"released_at":189},71102,"1.0.0rc3","2025-09-16T06:14:04",{"id":191,"version":192,"summary_zh":78,"released_at":193},71103,"1.0.0rc2","2025-09-16T05:23:01",{"id":195,"version":196,"summary_zh":78,"released_at":197},71104,"1.0.0rc1","2025-09-16T05:13:34",{"id":199,"version":200,"summary_zh":78,"released_at":201},71105,"1.0.0","2025-09-16T08:06:46"]