[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-FunAudioLLM--CosyVoice":3,"tool-FunAudioLLM--CosyVoice":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 
50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":77,"owner_location":77,"owner_email":77,"owner_twitter":77,"owner_website":77,"owner_url":78,"languages":79,"stars":92,"forks":93,"last_commit_at":94,"license":95,"difficulty_score":10,"env_os":96,"env_gpu":97,"env_ram":98,"env_deps":99,"category_tags":111,"github_topics":112,"view_count":23,"oss_zip_url":77,"oss_zip_packed_at":77,"status":16,"created_at":132,"updated_at":133,"faqs":134,"releases":165},3778,"FunAudioLLM\u002FCosyVoice","CosyVoice","Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.","CosyVoice 是一款基于大语言模型的多语言语音生成开源项目，致力于提供从训练、推理到部署的全栈能力。它主要解决了传统语音合成在跨语言场景下音色不一致、情感不自然以及方言支持匮乏的难题，让用户能够轻松实现高质量的零样本语音克隆。\n\n无论是需要构建智能客服的开发者、研究多模态交互的科研人员，还是希望为视频内容配音的创作者，都能从中受益。CosyVoice 不仅支持中、英、日、韩等 9 种主流语言，还涵盖粤语、四川话等 18 种以上中方言，并能通过指令灵活控制语速、情绪和音量。其独特的技术亮点包括无需传统前端即可识别数字与符号的文本规范化能力、支持拼音与音素修正的发音“修补”功能，以及低至 150 毫秒的双向流式推理延迟。凭借在内容一致性、说话人相似度和韵律自然度上的卓越表现，CosyVoice 已成为当前开源领域极具竞争力的语音合成解决方案。","![SVG Banners](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFunAudioLLM_CosyVoice_readme_3ec6b8dca44a.png)\n\n## 👉🏻 CosyVoice 👈🏻\n\n**Fun-CosyVoice 3.0**: [Demos](https:\u002F\u002Ffunaudiollm.github.io\u002Fcosyvoice3\u002F); [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.17589); [Modelscope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FFunAudioLLM\u002FFun-CosyVoice3-0.5B-2512); [Huggingface](https:\u002F\u002Fhuggingface.co\u002FFunAudioLLM\u002FFun-CosyVoice3-0.5B-2512); [CV3-Eval](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FCV3-Eval)\n\n**CosyVoice 2.0**: [Demos](https:\u002F\u002Ffunaudiollm.github.io\u002Fcosyvoice2\u002F); [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.10117); [Modelscope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fiic\u002FCosyVoice2-0.5B); [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FFunAudioLLM\u002FCosyVoice2-0.5B)\n\n**CosyVoice 1.0**: [Demos](https:\u002F\u002Ffun-audio-llm.github.io); [Paper](https:\u002F\u002Ffunaudiollm.github.io\u002Fpdf\u002FCosyVoice_v1.pdf); [Modelscope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fiic\u002FCosyVoice-300M); [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FFunAudioLLM\u002FCosyVoice-300M)\n\n## Highlight🔥\n\n**Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content 
### Key Features

- **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.), and supports both multi-lingual and cross-lingual zero-shot voice cloning.
- **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
- **Pronunciation Inpainting**: Supports pronunciation inpainting of Chinese Pinyin and English CMU phonemes, providing more controllability and making it suitable for production use.
- **Text Normalization**: Supports reading of numbers, special symbols, and various text formats without a traditional frontend module.
- **Bi-Streaming**: Supports both text-in streaming and audio-out streaming, achieving latency as low as 150 ms while maintaining high-quality audio output.
- **Instruct Support**: Supports various instructions covering languages, dialects, emotions, speed, volume, etc.

## Roadmap

- [x] 2025/12

    - [x] Release Fun-CosyVoice3-0.5B-2512 base model, RL model, and their training/inference scripts
    - [x] Release Fun-CosyVoice3-0.5B ModelScope Gradio space

- [x] 2025/08

    - [x] Thanks to the contribution from NVIDIA's Yuekai Zhang, add Triton TRT-LLM runtime support and CosyVoice2 GRPO training support

- [x] 2025/07

    - [x] Release Fun-CosyVoice 3.0 eval set

- [x] 2025/05

    - [x] Add CosyVoice2-0.5B vLLM support

- [x] 2024/12

    - [x] 25 Hz CosyVoice2-0.5B released

- [x] 2024/09

    - [x] 25 Hz CosyVoice-300M base model
    - [x] 25 Hz CosyVoice-300M voice conversion function

- [x] 2024/08

    - [x] Repetition-aware sampling (RAS) inference for LLM stability
    - [x] Streaming inference mode support, including KV cache and SDPA for RTF optimization

- [x] 2024/07

    - [x] Flow matching training support
    - [x] WeTextProcessing support when ttsfrd is not available
    - [x] FastAPI server and client

## Evaluation

| Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>SS (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>SS (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>SS (%) ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
## Install

### Clone and install

- Clone the repo
    ``` sh
    git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
    # If you failed to clone the submodule due to network failures, run the following command until it succeeds
    cd CosyVoice
    git submodule update --init --recursive
    ```

- Install Conda: see https://docs.conda.io/en/latest/miniconda.html
- Create the Conda env:

    ``` sh
    conda create -n cosyvoice -y python=3.10
    conda activate cosyvoice
    pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

    # If you encounter sox compatibility issues
    # Ubuntu
    sudo apt-get install sox libsox-dev
    # CentOS
    sudo yum install sox sox-devel
    ```

### Model download

We strongly recommend downloading the pretrained `Fun-CosyVoice3-0.5B`, `CosyVoice2-0.5B`, `CosyVoice-300M`, `CosyVoice-300M-SFT`, and `CosyVoice-300M-Instruct` models and the `CosyVoice-ttsfrd` resource.

``` python
# ModelScope SDK model download
from modelscope import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')

# For overseas users, Hugging Face SDK model download
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('FunAudioLLM/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('FunAudioLLM/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```

Optionally, you can unzip the `ttsfrd` resource and install the `ttsfrd` package for better text normalization performance. Note that this step is not necessary; if you do not install the `ttsfrd` package, wetext is used by default.

``` sh
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
```
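With the models in place, a minimal zero-shot cloning sketch in the style of the repo's `example.py` (the class, method, and sample-asset names below follow the repo's `cosyvoice.cli` module and `asset/` folder at the time of writing; verify them against your checkout, and note that Fun-CosyVoice3 usage may differ — see `example.py`):

``` python
# Zero-shot voice cloning sketch with CosyVoice2 (mirrors example.py).
import sys
sys.path.append('third_party/Matcha-TTS')  # required for the bundled Matcha-TTS submodule

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',
                       load_jit=False, load_trt=False, fp16=False)

# 16 kHz reference audio of the voice to clone, plus its transcript.
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
for i, out in enumerate(cosyvoice.inference_zero_shot(
        'Text to synthesize in the cloned voice.',
        'Transcript of the prompt audio.',
        prompt_speech_16k, stream=False)):
    # Each yielded dict carries a 'tts_speech' tensor ready to save.
    torchaudio.save(f'zero_shot_{i}.wav', out['tts_speech'], cosyvoice.sample_rate)
```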
### Basic Usage

We strongly recommend using `Fun-CosyVoice3-0.5B` for better performance.
Follow the code in `example.py` for detailed usage of each model.
```sh
python example.py
```

#### vLLM Usage

CosyVoice2/3 now supports **vLLM 0.11.x+ (V1 engine)** and **vLLM 0.9.0 (legacy)**.
Older vLLM versions (<0.9.0) do not support CosyVoice inference, and versions in between (e.g., 0.10.x) are untested.

Note that vLLM pins many specific requirements. Create a separate environment so that, if your hardware turns out not to support vLLM, your original environment is not corrupted.

``` sh
conda create -n cosyvoice_vllm --clone cosyvoice
conda activate cosyvoice_vllm
# for vllm==0.9.0
pip install vllm==v0.9.0 transformers==4.51.3 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# for vllm>=0.11.0
pip install vllm==v0.11.0 transformers==4.57.1 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
python vllm_example.py
```

#### Start web demo

You can use our web demo page to get familiar with CosyVoice quickly. See the demo website for details.

``` sh
# use pretrained_models/CosyVoice-300M-SFT for SFT inference,
# or pretrained_models/CosyVoice-300M-Instruct for instruct inference
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
```

#### Advanced Usage

For advanced users, training and inference scripts are provided in `examples/libritts`.

#### Build for deployment

Optionally, if you want service deployment, you can run the following steps.

``` sh
cd runtime/python
docker build -t cosyvoice:v1.0 .
# change iic/CosyVoice-300M to iic/CosyVoice-300M-Instruct if you want to use instruct inference
# for grpc usage
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && python3 server.py --port 50000 --max_conc 4 --model_dir iic/CosyVoice-300M && sleep infinity"
cd grpc && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
# for fastapi usage
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && python3 server.py --port 50000 --model_dir iic/CosyVoice-300M && sleep infinity"
cd fastapi && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
```

#### Using Nvidia TensorRT-LLM for deployment

Using TensorRT-LLM to accelerate the CosyVoice2 LLM gives roughly a 4x speedup over the Hugging Face Transformers implementation. Quick start:

``` sh
cd runtime/triton_trtllm
docker compose up -d
```
For more details, see [runtime/triton_trtllm](https://github.com/FunAudioLLM/CosyVoice/tree/main/runtime/triton_trtllm).
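For programmatic access to the FastAPI container above, here is a minimal client sketch. The route, form fields, streaming pattern, and 22050 Hz output rate mirror the repo's `runtime/python/fastapi/client.py` as I recall it; treat all of them as assumptions and check them against your version, and note that SFT mode assumes the SFT model with one of its built-in speaker IDs:

``` python
# Hedged client sketch for the FastAPI service started above.
import numpy as np
import requests
import soundfile as sf  # assumption: any WAV writer works here

url = "http://127.0.0.1:50000/inference_sft"          # assumed route
payload = {"tts_text": "Hello from CosyVoice.",
           "spk_id": "中文女"}                          # assumed built-in SFT speaker

audio_bytes = b""
# The server streams raw 16-bit PCM chunks (pattern from client.py).
with requests.request("GET", url, data=payload, stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=16000):
        audio_bytes += chunk

pcm = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0
sf.write("demo.wav", pcm, 22050)  # assumed CosyVoice-300M output rate
```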
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFunAudioLLM_CosyVoice_readme_c4bd807139dd.png\" width=\"250px\">\n\n## Acknowledge\n\n1. We borrowed a lot of code from [FunASR](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FFunASR).\n2. We borrowed a lot of code from [FunCodec](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FFunCodec).\n3. We borrowed a lot of code from [Matcha-TTS](https:\u002F\u002Fgithub.com\u002Fshivammehta25\u002FMatcha-TTS).\n4. We borrowed a lot of code from [AcademiCodec](https:\u002F\u002Fgithub.com\u002Fyangdongchao\u002FAcademiCodec).\n5. We borrowed a lot of code from [WeNet](https:\u002F\u002Fgithub.com\u002Fwenet-e2e\u002Fwenet).\n\n## Citations\n\n``` bibtex\n@article{du2024cosyvoice,\n  title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},\n  author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},\n  journal={arXiv preprint arXiv:2407.05407},\n  year={2024}\n}\n\n@article{du2024cosyvoice,\n  title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},\n  author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},\n  journal={arXiv preprint arXiv:2412.10117},\n  year={2024}\n}\n\n@article{du2025cosyvoice,\n  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},\n  author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},\n  journal={arXiv preprint arXiv:2505.17589},\n  year={2025}\n}\n\n@inproceedings{lyu2025build,\n  title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},\n  author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},\n  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},\n  pages={1--2},\n  year={2025},\n  organization={IEEE}\n}\n```\n\n## Disclaimer\nThe content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. 
# CosyVoice Quickstart Guide

CosyVoice is an advanced LLM-based text-to-speech (TTS) system featuring multilingual and multi-dialect zero-shot voice cloning, highly natural low-latency streaming synthesis, and instruction control. The newly released **Fun-CosyVoice3-0.5B** model is recommended for best results.

## Environment

*   **Operating system**: Linux (Ubuntu/CentOS recommended)
*   **Python**: 3.10
*   **Hardware**: NVIDIA GPU with CUDA drivers installed; Docker deployment additionally requires `nvidia-container-toolkit`
*   **Prerequisites**:
    *   Conda (environment management)
    *   Git
    *   SoX (audio processing)

## Installation

### 1. Clone the repository
```bash
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
# If the submodule clone fails, retry with:
git submodule update --init --recursive
```

### 2. Create and activate the Conda environment
```bash
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
```

### 3. Install dependencies and system libraries
The Aliyun mirror speeds up Python package installation (mainland China):
```bash
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```

Install the system-level audio library `sox`:
*   **Ubuntu/Debian**:
    ```bash
    sudo apt-get install sox libsox-dev
    ```
*   **CentOS**:
    ```bash
    sudo yum install sox sox-devel
    ```
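Before moving on to the model download in step 4, a quick sanity check that PyTorch in the fresh environment actually sees the GPU can save a long debugging session later; a minimal sketch:

```python
# Sanity check: confirm the freshly created environment sees a CUDA GPU.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```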
### 4. Download pretrained models
ModelScope is the faster mirror in mainland China (Fun-CosyVoice3, CosyVoice2, and the base models):

```python
from modelscope import snapshot_download

# Recommended Fun-CosyVoice3 model
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')

# CosyVoice2 model
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')

# Base model and frontend resource
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```

*(Optional) Better text normalization:*
For better reading of numbers and symbols, install the `ttsfrd` package:
```bash
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
```
*Note: without this package the system falls back to WeTextProcessing.*

## Basic usage

### Run the example script
Once installation is complete, run the official example script for the most basic synthesis:
```bash
python example.py
```

### Launch the web demo
Try the different modes (SFT, zero-shot cloning, instruction control, etc.) interactively in a browser:
```bash
# Start the service on port 50000
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
```
Then open `http://localhost:50000`.

### Advanced: vLLM-accelerated inference
For higher-throughput serving, create a dedicated environment and install vLLM (v0.9.0 or v0.11.x+):
```bash
conda create -n cosyvoice_vllm --clone cosyvoice
conda activate cosyvoice_vllm

# Install vLLM (0.11.0 shown here)
pip install vllm==v0.11.0 transformers==4.57.1 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

# Run the vLLM example
python vllm_example.py
```
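Beyond the batch example above, the bi-streaming mode can also be exercised from Python. A sketch, with the same hedges as the earlier zero-shot example (method names follow the repo's example code and may differ in your version):

```python
# Streaming synthesis sketch: chunks arrive as they are generated,
# which is what enables the ~150 ms first-packet latency.
import sys
sys.path.append('third_party/Matcha-TTS')

import torch
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt = load_wav('./asset/zero_shot_prompt.wav', 16000)

chunks = []
for out in cosyvoice.inference_zero_shot(
        'A longer paragraph to synthesize incrementally.',
        'Transcript of the prompt audio.',
        prompt, stream=True):          # stream=True -> incremental chunks
    chunks.append(out['tts_speech'])   # play or forward each chunk here

torchaudio.save('streamed.wav', torch.cat(chunks, dim=1), cosyvoice.sample_rate)
```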
## Use Case: Multilingual Game Dubbing

A game studio shipping an RPG worldwide needs character voice-overs in Chinese, English, Japanese, Korean, and several Chinese dialects to serve players in different regions.

### Without CosyVoice
- **Costly multilingual recording**: hiring professional voice actors per language blows the budget, and coordinating studios across countries stretches the schedule by weeks.
- **Dialect and emotion are hard to combine**: for a "Sichuanese" or "Dongbei" version of a line, it is extremely hard to find an actor who speaks the dialect authentically *and* can deliver a specific emotion such as anger or grief.
- **Revisions are painful**: any script tweak means recalling the original cast to the studio; instant updates are impossible, slowing version iteration.
- **Inconsistent character voices**: the same character sounds different across language versions, breaking the sense of a unified persona and player immersion.

### With CosyVoice
- **Zero-shot cross-lingual cloning**: from a single reference clip, the same timbre can be synthesized fluently across 9 major languages and 18+ Chinese dialects, covering global dubbing in one step.
- **Instruction-controlled delivery**: natural-language instructions adjust speed, volume, and emotion, producing authentically accented, emotionally rich lines without depending on a specific actor.
- **Instant script updates**: streaming inference and pronunciation inpainting let revised lines be regenerated in seconds, fitting agile development.
- **Consistent character identity**: high speaker similarity keeps a character's voice uniform across every language and dialect version, strengthening immersion.

CosyVoice compresses a months-long global dubbing effort into hours, enabling low-cost, high-fidelity, emotionally rich multilingual voice production at scale.

## Project Info

- **Owner**: [FunAudioLLM](https://github.com/FunAudioLLM)
- **Stars / forks**: 20,398 / 2,327 (last commit 2026-04-05)
- **License**: Apache-2.0
- **Languages**: Python 97%, Shell 2.7%, Dockerfile 0.3%
- **GitHub topics**: audio-generation, gpt-4o, text-to-speech, tts, cantonese, chatbot, chatgpt, chinese, english, fine-grained, fine-tuning, japanese, korean, multi-lingual, natural-language-generation, python, cosyvoice, cross-lingual, voice-cloning

## Environment Requirements

- **OS**: Linux
- **GPU**: an NVIDIA GPU is required for service deployment (the Docker commands use `--runtime=nvidia`), with optional TensorRT-LLM acceleration. Exact VRAM and CUDA requirements are not stated in the docs; 8 GB+ VRAM is a common baseline for the 0.5B/300M models.
- **RAM**: not specified.
- **Notes**:
  1. Use Conda to create a Python 3.10 environment.
  2. Linux needs sox and its development headers (Ubuntu: `libsox-dev`; CentOS: `sox-devel`).
  3. vLLM-accelerated inference needs a separate environment with a pinned vLLM (0.9.0 or 0.11.x+) and matching transformers and numpy versions.
  4. The optional `ttsfrd` package improves text normalization; otherwise WeTextProcessing is used by default.
  5. Docker images are provided for gRPC or FastAPI deployment and require the NVIDIA Container Toolkit.
  6. Zero-shot voice cloning is supported across many languages and dialects.
- **Key dependencies**: torch, transformers, numpy, vllm (optional, >=0.9.0), sox, libsox-dev, ttsfrd (optional), WeTextProcessing

## FAQ

**Q: `ttsfrd` fails to install or initialize on Mac (M1/M4) or other platforms — what now?**
A: The official ttsfrd wheels may not support your platform (e.g., Mac M1/M4, or anything other than Linux x86_64). If installation fails or you hit `AssertionError: failed to initialize ttsfrd resource`, uninstall it with `pip3 uninstall ttsfrd`; when ttsfrd is unavailable, the code automatically falls back to WeTextProcessing. ([issue #32](https://github.com/FunAudioLLM/CosyVoice/issues/32))

**Q: GPU inference produces noisy or static-filled audio while CPU output is fine. How do I fix it?**
A: This is usually a CUDA stream synchronization problem. Around line 129 of `cosyvoice/flow/flow_matching.py`, after obtaining the estimator and before entering the `with stream` block, explicitly synchronize the default stream by adding `torch.cuda.current_stream().synchronize()`. The data behind `data_ptr` may not be ready when it is fetched; synchronizing the default stream first resolves the issue. ([issue #1328](https://github.com/FunAudioLLM/CosyVoice/issues/1328))

**Q: Is there a fix for vibrato or doubled pronunciation at higher pitches?**
A: The issue is related to the HiFi-GAN generator. Code and models released after May fix it; update the project (in particular `cosyvoice/hifigan/generator.py` after commit 68100c267a0a4a01e88bb52511f64d1bd97c21fd) and use the updated base model. ([issue #1287](https://github.com/FunAudioLLM/CosyVoice/issues/1287))

**Q: Does CosyVoice support vLLM-accelerated inference, and how well does it work?**
A: Yes — the community and maintainers recommend vLLM. Compared with the earlier sglang adaptation (where radix-tree caching could fail to bring much benefit), the improved vLLM path has friendlier concurrency and is easier to deploy; CosyVoice3 mainly uses the vLLM approach. ([issue #873](https://github.com/FunAudioLLM/CosyVoice/issues/873))

**Q: Is the SFT (supervised fine-tuning) in the CosyVoice 2.0 paper applied to the LLM module or the flow-matching module?**
A: Per community discussion, the SFT described in sections 2.7 (Multi-Speaker Fine-tuning) and 2.8 targets the TextSpeech LM (the LLM module). Although the paper mentions adding tags, the TextSpeech LM does not consume speaker embeddings directly; timbre conditioning is mainly handled by the flow model. Current practice fine-tunes the LLM to improve instruction following — results should be validated experimentally, but the paper's description concerns the LM rather than added conditioning in the flow model. ([issue #787](https://github.com/FunAudioLLM/CosyVoice/issues/787))

**Q: Is TensorRT (TRT) acceleration planned for CosyVoice 3.0?**
A: Yes, it is on the roadmap; the maintainers have said TRT support for CosyVoice 3.0 would begin about a month after release. The HuggingFace repo may currently lack the `config.json` needed for TRT support, so watch the official repo for the complete TRT files. ([issue #1771](https://github.com/FunAudioLLM/CosyVoice/issues/1771))
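To make the stream-synchronization fix from the second FAQ entry concrete, here is a standalone, runnable illustration of the underlying pattern (this is a generic demonstration of the technique, not code from `flow_matching.py` itself):

```python
# Standalone illustration of the FAQ-2 fix pattern: synchronize the default
# CUDA stream before another stream consumes a tensor produced on it.
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x  # queued asynchronously on the default stream

    side = torch.cuda.Stream()
    # Without this, work on `side` may read y's memory before the
    # matmul above has finished -- the source of the noisy-audio bug.
    torch.cuda.current_stream().synchronize()

    with torch.cuda.stream(side):
        z = y.sum()  # safe: y is fully materialized
    side.synchronize()
    print(z.item())
```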