[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-index-tts--index-tts":3,"tool-index-tts--index-tts":62},[4,18,26,35,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,2,"2026-04-10T11:39:34",[14,15,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":32,"last_commit_at":41,"category_tags":42,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[43,13,15,14],"插件",{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 
特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[52,15,13,14],"语言模型",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,61],"视频",{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":65,"owner_name":73,"owner_avatar_url":74,"owner_bio":73,"owner_company":73,"owner_location":73,"owner_email":73,"owner_twitter":73,"owner_website":73,"owner_url":75,"languages":76,"stars":93,"forks":94,"last_commit_at":95,"license":96,"difficulty_score":10,"env_os":97,"env_gpu":98,"env_ram":99,"env_deps":100,"category_tags":109,"github_topics":111,"view_count":10,"oss_zip_url":73,"oss_zip_packed_at":73,"status":17,"created_at":119,"updated_at":120,"faqs":121,"releases":155},7069,"index-tts\u002Findex-tts","index-tts","An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System","IndexTTS2 是一款工业级的零样本语音合成系统，致力于生成自然且可控的人声。它主要解决了现有自回归大模型在语音合成中难以精确控制时长的问题，这一痛点曾严重制约视频配音等需要严格音画同步的应用场景。通过创新的时长控制机制，IndexTTS2 既能指定生成 token 数量以精准把控语音时长，也能自由生成并完美复现参考音频的韵律特征。\n\n该工具的独特亮点在于实现了情感表达与说话人身份的解耦。用户可以在零样本设置下，独立控制音色与情感：既准确还原目标音色，又完美复刻指定的情绪基调。此外，项目引入了基于文本描述的软指令机制（经 Qwen3 微调），让用户仅通过文字描述即可轻松引导语音的情感走向，大幅降低了情感控制的门槛。结合三阶段训练范式，IndexTTS2 在高度情感化的表达中依然保持了极高的清晰度和稳定性。\n\nIndexTTS2 非常适合需要高质量语音生成的开发者、研究人员以及内容创作者。无论是用于开发视频自动配音工具、构建交互式语音助手，还是进行多情感语音合成的学术研究，它都能提供超越当前主流模型的词错率、说话人相似度和情感保真度表现。","\n\n\u003Cdiv align=\"center\">\n\u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Findex-tts_index-tts_readme_d06e21d0c8f0.png' width=\"250\"\u002F>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\u003Ca href=\"docs\u002FREADME_zh.md\" style=\"font-size: 24px\">简体中文\u003C\u002Fa> | \n\u003Ca href=\"README.md\" style=\"font-size: 24px\">English\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n## The repository history has been reset. 
Please delete your local copy and re-clone.\n## （仓库历史已重置。请删除本地副本并重新克隆。）\n\n## 👉🏻 IndexTTS2 👈🏻\n\n\u003Ccenter>\u003Ch3>IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech\u003C\u002Fh3>\u003C\u002Fcenter>\n\n[![IndexTTS2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Findex-tts_index-tts_readme_16321094ebf9.png)](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Findex-tts_index-tts_readme_16321094ebf9.png)\n\n\n\u003Cdiv align="center">\n  \u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.21619'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArXiv-2506.21619-red?logo=arxiv'\u002F>\n  \u003C\u002Fa>\n  \u003Cbr\u002F>\n  \u003Ca href='https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-Code-orange?logo=github'\u002F>\n  \u003C\u002Fa>\n  \u003Ca href='https:\u002F\u002Findex-tts.github.io\u002Findex-tts2.github.io\u002F'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-Demo-orange?logo=github'\u002F>\n  \u003C\u002Fa>\n  \u003Cbr\u002F>\n  \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FIndexTeam\u002FIndexTTS-2-Demo'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-Demo-blue?logo=huggingface'\u002F>\n  \u003C\u002Fa>\n  \u003Ca href='https:\u002F\u002Fhuggingface.co\u002FIndexTeam\u002FIndexTTS-2'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-Model-blue?logo=huggingface' \u002F>\n  \u003C\u002Fa>\n  \u003Cbr\u002F>\n  \u003Ca href='https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002FIndexTeam\u002FIndexTTS-2-Demo'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Demo-purple?logo=modelscope'\u002F>\n  \u003C\u002Fa>\n  \u003Ca href='https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FIndexTeam\u002FIndexTTS-2'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-purple?logo=modelscope'\u002F>\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\n### Abstract\n\nExisting autoregressive large-scale text-to-speech (TTS) models have advantages in speech naturalness, but their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This becomes a significant limitation in applications requiring strict audio-visual synchronization, such as video dubbing.\n\nThis paper introduces IndexTTS2, which proposes a novel, general, and autoregressive model-friendly method for speech duration control.\n\nThe method supports two generation modes: one explicitly specifies the number of generated tokens to precisely control speech duration; the other freely generates speech in an autoregressive manner without specifying the number of tokens, while faithfully reproducing the prosodic features of the input prompt.\n\nFurthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control over timbre and emotion. In the zero-shot setting, the model can accurately reconstruct the target timbre (from the timbre prompt) while perfectly reproducing the specified emotional tone (from the style prompt).\n\nTo enhance speech clarity in highly emotional expressions, we incorporate GPT latent representations and design a novel three-stage training paradigm to improve the stability of the generated speech. 
Additionally, to lower the barrier for emotional control, we designed a soft instruction mechanism based on text descriptions by fine-tuning Qwen3, effectively guiding the generation of speech with the desired emotional orientation.\n\nFinally, experimental results on multiple datasets show that IndexTTS2 outperforms state-of-the-art zero-shot TTS models in terms of word error rate, speaker similarity, and emotional fidelity. Audio samples are available at: \u003Ca href="https:\u002F\u002Findex-tts.github.io\u002Findex-tts2.github.io\u002F">IndexTTS2 demo page\u003C\u002Fa>.\n\n**Tips:** Please contact the authors for more detailed information. For commercial usage and cooperation, please contact \u003Cu>indexspeech@bilibili.com\u003C\u002Fu>.\n\n\n### Feel IndexTTS2\n\n\u003Cdiv align="center">\n\n**IndexTTS2: The Future of Voice, Now Generating**\n\n[![IndexTTS2 Demo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Findex-tts_index-tts_readme_1c208472b84f.png)](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV136a9zqEk5)\n\n*Click the image to watch the IndexTTS2 introduction video.*\n\n\u003C\u002Fdiv>\n\n\n### Contact\n\nQQ Group：663272642(No.4) 1013410623(No.5)  \\\nDiscord：https:\u002F\u002Fdiscord.gg\u002FuT32E7KDmy  \\\nEmail：indexspeech@bilibili.com  \\\nYou are welcome to join our community! 🌏  \\\n欢迎大家来交流讨论！\n\n> [!CAUTION]\n> Thank you for your support of the bilibili indextts project!\n> Please note that the **only official channel** maintained by the core team is: [https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts](https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts).\n> ***Any other websites or services are not official***, and we cannot guarantee their security, accuracy, or timeliness.\n> For the latest updates, please always refer to this official repository.\n\n\n## 📣 Updates\n\n- `2025\u002F09\u002F08` 🔥🔥🔥  We release **IndexTTS-2** to the world!\n    - The first autoregressive TTS model with precise synthesis duration control, supporting both controllable and uncontrollable modes. \u003Ci>This functionality is not yet enabled in this release.\u003C\u002Fi>\n    - The model achieves highly expressive emotional speech synthesis, with emotion-controllable capabilities enabled through multiple input modalities.\n- `2025\u002F05\u002F14` 🔥🔥 We release **IndexTTS-1.5**, significantly improving the model's stability and its performance in the English language.\n- `2025\u002F03\u002F25` 🔥 We release **IndexTTS-1.0** with model weights and inference code.\n- `2025\u002F02\u002F12` 🔥 We submitted our paper to arXiv, and released our demos and test sets.\n\n\n## 🖥️ Neural Network Architecture\n\nArchitectural overview of IndexTTS2, our state-of-the-art speech model:\n\n\u003Cpicture>\n  \u003Cimg src="https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Findex-tts_index-tts_readme_ba661dfa1311.png"  width="800"\u002F>\n\u003C\u002Fpicture>\n\n\nThe key contributions of **IndexTTS2** are summarized as follows:\n\n - We propose a duration adaptation scheme for autoregressive TTS models. IndexTTS2 is the first autoregressive zero-shot TTS model to combine precise duration control with natural duration generation, and the method is scalable for any autoregressive large-scale TTS model.  \n - The emotional and speaker-related features are decoupled from the prompts, and a feature fusion strategy is designed to maintain semantic fluency and pronunciation clarity during emotionally rich expressions. 
Furthermore, a tool was developed for emotion control, utilizing natural language descriptions for the benefit of users.  \n - To address the lack of highly expressive speech data, we propose an effective training strategy, significantly enhancing the emotional expressiveness of zero-shot TTS to State-of-the-Art (SOTA) level.  \n - We will publicly release the code and pre-trained weights to facilitate future research and practical applications.  \n\n\n## Model Download\n\n| **HuggingFace**                                          | **ModelScope** |\n|----------------------------------------------------------|----------------------------------------------------------|\n| [😁 IndexTTS-2](https:\u002F\u002Fhuggingface.co\u002FIndexTeam\u002FIndexTTS-2) | [IndexTTS-2](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FIndexTeam\u002FIndexTTS-2) |\n| [IndexTTS-1.5](https:\u002F\u002Fhuggingface.co\u002FIndexTeam\u002FIndexTTS-1.5) | [IndexTTS-1.5](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FIndexTeam\u002FIndexTTS-1.5) |\n| [IndexTTS](https:\u002F\u002Fhuggingface.co\u002FIndexTeam\u002FIndex-TTS) | [IndexTTS](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FIndexTeam\u002FIndex-TTS) |\n\n\n## Usage Instructions\n\n### ⚙️ Environment Setup\n\n1. Ensure that you have both [git](https:\u002F\u002Fgit-scm.com\u002Fdownloads)\n   and [git-lfs](https:\u002F\u002Fgit-lfs.com\u002F) on your system.\n\nThe Git-LFS plugin must also be enabled on your current user account:\n\n```bash\ngit lfs install\n```\n\n2. Download this repository:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts.git && cd index-tts\ngit lfs pull  # download large repository files\n```\n\n3. Install the [uv package manager](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002Fgetting-started\u002Finstallation\u002F).\n   It is *required* for a reliable, modern installation environment.\n\n> [!TIP]\n> **Quick & Easy Installation Method:**\n> \n> There are many convenient ways to install the `uv` command on your computer.\n> Please check the link above to see all options. Alternatively, if you want\n> a very quick and easy method, you can install it as follows:\n> \n> ```bash\n> pip install -U uv\n> ```\n\n> [!WARNING]\n> We **only** support the `uv` installation method. Other tools, such as `conda`\n> or `pip`, don't provide any guarantees that they will install the correct\n> dependency versions. You will almost certainly have *random bugs, error messages,*\n> ***missing GPU acceleration**, and various other problems* if you don't use `uv`.\n> Please *do not report any issues* if you use non-standard installations, since\n> almost all such issues are invalid.\n> \n> Furthermore, `uv` is [up to 115x faster](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv\u002Fblob\u002Fmain\u002FBENCHMARKS.md)\n> than `pip`, which is another *great* reason to embrace the new industry-standard\n> for Python project management.\n\n4. Install required dependencies:\n\nWe use `uv` to manage the project's dependency environment. 
The following command\nwill *automatically* create a `.venv` project directory and then install the correct\nversions of Python and all required dependencies:\n\n```bash\nuv sync --all-extras\n```\n\nIf the download is slow, please try a *local mirror*, for example any of these\nlocal mirrors in China (choose one mirror from the list below):\n\n```bash\nuv sync --all-extras --default-index "https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple"\n\nuv sync --all-extras --default-index "https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fpypi\u002Fweb\u002Fsimple"\n```\n\n> [!TIP]\n> **Available Extra Features:**\n> \n> - `--all-extras`: Automatically adds *every* extra feature listed below. You can\n>   remove this flag if you want to customize your installation choices.\n> - `--extra webui`: Adds WebUI support (recommended).\n> - `--extra deepspeed`: Adds DeepSpeed support (may speed up inference on some\n>   systems).\n\n> [!IMPORTANT]\n> **Important (Windows):** The DeepSpeed library may be difficult to install for\n> some Windows users. You can skip it by removing the `--all-extras` flag. If you\n> want any of the other extra features above, you can manually add their specific\n> feature flags instead.\n> \n> **Important (Linux\u002FWindows):** If you see an error about CUDA during the installation,\n> please ensure that you have installed NVIDIA's [CUDA Toolkit](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit)\n> version **12.8** (or newer) on your system.\n\n5. Download the required models via [uv tool](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002Fguides\u002Ftools\u002F#installing-tools):\n\nDownload via `huggingface-cli`:\n\n```bash\nuv tool install "huggingface-hub[cli,hf_xet]"\n\nhf download IndexTeam\u002FIndexTTS-2 --local-dir=checkpoints\n```\n\nOr download via `modelscope`:\n\n```bash\nuv tool install "modelscope"\n\nmodelscope download --model IndexTeam\u002FIndexTTS-2 --local_dir checkpoints\n```\n\n> [!IMPORTANT]\n> If the commands above aren't available, please carefully read the `uv tool`\n> output. It will tell you how to add the tools to your system's path.\n\n> [!NOTE]\n> In addition to the above models, some small models will also be automatically\n> downloaded when the project is run for the first time. If your network environment\n> has slow access to HuggingFace, it is recommended to execute the following\n> command before running the code:\n> \n> ```bash\n> export HF_ENDPOINT="https:\u002F\u002Fhf-mirror.com"\n> ```\n\n\n#### 🖥️ Checking PyTorch GPU Acceleration\n\nIf you need to diagnose your environment to see which GPUs are detected,\nyou can use our included utility to check your system:\n\n```bash\nuv run tools\u002Fgpu_check.py\n```\n\n\n### 🔥 IndexTTS2 Quickstart\n\n#### 🌐 Web Demo\n\n```bash\nuv run webui.py\n```\n\nOpen your browser and visit `http:\u002F\u002F127.0.0.1:7860` to see the demo.\n\nYou can also adjust the settings to enable features such as FP16 inference (lower\nVRAM usage), DeepSpeed acceleration, compiled CUDA kernels for speed, etc. All\navailable options can be seen via the following command:\n\n```bash\nuv run webui.py -h\n```\n\nHave fun!\n\n> [!IMPORTANT]\n> It can be very helpful to use **FP16** (half-precision) inference. It is faster\n> and uses less VRAM, with a very small quality loss.\n> \n> **DeepSpeed** *may* also speed up inference on some systems, but it could also\n> make it slower. The performance impact is highly dependent on your specific\n> hardware, drivers and operating system. 
Please try with and without it,\n> to discover what works best on your personal system.\n> \n> Lastly, be aware that *all* `uv` commands will **automatically activate** the correct\n> per-project virtual environments. Do *not* manually activate any environments\n> before running `uv` commands, since that could lead to dependency conflicts!\n\n\n#### 📝 Using IndexTTS2 in Python\n\nTo run scripts, you *must* use the `uv run \u003Cfile.py>` command to ensure that\nthe code runs inside your current \"uv\" environment. It *may* sometimes also be\nnecessary to add the current directory to your `PYTHONPATH`, to help it find\nthe IndexTTS modules.\n\nExample of running a script via `uv`:\n\n```bash\nPYTHONPATH=\"$PYTHONPATH:.\" uv run indextts\u002Finfer_v2.py\n```\n\nHere are several examples of how to use IndexTTS2 in your own scripts:\n\n1. Synthesize new speech with a single reference audio file (voice cloning):\n\n```python\nfrom indextts.infer_v2 import IndexTTS2\ntts = IndexTTS2(cfg_path=\"checkpoints\u002Fconfig.yaml\", model_dir=\"checkpoints\", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\ntext = \"Translate for me, what is a surprise!\"\ntts.infer(spk_audio_prompt='examples\u002Fvoice_01.wav', text=text, output_path=\"gen.wav\", verbose=True)\n```\n\n2. Using a separate, emotional reference audio file to condition the speech synthesis:\n\n```python\nfrom indextts.infer_v2 import IndexTTS2\ntts = IndexTTS2(cfg_path=\"checkpoints\u002Fconfig.yaml\", model_dir=\"checkpoints\", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\ntext = \"酒楼丧尽天良，开始借机竞拍房间，哎，一群蠢货。\"\ntts.infer(spk_audio_prompt='examples\u002Fvoice_07.wav', text=text, output_path=\"gen.wav\", emo_audio_prompt=\"examples\u002Femo_sad.wav\", verbose=True)\n```\n\n3. When an emotional reference audio file is specified, you can optionally set\n   the `emo_alpha` to adjust how much it affects the output.\n   Valid range is `0.0 - 1.0`, and the default value is `1.0` (100%):\n\n```python\nfrom indextts.infer_v2 import IndexTTS2\ntts = IndexTTS2(cfg_path=\"checkpoints\u002Fconfig.yaml\", model_dir=\"checkpoints\", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\ntext = \"酒楼丧尽天良，开始借机竞拍房间，哎，一群蠢货。\"\ntts.infer(spk_audio_prompt='examples\u002Fvoice_07.wav', text=text, output_path=\"gen.wav\", emo_audio_prompt=\"examples\u002Femo_sad.wav\", emo_alpha=0.9, verbose=True)\n```\n\n4. It's also possible to omit the emotional reference audio and instead provide\n   an 8-float list specifying the intensity of each emotion, in the following order:\n   `[happy, angry, sad, afraid, disgusted, melancholic, surprised, calm]`.\n   You can additionally use the `use_random` parameter to introduce stochasticity\n   during inference; the default is `False`, and setting it to `True` enables\n   randomness:\n\n> [!NOTE]\n> Enabling random sampling will reduce the voice cloning fidelity of the speech\n> synthesis.\n\n```python\nfrom indextts.infer_v2 import IndexTTS2\ntts = IndexTTS2(cfg_path=\"checkpoints\u002Fconfig.yaml\", model_dir=\"checkpoints\", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\ntext = \"对不起嘛！我的记性真的不太好，但是和你在一起的事情，我都会努力记住的~\"\ntts.infer(spk_audio_prompt='examples\u002F09.wav', text=text, output_path=\"gen.wav\", emo_vector=[0, 0, 0.8, 0, 0, 0, 0, 0], use_random=False, verbose=True)\n```\n\n5. Alternatively, you can enable `use_emo_text` to guide the emotions based on\n   your provided `text` script. 
Your text script will then automatically\n   be converted into emotion vectors.\n   It's recommended to use `emo_alpha` around 0.6 (or lower) when using the text\n   emotion modes, for more natural sounding speech.\n   You can introduce randomness with `use_random` (default: `False`;\n   `True` enables randomness):\n\n```python\nfrom indextts.infer_v2 import IndexTTS2\ntts = IndexTTS2(cfg_path=\"checkpoints\u002Fconfig.yaml\", model_dir=\"checkpoints\", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\ntext = \"快躲起来！是他要来了！他要来抓我们了！\"\ntts.infer(spk_audio_prompt='examples\u002Fvoice_12.wav', text=text, output_path=\"gen.wav\", emo_alpha=0.6, use_emo_text=True, use_random=False, verbose=True)\n```\n\n6. It's also possible to directly provide a specific text emotion description\n   via the `emo_text` parameter. Your emotion text will then automatically be\n   converted into emotion vectors. This gives you separate control of the text\n   script and the text emotion description:\n\n```python\nfrom indextts.infer_v2 import IndexTTS2\ntts = IndexTTS2(cfg_path=\"checkpoints\u002Fconfig.yaml\", model_dir=\"checkpoints\", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\ntext = \"快躲起来！是他要来了！他要来抓我们了！\"\nemo_text = \"你吓死我了！你是鬼吗？\"\ntts.infer(spk_audio_prompt='examples\u002Fvoice_12.wav', text=text, output_path=\"gen.wav\", emo_alpha=0.6, use_emo_text=True, emo_text=emo_text, use_random=False, verbose=True)\n```\n\n> [!TIP]\n> **Pinyin Usage Notes:**\n> \n> IndexTTS2 still supports mixed modeling of Chinese characters and Pinyin.\n> When you need precise pronunciation control, please provide text with specific Pinyin annotations to activate the Pinyin control feature.\n> Note that Pinyin control does not work for every possible consonant–vowel combination; only valid Chinese Pinyin cases are supported.\n> For the full list of valid entries, please refer to `checkpoints\u002Fpinyin.vocab`.\n>\n> Example:\n> ```\n> 之前你做DE5很好，所以这一次也DEI3做DE2很好才XING2，如果这次目标完成得不错的话，我们就直接打DI1去银行取钱。\n> ```\n\n### Legacy: IndexTTS1 User Guide\n\nYou can also use our previous IndexTTS1 model by importing a different module:\n\n```python\nfrom indextts.infer import IndexTTS\ntts = IndexTTS(model_dir=\"checkpoints\",cfg_path=\"checkpoints\u002Fconfig.yaml\")\nvoice = \"examples\u002Fvoice_07.wav\"\ntext = \"大家好，我现在正在bilibili 体验 ai 科技，说实话，来之前我绝对想不到！AI技术已经发展到这样匪夷所思的地步了！比如说，现在正在说话的其实是B站为我现场复刻的数字分身，简直就是平行宇宙的另一个我了。如果大家也想体验更多深入的AIGC功能，可以访问 bilibili studio，相信我，你们也会吃惊的。\"\ntts.infer(voice, text, 'gen.wav')\n```\n\nFor more detailed information, see [README_INDEXTTS_1_5](archive\u002FREADME_INDEXTTS_1_5.md),\nor visit the IndexTTS1 repository at \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts\u002Ftree\u002Fv1.5.0\">index-tts:v1.5.0\u003C\u002Fa>.\n\n\n## Our Releases and Demos\n\n### IndexTTS2: [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.21619); [[Demo]](https:\u002F\u002Findex-tts.github.io\u002Findex-tts2.github.io\u002F); [[ModelScope]](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002FIndexTeam\u002FIndexTTS-2-Demo); [[HuggingFace]](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FIndexTeam\u002FIndexTTS-2-Demo)\n\n### IndexTTS1: [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05512); [[Demo]](https:\u002F\u002Findex-tts.github.io\u002F); [[ModelScope]](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002FIndexTeam\u002FIndexTTS-Demo); [[HuggingFace]](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FIndexTeam\u002FIndexTTS)\n\n\n## Acknowledgements\n\n1. 
[tortoise-tts](https:\u002F\u002Fgithub.com\u002Fneonbjb\u002Ftortoise-tts)\n2. [XTTSv2](https:\u002F\u002Fgithub.com\u002Fcoqui-ai\u002FTTS)\n3. [BigVGAN](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN)\n4. [wenet](https:\u002F\u002Fgithub.com\u002Fwenet-e2e\u002Fwenet\u002Ftree\u002Fmain)\n5. [icefall](https:\u002F\u002Fgithub.com\u002Fk2-fsa\u002Ficefall)\n6. [maskgct](https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002FAmphion\u002Ftree\u002Fmain\u002Fmodels\u002Ftts\u002Fmaskgct)\n7. [seed-vc](https:\u002F\u002Fgithub.com\u002FPlachtaa\u002Fseed-vc)\n\n## Contributors at Bilibili\nWe sincerely thank colleagues from different roles at Bilibili, whose combined efforts made the IndexTTS series possible.\n\n### Core Authors\n - **Wei Deng** - Core author; Initiated the IndexTTS project, led the development of the IndexTTS1 data pipeline, model architecture design and training, as well as iterative optimization of the IndexTTS series of models, focusing on fundamental capability building and performance optimization.\n - **Siyi Zhou** – Core author; in IndexTTS2, led model architecture design and training pipeline optimization, focusing on key features such as multilingual and emotional synthesis.\n - **Jingchen Shu** - Core author; worked on overall architecture design, cross-lingual modeling solutions, and training strategy optimization, driving model iteration.\n - **Xun Zhou** - Core author; worked on cross-lingual data processing and experiments, explored multilingual training strategies, and contributed to audio quality improvement and stability evaluation.\n - **Jinchao Wang** - Core author; worked on model development and deployment, building the inference framework and supporting system integration.\n - **Yiquan Zhou** - Core author; contributed to model experiments and validation, and proposed and implemented text-based emotion control.\n - **Yi He** - Core author; contributed to model experiments and validation.\n - **Lu Wang** – Core author; worked on data processing and model evaluation, supporting model training and performance verification.\n\n### Technical Contributors\n - **Yining Wang** - Supporting contributor; contributed to open-source code implementation and maintenance, supporting feature adaptation and community release.\n - **Yong Wu** - Supporting contributor; worked on data processing and experimental support, ensuring data quality and efficiency for model training and iteration.\n - **Yaqin Huang** – Supporting contributor; contributed to systematic model evaluation and effect tracking, providing feedback to support iterative improvements.\n - **Yunhan Xu** – Supporting contributor; provided guidance in recording and data collection, while also offering feedback from a product and operations perspective to improve usability and practical application.\n - **Yuelang Sun** – Supporting contributor; provided professional support in audio recording and data collection, ensuring high-quality data for model training and evaluation.\n - **Yihuang Liang** - Supporting contributor; worked on systematic model evaluation and project promotion, helping IndexTTS expand its reach and engagement.\n\n### Technical Guidance\n - **Huyang Sun** - Provided strong support for the IndexTTS project, ensuring strategic alignment and resource backing.\n - **Bin Xia** - Contributed to the review, optimization, and follow-up of technical solutions, focusing on ensuring model effectiveness.\n\n\n## 📚 Citation\n\n🌟 If you find our work helpful, please leave us a star and cite our 
paper.\n\n\nIndexTTS2:\n\n```\n@article{zhou2025indextts2,\n  title={IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech},\n  author={Siyi Zhou and Yiquan Zhou and Yi He and Xun Zhou and Jinchao Wang and Wei Deng and Jingchen Shu},\n  journal={arXiv preprint arXiv:2506.21619},\n  year={2025}\n}\n```\n\n\nIndexTTS:\n\n```\n@article{deng2025indextts,\n  title={IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System},\n  author={Wei Deng and Siyi Zhou and Jingchen Shu and Jinchao Wang and Lu Wang},\n  journal={arXiv preprint arXiv:2502.05512},\n  year={2025},\n  doi={10.48550\u002FarXiv.2502.05512},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05512}\n}\n```\n","\u003Cdiv align="center">\n\u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Findex-tts_index-tts_readme_d06e21d0c8f0.png' width="250"\u002F>\n\u003C\u002Fdiv>\n\n\u003Cdiv align="center">\n\u003Ca href="docs\u002FREADME_zh.md" style="font-size: 24px">简体中文\u003C\u002Fa> | \n\u003Ca href="README.md" style="font-size: 24px">English\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n## 仓库历史已重置。请删除本地副本并重新克隆。\n\n## 👉🏻 IndexTTS2 👈🏻\n\n\u003Ccenter>\u003Ch3>IndexTTS2：情感丰富且时长可控的自回归零样本文本转语音技术突破\u003C\u002Fh3>\u003C\u002Fcenter>\n\n[![IndexTTS2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Findex-tts_index-tts_readme_16321094ebf9.png)](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Findex-tts_index-tts_readme_16321094ebf9.png)\n\n\n\u003Cdiv align="center">\n  \u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.21619'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArXiv-2506.21619-red?logo=arxiv'\u002F>\n  \u003C\u002Fa>\n  \u003Cbr\u002F>\n  \u003Ca href='https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-Code-orange?logo=github'\u002F>\n  \u003C\u002Fa>\n  \u003Ca href='https:\u002F\u002Findex-tts.github.io\u002Findex-tts2.github.io\u002F'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-Demo-orange?logo=github'\u002F>\n  \u003C\u002Fa>\n  \u003Cbr\u002F>\n  \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FIndexTeam\u002FIndexTTS-2-Demo'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-Demo-blue?logo=huggingface'\u002F>\n  \u003C\u002Fa>\n  \u003Ca href='https:\u002F\u002Fhuggingface.co\u002FIndexTeam\u002FIndexTTS-2'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-Model-blue?logo=huggingface' \u002F>\n  \u003C\u002Fa>\n  \u003Cbr\u002F>\n  \u003Ca href='https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002FIndexTeam\u002FIndexTTS-2-Demo'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Demo-purple?logo=modelscope'\u002F>\n  \u003C\u002Fa>\n  \u003Ca href='https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FIndexTeam\u002FIndexTTS-2'>\n    \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-purple?logo=modelscope'\u002F>\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\n### 
摘要\n\n现有的自回归大规模文本转语音（TTS）模型在语音自然度方面具有优势，但其逐 token 生成机制使得精确控制合成语音的时长变得困难。这在需要严格音画同步的应用场景中，例如视频配音，成为显著的限制。\n\n本文介绍了IndexTTS2，提出了一种新颖、通用且与自回归模型兼容的语音时长控制方法。\n\n该方法支持两种生成模式：一种明确指定生成的 token 数量以精确控制语音时长；另一种则无需指定 token 数量，以自回归方式自由生成语音，同时忠实再现输入提示中的韵律特征。\n\n此外，IndexTTS2实现了情感表达与说话人身份的解耦，从而能够独立控制音色和情感。在零样本场景下，模型可以准确地重建目标音色（来自音色提示），同时完美还原指定的情感基调（来自风格提示）。\n\n为提升高度情感化表达下的语音清晰度，我们引入了GPT潜在表征，并设计了一种全新的三阶段训练范式，以增强生成语音的稳定性。另外，为了降低情感控制的门槛，我们通过微调 Qwen3，设计了一套基于文本描述的软指令机制，有效引导生成符合预期情感倾向的语音。\n\n最后，多组实验结果表明，IndexTTS2在词错误率、说话人相似度以及情感保真度等方面均优于当前最先进的零样本TTS模型。音频示例可在以下链接查看：[IndexTTS2演示页](https:\u002F\u002Findex-tts.github.io\u002Findex-tts2.github.io\u002F)。\n\n**提示**：如需更详细的信息，请联系作者。商业使用及合作事宜，请发送邮件至\u003Cu>indexspeech@bilibili.com\u003C\u002Fu>。\n\n\n### 感受IndexTTS2\n\n\u003Cdiv align="center">\n\n**IndexTTS2：语音的未来，现已生成**\n\n[![IndexTTS2演示](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Findex-tts_index-tts_readme_1c208472b84f.png)](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV136a9zqEk5)\n\n*点击图片观看IndexTTS2介绍视频。*\n\n\u003C\u002Fdiv>\n\n\n### 联系方式\n\nQQ群：663272642(No.4) 1013410623(No.5)  \\\nDiscord：https:\u002F\u002Fdiscord.gg\u002FuT32E7KDmy  \\\n邮箱：indexspeech@bilibili.com  \\\n欢迎大家加入我们的社区！ 🌏  \\\n欢迎大家来交流讨论！\n\n> [!CAUTION]\n> 感谢您对bilibili indextts项目的大力支持！\n> 请注意，核心团队维护的**唯一官方渠道**是：[https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts](https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts)。\n> ***其他任何网站或服务均非官方***，我们无法保证其安全性、准确性或及时性。\n> 如需最新动态，请务必参考此官方仓库。\n\n\n## 📣 更新\n\n- `2025\u002F09\u002F08` 🔥🔥🔥 我们向全球发布**IndexTTS-2**！\n    - 首个具备精确合成时长控制功能的自回归TTS模型，支持可控与不可控两种模式。\u003Ci>本次发布尚未启用该功能。\u003C\u002Fi>\n    - 该模型实现了高度情感化的语音合成，并通过多种输入模态实现了情感可控能力。\n- `2025\u002F05\u002F14` 🔥🔥 我们发布了**IndexTTS-1.5**，显著提升了模型的稳定性和英语表现。\n- `2025\u002F03\u002F25` 🔥 我们发布了包含模型权重和推理代码的**IndexTTS-1.0**。\n- `2025\u002F02\u002F12` 🔥 我们将论文提交至arXiv，并公开了演示和测试数据集。\n\n\n## 🖥️ 神经网络架构\n\n我们最先进语音模型——IndexTTS2的架构概览：\n\n\u003Cpicture>\n  \u003Cimg src="https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Findex-tts_index-tts_readme_ba661dfa1311.png"  width="800"\u002F>\n\u003C\u002Fpicture>\n\n\n**IndexTTS2**的主要贡献总结如下：\n\n - 我们提出了一种适用于自回归TTS模型的时长适配方案。IndexTTS2是首个将精确时长控制与自然时长生成相结合的自回归零样本TTS模型，且该方法可扩展应用于任何自回归大规模TTS模型。  \n - 将情感和说话人相关特征从输入提示中解耦，并设计了一种特征融合策略，以在情感丰富的表达中保持语义流畅和发音清晰。此外，我们还开发了一款利用自然语言描述进行情感控制的工具，方便用户操作。  \n - 针对缺乏高度情感化语音数据的问题，我们提出了一种有效的训练策略，显著提升了零样本TTS的情感表现力，达到当前最先进水平（SOTA）。  \n - 我们将公开代码和预训练权重，以促进未来的研究和实际应用。\n\n## 模型下载\n\n| **HuggingFace**                                          | **ModelScope** |\n|----------------------------------------------------------|----------------------------------------------------------|\n| [😁 IndexTTS-2](https:\u002F\u002Fhuggingface.co\u002FIndexTeam\u002FIndexTTS-2) | [IndexTTS-2](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FIndexTeam\u002FIndexTTS-2) |\n| [IndexTTS-1.5](https:\u002F\u002Fhuggingface.co\u002FIndexTeam\u002FIndexTTS-1.5) | [IndexTTS-1.5](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FIndexTeam\u002FIndexTTS-1.5) |\n| [IndexTTS](https:\u002F\u002Fhuggingface.co\u002FIndexTeam\u002FIndex-TTS) | [IndexTTS](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FIndexTeam\u002FIndex-TTS) |\n\n\n## 使用说明\n\n### ⚙️ 环境设置\n\n1. 请确保您的系统已安装 [git](https:\u002F\u002Fgit-scm.com\u002Fdownloads)\n   和 [git-lfs](https:\u002F\u002Fgit-lfs.com\u002F)。\n\n同时，还需在当前用户账户中启用 Git-LFS 插件：\n\n```bash\ngit lfs install\n```\n\n2. 克隆本仓库：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts.git && cd index-tts\ngit lfs pull  # 下载大型文件\n```\n\n3. 
安装 [uv 包管理器](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002Fgetting-started\u002Finstallation\u002F)。\n   这是构建可靠、现代化开发环境的*必要*条件。\n\n> [!TIP]\n> **快速简便的安装方法：**\n> \n> 您可以通过多种方式在本地安装 `uv` 命令行工具。请参阅上方链接以获取所有选项。或者，如果您希望采用更快速简便的方式，可以直接运行：\n> \n> ```bash\n> pip install -U uv\n> ```\n\n> [!WARNING]\n> 我们仅支持使用 `uv` 进行安装。其他工具如 `conda` 或 `pip` 并不能保证正确安装所需的依赖版本。如果不使用 `uv`，您几乎肯定会遇到*随机性错误、报错信息、缺失 GPU 加速*以及其他各种问题。因此，请勿就非标准安装方式提出任何问题，因为这些问题大多属于无效范畴。\n\n此外，`uv` 的安装速度比 `pip` 快达 **115 倍**（参考：[BENCHMARKS.md](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv\u002Fblob\u002Fmain\u002FBENCHMARKS.md)），这也是选择这一行业新标准来管理 Python 项目的一大理由。\n\n4. 安装所需依赖：\n\n我们使用 `uv` 来管理项目的依赖环境。以下命令将*自动*创建一个 `.venv` 目录，并安装正确版本的 Python 及所有必需的依赖项：\n\n```bash\nuv sync --all-extras\n```\n\n如果下载速度较慢，可以尝试使用国内镜像源，例如以下任一中国镜像：\n\n```bash\nuv sync --all-extras --default-index "https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple"\n\nuv sync --all-extras --default-index "https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fpypi\u002Fweb\u002Fsimple"\n```\n\n> [!TIP]\n> **可选附加功能：**\n> \n> - `--all-extras`：自动启用下方列出的所有附加功能。若需自定义安装内容，可移除该参数。\n> - `--extra webui`：添加 WebUI 支持（推荐）。\n> - `--extra deepspeed`：添加 DeepSpeed 支持（可能加速部分系统的推理过程）。\n\n> [!IMPORTANT]\n> **Windows 用户注意：** 对于部分 Windows 用户而言，DeepSpeed 库的安装可能存在困难。您可以移除 `--all-extras` 参数以跳过该组件。若仍希望启用其他附加功能，则可单独添加相应的标志。\n\n> **Linux\u002FWindows 用户注意：** 如果在安装过程中出现关于 CUDA 的错误，请确保已在系统上安装 NVIDIA 的 [CUDA Toolkit](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit)，且版本为 **12.8** 或更高。\n\n5. 通过 `uv tool` 下载所需模型：\n\n使用 `huggingface-cli` 下载：\n\n```bash\nuv tool install "huggingface-hub[cli,hf_xet]"\n\nhf download IndexTeam\u002FIndexTTS-2 --local-dir=checkpoints\n```\n\n或使用 `modelscope` 下载：\n\n```bash\nuv tool install "modelscope"\n\nmodelscope download --model IndexTeam\u002FIndexTTS-2 --local_dir checkpoints\n```\n\n> [!IMPORTANT]\n> 若上述命令无法执行，请仔细阅读 `uv tool` 的输出信息。它会指导您如何将这些工具添加到系统路径中。\n\n> [!NOTE]\n> 除了上述模型外，首次运行项目时还会自动下载一些小型模型。如果您的网络访问 HuggingFace 较为缓慢，建议在运行代码前先执行以下命令：\n> \n> ```bash\n> export HF_ENDPOINT="https:\u002F\u002Fhf-mirror.com"\n> ```\n\n\n#### 🖥️ 检查 PyTorch GPU 加速状态\n\n若需诊断当前环境并查看已检测到的 GPU 设备，可使用我们提供的工具进行检查：\n\n```bash\nuv run tools\u002Fgpu_check.py\n```\n\n### 🔥 IndexTTS2 快速入门\n\n#### 🌐 网页演示\n\n```bash\nuv run webui.py\n```\n\n打开浏览器并访问 `http:\u002F\u002F127.0.0.1:7860` 即可查看演示。\n\n你还可以调整设置，启用诸如 FP16 推理（降低显存占用）、DeepSpeed 加速、编译后的 CUDA 内核以提升速度等功能。所有可用选项可通过以下命令查看：\n\n```bash\nuv run webui.py -h\n```\n\n祝你玩得开心！\n\n> [!IMPORTANT]\n> 使用 **FP16**（半精度）推理会非常有帮助。它不仅速度更快，还能减少显存占用，且对音质的影响极小。\n> \n> **DeepSpeed** *可能*会在某些系统上加速推理，但也可能导致速度变慢。其性能影响高度依赖于你的具体硬件、驱动程序和操作系统。请尝试开启和关闭 DeepSpeed，以确定哪种方式在你的设备上效果最佳。\n> \n> 最后，请注意，所有 `uv` 命令都会 **自动激活** 项目所需的正确虚拟环境。在运行 `uv` 命令之前，请勿手动激活任何虚拟环境，否则可能会导致依赖冲突！\n\n\n#### 📝 在 Python 中使用 IndexTTS2\n\n要运行脚本，*必须*使用 `uv run \u003Cfile.py>` 命令，以确保代码在当前的 “uv” 环境中执行。有时，你还需要将当前目录添加到 `PYTHONPATH` 中，以便找到 IndexTTS 模块。\n\n通过 `uv` 运行脚本的示例：\n\n```bash\nPYTHONPATH="$PYTHONPATH:." uv run indextts\u002Finfer_v2.py\n```\n\n以下是几种在自定义脚本中使用 IndexTTS2 的示例：\n\n1. 使用单个参考音频文件合成新语音（语音克隆）：\n\n```python\nfrom indextts.infer_v2 import IndexTTS2\ntts = IndexTTS2(cfg_path="checkpoints\u002Fconfig.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\ntext = "Translate for me, what is a surprise!"\ntts.infer(spk_audio_prompt='examples\u002Fvoice_01.wav', text=text, output_path="gen.wav", verbose=True)\n```\n\n2. 
使用单独的情感参考音频文件来调节语音合成：\n\n```python\nfrom indextts.infer_v2 import IndexTTS2\ntts = IndexTTS2(cfg_path="checkpoints\u002Fconfig.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\ntext = "酒楼丧尽天良，开始借机竞拍房间，哎，一群蠢货。"\ntts.infer(spk_audio_prompt='examples\u002Fvoice_07.wav', text=text, output_path="gen.wav", emo_audio_prompt="examples\u002Femo_sad.wav", verbose=True)\n```\n\n3. 当指定了情感参考音频时，你可以选择性地设置 `emo_alpha` 来调整其对输出的影响程度。有效范围为 `0.0 - 1.0`，默认值为 `1.0`（100%）：\n\n```python\nfrom indextts.infer_v2 import IndexTTS2\ntts = IndexTTS2(cfg_path="checkpoints\u002Fconfig.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\ntext = "酒楼丧尽天良，开始借机竞拍房间，哎，一群蠢货。"\ntts.infer(spk_audio_prompt='examples\u002Fvoice_07.wav', text=text, output_path="gen.wav", emo_audio_prompt="examples\u002Femo_sad.wav", emo_alpha=0.9, verbose=True)\n```\n\n4. 也可以不提供情感参考音频，而是直接给出一个包含 8 个浮点数的列表，按顺序指定每种情感的强度：`[happy, angry, sad, afraid, disgusted, melancholic, surprised, calm]`。此外，你还可以使用 `use_random` 参数来引入随机性；默认值为 `False`，将其设置为 `True` 则会启用随机性：\n\n> [!NOTE]\n> 启用随机采样会降低语音合成的克隆保真度。\n\n```python\nfrom indextts.infer_v2 import IndexTTS2\ntts = IndexTTS2(cfg_path="checkpoints\u002Fconfig.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\ntext = "对不起嘛！我的记性真的不太好，但是和你在一起的事情，我都会努力记住的~"\ntts.infer(spk_audio_prompt='examples\u002F09.wav', text=text, output_path="gen.wav", emo_vector=[0, 0, 0.8, 0, 0, 0, 0, 0], use_random=False, verbose=True)\n```\n\n5. 或者，你可以启用 `use_emo_text`，根据你提供的文本内容来引导情感。此时，文本将自动转换为情感向量。建议在使用文本情感模式时，将 `emo_alpha` 设置为 0.6 左右（或更低），以获得更自然的语音效果。你还可以通过 `use_random` 引入随机性（默认为 `False`；设置为 `True` 则启用随机性）：\n\n```python\nfrom indextts.infer_v2 import IndexTTS2\ntts = IndexTTS2(cfg_path="checkpoints\u002Fconfig.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\ntext = "快躲起来！是他要来了！他要来抓我们了！"\ntts.infer(spk_audio_prompt='examples\u002Fvoice_12.wav', text=text, output_path="gen.wav", emo_alpha=0.6, use_emo_text=True, use_random=False, verbose=True)\n```\n\n6. 
你也可以直接通过 `emo_text` 参数提供特定的情感描述。系统会自动将该描述转换为情感向量，从而实现对文本内容和情感描述的独立控制：\n\n```python\nfrom indextts.infer_v2 import IndexTTS2\ntts = IndexTTS2(cfg_path="checkpoints\u002Fconfig.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\ntext = "快躲起来！是他要来了！他要来抓我们了！"\nemo_text = "你吓死我了！你是鬼吗？"\ntts.infer(spk_audio_prompt='examples\u002Fvoice_12.wav', text=text, output_path="gen.wav", emo_alpha=0.6, use_emo_text=True, emo_text=emo_text, use_random=False, verbose=True)\n```\n\n> [!TIP]\n> **拼音使用说明：**\n> \n> IndexTTS2 仍然支持汉字与拼音的混合建模。当你需要精确控制发音时，请提供带有具体拼音标注的文本，以激活拼音控制功能。请注意，拼音控制并非适用于所有辅音—元音组合；仅支持有效的中文拼音情况。完整支持的条目列表请参阅 `checkpoints\u002Fpinyin.vocab`。\n>\n> 示例：\n> ```\n> 之前你做DE5很好，所以这一次也DEI3做DE2很好才XING2，如果这次目标完成得不错的话，我们就直接打DI1去银行取钱。\n> ```\n\n### 旧版：IndexTTS1 使用指南\n\n你也可以通过导入不同的模块来使用我们之前的 IndexTTS1 模型：\n\n```python\nfrom indextts.infer import IndexTTS\ntts = IndexTTS(model_dir="checkpoints",cfg_path="checkpoints\u002Fconfig.yaml")\nvoice = "examples\u002Fvoice_07.wav"\ntext = "大家好，我现在正在bilibili 体验 ai 科技，说实话，来之前我绝对想不到！AI技术已经发展到这样匪夷所思的地步了！比如说，现在正在说话的其实是B站为我现场复刻的数字分身，简直就是平行宇宙的另一个我了。如果大家也想体验更多深入的AIGC功能，可以访问 bilibili studio，相信我，你们也会吃惊的。"\ntts.infer(voice, text, 'gen.wav')\n```\n\n如需更多信息，请参阅 [README_INDEXTTS_1_5](archive\u002FREADME_INDEXTTS_1_5.md)，或访问 IndexTTS1 仓库：\u003Ca href="https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts\u002Ftree\u002Fv1.5.0">index-tts:v1.5.0\u003C\u002Fa>。\n\n\n## 我们的发布与演示\n\n### IndexTTS2：[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.21619)；[[演示]](https:\u002F\u002Findex-tts.github.io\u002Findex-tts2.github.io\u002F)；[[ModelScope]](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002FIndexTeam\u002FIndexTTS-2-Demo)；[[HuggingFace]](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FIndexTeam\u002FIndexTTS-2-Demo)\n\n### IndexTTS1：[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05512)；[[演示]](https:\u002F\u002Findex-tts.github.io\u002F)；[[ModelScope]](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002FIndexTeam\u002FIndexTTS-Demo)；[[HuggingFace]](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FIndexTeam\u002FIndexTTS)\n\n\n## 致谢\n\n1. [tortoise-tts](https:\u002F\u002Fgithub.com\u002Fneonbjb\u002Ftortoise-tts)\n2. [XTTSv2](https:\u002F\u002Fgithub.com\u002Fcoqui-ai\u002FTTS)\n3. [BigVGAN](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN)\n4. [wenet](https:\u002F\u002Fgithub.com\u002Fwenet-e2e\u002Fwenet\u002Ftree\u002Fmain)\n5. [icefall](https:\u002F\u002Fgithub.com\u002Fk2-fsa\u002Ficefall)\n6. [maskgct](https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002FAmphion\u002Ftree\u002Fmain\u002Fmodels\u002Ftts\u002Fmaskgct)\n7. 
[seed-vc](https:\u002F\u002Fgithub.com\u002FPlachtaa\u002Fseed-vc)\n\n## Bilibili 的贡献者\n我们衷心感谢 Bilibili 各个岗位的同事们，正是大家的共同努力才使得 IndexTTS 系列得以实现。\n\n### 核心作者\n - **Wei Deng** - 核心作者；发起并主导了 IndexTTS 项目，负责 IndexTTS1 数据流水线、模型架构设计与训练，以及 IndexTTS 系列模型的迭代优化工作，专注于基础能力构建与性能提升。\n - **Siyi Zhou** – 核心作者；在 IndexTTS2 中，主导了模型架构设计与训练流水线优化，重点聚焦多语言及情感合成等核心功能。\n - **Jingchen Shu** - 核心作者；参与整体架构设计、跨语言建模方案及训练策略优化，推动模型迭代。\n - **Xun Zhou** - 核心作者；负责跨语言数据处理与实验，探索多语言训练策略，并为音频质量提升及稳定性评估作出贡献。\n - **Jinchao Wang** - 核心作者；从事模型开发与部署工作，搭建推理框架并支持系统集成。\n - **Yiquan Zhou** - 核心作者；参与模型实验与验证，并提出和实现了基于文本的情感控制方法。\n - **Yi He** - 核心作者；参与模型实验与验证工作。\n - **Lu Wang** – 核心作者；负责数据处理与模型评估，支持模型训练与性能验证。\n\n### 技术支持人员\n - **Yining Wang** - 支持性贡献者；参与开源代码的实现与维护，支持功能适配与社区发布。\n - **Yong Wu** - 支持性贡献者；负责数据处理与实验支持，确保模型训练与迭代所需的数据质量和效率。\n - **Yaqin Huang** – 支持性贡献者；参与系统的模型评估与效果跟踪，提供反馈以支持迭代改进。\n - **Yunhan Xu** – 支持性贡献者；在录音与数据采集方面提供指导，同时从产品与运营角度提出反馈，以提升易用性和实际应用价值。\n - **Yuelang Sun** – 支持性贡献者；在音频录制与数据采集方面提供专业支持，确保用于模型训练与评估的高质量数据。\n - **Yihuang Liang** - 支持性贡献者；负责系统的模型评估与项目推广，帮助 IndexTTS 扩大影响力与用户参与度。\n\n### 技术指导\n - **Huyang Sun** - 为 IndexTTS 项目提供了强有力的支持，确保战略方向一致及资源保障。\n - **Bin Xia** - 参与技术方案的评审、优化及后续跟进工作，重点关注模型的有效性。\n\n\n## 📚 引用\n\n🌟 如果您觉得我们的工作有所帮助，请为我们点亮星标并引用我们的论文。\n\n\nIndexTTS2：\n\n```\n@article{zhou2025indextts2,\n  title={IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech},\n  author={Siyi Zhou and Yiquan Zhou and Yi He and Xun Zhou and Jinchao Wang and Wei Deng and Jingchen Shu},\n  journal={arXiv preprint arXiv:2506.21619},\n  year={2025}\n}\n```\n\n\nIndexTTS：\n\n```\n@article{deng2025indextts,\n  title={IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System},\n  author={Wei Deng and Siyi Zhou and Jingchen Shu and Jinchao Wang and Lu Wang},\n  journal={arXiv preprint arXiv:2502.05512},\n  year={2025},\n  doi={10.48550\u002FarXiv.2502.05512},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05512}\n}\n```","# IndexTTS2 快速上手指南\n\nIndexTTS2 是一款突破性的自回归零样本语音合成模型，支持精确的语音时长控制和高表现力的情感合成，并能实现音色与情感的解耦控制。\n\n## 1. 环境准备\n\n### 系统要求\n- **操作系统**: Linux 或 Windows\n- **GPU**: 推荐 NVIDIA 显卡，需安装 **CUDA Toolkit 12.8** 或更高版本以启用加速。\n- **前置工具**:\n  - [Git](https:\u002F\u002Fgit-scm.com\u002Fdownloads)\n  - [Git LFS](https:\u002F\u002Fgit-lfs.com\u002F) (用于下载大模型文件)\n\n### 前置检查\n确保已安装 Git 并启用 LFS：\n```bash\ngit lfs install\n```\n\n## 2. 安装步骤\n\n### 第一步：克隆仓库\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts.git && cd index-tts\ngit lfs pull\n```\n\n### 第二步：安装依赖管理工具 uv\n官方强烈推荐使用 `uv` 进行环境管理（比 pip 快且能避免依赖冲突）。\n```bash\npip install -U uv\n```\n\n### 第三步：创建虚拟环境并安装依赖\n使用国内镜像源加速下载（推荐阿里云或清华源）：\n\n**通用命令（包含所有额外功能，如 WebUI 和 DeepSpeed）：**\n```bash\nuv sync --all-extras --default-index "https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple"\n```\n*注：Windows 用户若安装 DeepSpeed 失败，可去掉 `--all-extras` 参数，仅安装基础依赖和 webui：`uv sync --extra webui --default-index "https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple"`*\n\n### 第四步：下载模型权重\n通过 `uv tool` 安装下载工具并从国内镜像或官方源拉取模型。\n\n**方案 A：使用 ModelScope（国内推荐）**\n```bash\nuv tool install "modelscope"\nmodelscope download --model IndexTeam\u002FIndexTTS-2 --local_dir checkpoints\n```\n\n**方案 B：使用 HuggingFace（需配置镜像）**\n若网络访问 HF 较慢，请先设置镜像环境变量：\n```bash\nexport HF_ENDPOINT="https:\u002F\u002Fhf-mirror.com"\n```\n然后执行：\n```bash\nuv tool install "huggingface-hub[cli,hf_xet]"\nhf download IndexTeam\u002FIndexTTS-2 --local-dir=checkpoints\n```\n\n> **提示**：首次运行时可能还会自动下载一些小型辅助模型，请保持网络通畅。\n\n## 3. 
基本使用\n\n### 启动 Web 演示界面 (推荐)\n这是最简单的使用方式，提供图形化界面进行语音合成、情感控制和时长调整。\n\n```bash\nuv run webui.py\n```\n\n启动成功后，在浏览器访问：\n`http:\u002F\u002F127.0.0.1:7860`\n\n### 进阶选项\n- **降低显存占用\u002F加速**: 在 WebUI 界面中勾选 **FP16** (半精度推理)，可显著减少显存使用并提升速度，且音质损失极小。\n- **查看帮助**: 如需命令行参数详情，运行：\n  ```bash\n  uv run webui.py -h\n  ```\n\n### 验证 GPU 加速\n如需确认环境是否正确识别了 GPU，可运行检测脚本：\n```bash\nuv run tools\u002Fgpu_check.py\n```","某短视频制作团队正在为一款情感类纪录片进行多语种配音，需要确保不同语言版本的旁白在语速、情绪和口型上严格同步。\n\n### 没有 index-tts 时\n- **口型对不上**：传统 TTS 模型无法精确控制语音时长，导致生成的音频与视频画面中人物的嘴部动作严重错位，后期需人工逐帧调整，耗时极长。\n- **情绪表达僵硬**：难以在复刻特定说话人音色的同时，精准注入“悲伤”或“激昂”等复杂情感，声音听起来像毫无感情的读稿机器。\n- **音色一致性差**：切换不同语种配音时，说话人的音色特征发生漂移，观众感觉像是换了一个人在说话，破坏了角色连贯性。\n- **试错成本高**：为了找到合适的语速和情感组合，往往需要反复生成数十个版本并人工筛选，严重拖慢项目交付进度。\n\n### 使用 index-tts 后\n- **精准时长控制**：利用 index-tts 的显式令牌数量指定功能，直接生成与视频片段毫秒级同步的音频，彻底解决了口型不同步的难题。\n- **情感与音色解耦**：通过独立的风格提示词，index-tts 能在完美保留原说话人音色的基础上，生动还原剧本要求的细腻情感，表现力大幅提升。\n- **零样本跨语种复用**：仅需一段参考音频，index-tts 即可在零样本设置下让同一“声音”流畅演绎多种语言，确保了多语种版本的角色统一性。\n- **文本指令引导**：借助基于 Qwen3 微调的软指令机制，制作人员只需输入“带着哽咽的语气”等自然语言描述，即可一次性生成高质量音频，无需反复试错。\n\nindex-tts 通过工业级的时长控制与情感解耦能力，将视频配音从繁琐的“手工修补”转变为高效的“精准生成”，极大提升了多媒体内容的生产质量与效率。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Findex-tts_index-tts_93477413.png",null,"https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Findex-tts_faab71ec.png","https:\u002F\u002Fgithub.com\u002Findex-tts",[77,81,85,89],{"name":78,"color":79,"percentage":80},"Python","#3572A5",98.4,{"name":82,"color":83,"percentage":84},"Cuda","#3A4E3A",0.9,{"name":86,"color":87,"percentage":88},"C","#555555",0.6,{"name":90,"color":91,"percentage":92},"C++","#f34b7d",0.1,19976,2455,"2026-04-13T02:23:14","NOASSERTION","Linux, Windows","需要 NVIDIA GPU，需安装 CUDA Toolkit 12.8 或更高版本以支持加速；支持 FP16 推理以降低显存占用（具体显存大小未说明，建议使用较大显存显卡）","未说明",{"notes":101,"python":102,"dependencies":103},"必须使用 uv 包管理器进行环境搭建和依赖安装，官方不支持 conda 或 pip 直接安装，否则可能导致缺少 GPU 加速或出现随机错误。Windows 用户若安装 DeepSpeed 困难可跳过该选项。首次运行会自动下载额外的小模型，若网络访问 HuggingFace 缓慢，建议配置国内镜像源或设置 HF_ENDPOINT 环境变量。","由 uv 自动管理（版本未明确指定，但需兼容 PyTorch 及 DeepSpeed）",[104,105,106,107,108],"torch","deepspeed (可选)","huggingface-hub","modelscope","webui (可选)",[15,110],"音频",[112,113,114,115,116,117,118],"bigvgan","cross-lingual","indextts","text-to-speech","tts","voice-clone","zero-shot-tts","2026-03-27T02:49:30.150509","2026-04-13T17:44:19.011507",[122,127,132,137,142,147,151],{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},31798,"在 Windows 上安装时遇到 deepspeed 构建失败或 setuptools 缺失错误怎么办？","Windows 环境下直接安装 deepspeed 常因缺少编译环境而失败。建议的解决方案是：\n1. 使用 WSL2 (Windows Subsystem for Linux) 进行安装和运行，步骤与 Linux 基本一致，且更稳定快速。\n2. 如果必须在 Windows 原生环境运行，尝试使用 cmd 或 PowerShell 而不是 git-bash 启动，以避免 DLL 兼容性问题。\n3. 确保在激活的虚拟环境中安装依赖，有时需要手动安装 setuptools 或预编译的 deepspeed wheel。","https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts\u002Fissues\u002F392",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},31799,"生成的音频语速不均匀、短句之间停顿异常或长文本生成不稳定如何解决？","这是由于单次推理的文本过长导致的。解决方案如下：\n1. 不要一次性输入几百字的长文本。\n2. 建议在代码或预处理阶段，以长停顿标点（如句号、问号、感叹号、分号）为界将文本分句。\n3. 控制每个句子的长度最好不要超过 50 个字。\n4. 以句子为单位分别进行推理，最后将生成的音频片段按顺序拼接起来，即可获得稳定的效果。
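\n\n下面是一段按上述思路编写的最小示意脚本（`tts.infer` 的调用方式取自本仓库 README；分句正则与基于标准库 wave 的拼接方法为示例假设，要求各片段采样参数一致）：\n\n```python\nimport re\nimport wave\n\nfrom indextts.infer_v2 import IndexTTS2\n\ntts = IndexTTS2(cfg_path="checkpoints\u002Fconfig.yaml", model_dir="checkpoints", use_fp16=False, use_cuda_kernel=False, use_deepspeed=False)\n\nlong_text = "第一句话。第二句话！第三句话？"\n# 以长停顿标点（。！？；）为界分句，并过滤空白片段\nsentences = [s for s in re.split(r"(?\u003C=[。！？；])", long_text) if s.strip()]\n\nclips = []\nfor i, sent in enumerate(sentences):\n    out = f"clip_{i}.wav"\n    tts.infer(spk_audio_prompt="examples\u002Fvoice_07.wav", text=sent, output_path=out)\n    clips.append(out)\n\n# 按顺序拼接全部片段（假设各片段的声道数、位宽与采样率一致）\nwith wave.open("merged.wav", "wb") as merged:\n    for i, path in enumerate(clips):\n        with wave.open(path, "rb") as clip:\n            if i == 0:\n                merged.setparams(clip.getparams())\n            merged.writeframes(clip.readframes(clip.getnframes()))\n```","https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts\u002Fissues\u002F10",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},31800,"生成的声音不像参考音源，或者音质奇怪、速度慢怎么办？","如果输出声音与参考音源差异巨大，请检查以下几点：\n1. 确认参考音频的质量和清晰度，嘈杂的录音会影响克隆效果。\n2. 尝试使用 WSL2 (Linux 子系统) 运行，许多用户反馈 Linux 环境下的推理比 Windows 原生更可靠且速度更快。\n3. 检查是否使用了正确的模型权重文件，确保 GPT 和声码器权重已正确加载。\n4. 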
对于英语语音，尝试调整情感控制参数（如使用 "Melancholic" 滑块，值设在 0.2-0.65 之间）或使用 "Same as the voice reference" 模式以获得更好的音色保留。","https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts\u002Fissues\u002F410",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},31801,"启动 WebUI 时卡在 'Failed to load custom CUDA kernel for BigVGAN' 怎么办？","此问题通常发生在 Windows 上使用 git-bash 启动时，存在 DLL 兼容性冲突。解决方法：\n1. 不要在 git-bash 中运行启动命令。\n2. 改用 Windows 原生的 cmd (命令提示符) 或 PowerShell 终端来执行 `uv run webui.py` 或启动脚本。\n3. 拉取最新的代码版本，部分内核加载问题可能已在更新中修复。","https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts\u002Fissues\u002F172",{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},31802,"git clone 仓库时提示 LFS 配额用尽 (exceeded its LFS budget) 导致文件下载失败怎么办？","这是因为仓库的 Git LFS 带宽或存储配额已用完。临时解决方案包括：\n1. 等待仓库所有者增加配额或购买更多容量。\n2. 尝试手动从其他来源下载缺失的大文件（如 examples 目录下的 wav 文件），如果项目提供了备用下载链接（如 HuggingFace 或百度网盘）。\n3. 如果是开发者，可以联系维护者询问是否有镜像仓库或替代下载方式。","https:\u002F\u002Fgithub.com\u002Findex-tts\u002Findex-tts\u002Fissues\u002F316",{"id":148,"question_zh":149,"answer_zh":150,"source_url":126},31803,"如何优化英语语音生成的情感表现和自然度？","针对英语语音生成，推荐以下三种设置以提升效果：\n1. 最佳音色保留：如果输入也是英语说话人，选择 "Same as the voice reference" 模式。\n2. 情感混合：选择 "Use emotion reference audio" 并提供额外的情感参考音频，通过调整 "Emotion control weight" 找到最自然的混合比例。\n3. 手动微调：使用 "Use emotion vectors" 模式进行完全手动控制。生成英语时，将 "Melancholic" 滑块设置在 0.2 到 0.65 之间通常效果显著，同时配合 "Emotion weight" 滑块控制整体强度。",{"id":152,"question_zh":153,"answer_zh":154,"source_url":131},31804,"文本中的特殊符号（如 '-'、';'）被误读成汉字或怪音怎么办？","这是前端文本归一化（Text Normalization）的问题。例如 '-' 被错误地朗读为“减”。\n1. 这是一个已知的前端 Badcase，维护者计划统一修复。\n2. 紧急情况下，用户可以自行修改 `indextts\u002Finfer.py` (约第 63 行入口) 相关的文本处理逻辑，添加自定义的正则替换规则，将特殊符号替换为对应的读音或直接移除。\n3. 建议对中英文符号进行全面测试，并在推理前对文本进行预处理清洗。
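\n\n下面是一个推理前预处理的最小示意（`normalize` 为本示例自拟的辅助函数，符号映射与正则均为假设，应按实际读音需求调整）：\n\n```python\nimport re\n\n# 示例映射：将易被误读的符号替换为期望读音（纯属示意，按需增删）\nREPLACEMENTS = {"-": "到", ";": "，"}\n\ndef normalize(text: str) -> str:\n    for sym, spoken in REPLACEMENTS.items():\n        text = text.replace(sym, spoken)\n    # 移除其余未映射的非常规符号（示例正则，可自行扩充）\n    return re.sub(r"[#*_|]", "", text)\n\n# "气温 20-25 度; 请多加衣物。" -> "气温 20到25 度， 请多加衣物。"\nclean_text = normalize("气温 20-25 度; 请多加衣物。")\n```",[156],{"id":157,"version":158,"summary_zh":73,"released_at":159},239025,"v1.5.0","2025-09-01T10:38:01"]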