[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-SUC-DriverOld--so-vits-svc-Deployment-Documents":3,"tool-SUC-DriverOld--so-vits-svc-Deployment-Documents":61},[4,18,28,37,45,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":24,"last_commit_at":25,"category_tags":26,"status":17},9989,"n8n","n8n-io\u002Fn8n","n8n 是一款面向技术团队的公平代码（fair-code）工作流自动化平台，旨在让用户在享受低代码快速构建便利的同时，保留编写自定义代码的灵活性。它主要解决了传统自动化工具要么过于封闭难以扩展、要么完全依赖手写代码效率低下的痛点，帮助用户轻松连接 400 多种应用与服务，实现复杂业务流程的自动化。\n\nn8n 特别适合开发者、工程师以及具备一定技术背景的业务人员使用。其核心亮点在于“按需编码”：既可以通过直观的可视化界面拖拽节点搭建流程，也能随时插入 JavaScript 或 Python 代码、调用 npm 包来处理复杂逻辑。此外，n8n 原生集成了基于 LangChain 的 AI 能力，支持用户利用自有数据和模型构建智能体工作流。在部署方面，n8n 提供极高的自由度，支持完全自托管以保障数据隐私和控制权，也提供云端服务选项。凭借活跃的社区生态和数百个现成模板，n8n 让构建强大且可控的自动化系统变得简单高效。",184740,2,"2026-04-19T23:22:26",[16,14,13,15,27],"插件",{"id":29,"name":30,"github_repo":31,"description_zh":32,"stars":33,"difficulty_score":10,"last_commit_at":34,"category_tags":35,"status":17},10095,"AutoGPT","Significant-Gravitas\u002FAutoGPT","AutoGPT 是一个旨在让每个人都能轻松使用和构建 AI 的强大平台，核心功能是帮助用户创建、部署和管理能够自动执行复杂任务的连续型 AI 智能体。它解决了传统 AI 应用中需要频繁人工干预、难以自动化长流程工作的痛点，让用户只需设定目标，AI 即可自主规划步骤、调用工具并持续运行直至完成任务。\n\n无论是开发者、研究人员，还是希望提升工作效率的普通用户，都能从 AutoGPT 中受益。开发者可利用其低代码界面快速定制专属智能体；研究人员能基于开源架构探索多智能体协作机制；而非技术背景用户也可直接选用预置的智能体模板，立即投入实际工作场景。\n\nAutoGPT 的技术亮点在于其模块化“积木式”工作流设计——用户通过连接功能块即可构建复杂逻辑，每个块负责单一动作，灵活且易于调试。同时，平台支持本地自托管与云端部署两种模式，兼顾数据隐私与使用便捷性。配合完善的文档和一键安装脚本，即使是初次接触的用户也能在几分钟内启动自己的第一个 AI 智能体。AutoGPT 正致力于降低 AI 应用门槛，让人人都能成为 AI 的创造者与受益者。",183572,"2026-04-20T04:47:55",[13,36,27,14,15],"语言模型",{"id":38,"name":39,"github_repo":40,"description_zh":41,"stars":42,"difficulty_score":10,"last_commit_at":43,"category_tags":44,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":46,"name":47,"github_repo":48,"description_zh":49,"stars":50,"difficulty_score":24,"last_commit_at":51,"category_tags":52,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 
<div align="center">

# SoftVC VITS Singing Voice Conversion Local Deployment Tutorial
[![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SUC-DriverOld/so-vits-svc-Deployment-Documents/blob/4.1/sovits4_for_colab.ipynb) <br>
English | [简体中文](README_zh_CN.md) <br>
This help document provides detailed installation, debugging, and inference tutorials for the project [so-vits-svc](https://github.com/svc-develop-team/so-vits-svc). You can also refer directly to the official [README](https://github.com/svc-develop-team/so-vits-svc#readme) documentation. <br>
Written by Sucial. [Bilibili](https://space.bilibili.com/445022409) | [Github](https://github.com/SUC-DriverOld)

</div>

---

✨ **Click to view: [Accompanying Video Tutorial](https://www.bilibili.com/video/BV1Hr4y197Cy/) | [UVR5 Vocal Separation Tutorial](https://www.bilibili.com/video/BV1F4421c7qU/) (Note: The accompanying video may be outdated.
Refer to the latest tutorial documentation for accurate information!)**

✨ **Related Resources: [Official README Documentation](https://github.com/svc-develop-team/so-vits-svc) | [Common Error Solutions](https://www.yuque.com/umoubuton/ueupp5/ieinf8qmpzswpsvr) | [羽毛布団](https://space.bilibili.com/3493141443250876)**

> [!IMPORTANT]
>
> 1. **Important! Read this first!** If you do not want to configure the environment manually or are looking for an integration package, please use the integration package by [羽毛布団](https://space.bilibili.com/3493141443250876).
> 2. **About old version tutorials**: For the so-vits-svc 3.0 tutorial, please switch to the [3.0 branch](https://github.com/SUC-DriverOld/so-vits-svc-Chinese-Detaild-Documents/tree/3.0), which is no longer being updated.
> 3. **Continuous improvement of this documentation**: If you encounter errors not mentioned in this document, you can ask questions in the issues section. For project bugs, please report issues to the original project. If you want to improve this tutorial, feel free to submit a PR!

# Tutorial Index

- [SoftVC VITS Singing Voice Conversion Local Deployment Tutorial](#softvc-vits-singing-voice-conversion-local-deployment-tutorial)
- [Tutorial Index](#tutorial-index)
- [0. Before You Use](#0-before-you-use)
    - [Any country, region, organization, or individual using this project must comply with the following laws:](#any-country-region-organization-or-individual-using-this-project-must-comply-with-the-following-laws)
      - [《民法典》](#民法典)
      - [第一千零一十九条](#第一千零一十九条)
      - [第一千零二十四条](#第一千零二十四条)
      - [第一千零二十七条](#第一千零二十七条)
      - [《中华人民共和国宪法》|《中华人民共和国刑法》|《中华人民共和国民法典》|《中华人民共和国合同法》](#中华人民共和国宪法中华人民共和国刑法中华人民共和国民法典中华人民共和国合同法)
  - [0.1 Usage Regulations](#01-usage-regulations)
  - [0.2 Hardware Requirements](#02-hardware-requirements)
  - [0.3 Preparation](#03-preparation)
- [1. Environment Dependencies](#1-environment-dependencies)
  - [1.1 so-vits-svc4.1 Source Code](#11-so-vits-svc41-source-code)
  - [1.2 Python](#12-python)
  - [1.3 Pytorch](#13-pytorch)
  - [1.4 Installation of Other Dependencies](#14-installation-of-other-dependencies)
  - [1.5 FFmpeg](#15-ffmpeg)
- [2. Configuration and Training](#2-configuration-and-training)
  - [2.1 Issues Regarding Compatibility with the 4.0 Model](#21-issues-regarding-compatibility-with-the-40-model)
  - [2.2 Pre-downloaded Model Files](#22-pre-downloaded-model-files)
    - [2.2.1 Mandatory Items](#221-mandatory-items)
      - [Detailed Explanation of Each Encoder](#detailed-explanation-of-each-encoder)
    - [2.2.2 Pre-trained Base Model (Strongly Recommended)](#222-pre-trained-base-model-strongly-recommended)
    - [2.2.3 Optional Items (Choose as Needed)](#223-optional-items-choose-as-needed)
  - [2.3 Data Preparation](#23-data-preparation)
  - [2.4 Data Preprocessing](#24-data-preprocessing)
    - [2.4.0 Audio Slicing](#240-audio-slicing)
    - [2.4.1 Resampling to 44100Hz Mono](#241-resampling-to-44100hz-mono)
    - [2.4.2 Automatic Dataset Splitting and Configuration File Generation](#242-automatic-dataset-splitting-and-configuration-file-generation)
      - [Using Loudness Embedding](#using-loudness-embedding)
    - [2.4.3 Modify Configuration Files as Needed](#243-modify-configuration-files-as-needed)
      - [config.json](#configjson)
      - [diffusion.yaml](#diffusionyaml)
    - [2.4.3 Generating Hubert and F0](#243-generating-hubert-and-f0)
      - [Pros and Cons of Each F0 Predictor](#pros-and-cons-of-each-f0-predictor)
  - [2.5 Training](#25-training)
    - [2.5.1 Main Model Training (Required)](#251-main-model-training-required)
    - [2.5.2 Diffusion Model (Optional)](#252-diffusion-model-optional)
    - [2.5.3 Tensorboard](#253-tensorboard)
- [3. Inference](#3-inference)
  - [3.1 Command-line Inference](#31-command-line-inference)
  - [3.2 webUI Inference](#32-webui-inference)
- [4. Optional Enhancements](#4-optional-enhancements)
  - [4.1 Automatic F0 Prediction](#41-automatic-f0-prediction)
  - [4.2 Clustering Timbre Leakage Control](#42-clustering-timbre-leakage-control)
  - [4.3 Feature Retrieval](#43-feature-retrieval)
  - [4.4 Vocoder Fine-tuning](#44-vocoder-fine-tuning)
  - [4.5 Directories for Saved Models](#45-directories-for-saved-models)
- [5. Other Optional Features](#5-other-optional-features)
  - [5.1 Model Compression](#51-model-compression)
  - [5.2 Voice Mixing](#52-voice-mixing)
    - [5.2.1 Static Voice Mixing](#521-static-voice-mixing)
    - [5.2.2 Dynamic Voice Mixing](#522-dynamic-voice-mixing)
  - [5.3 Onnx Export](#53-onnx-export)
- [6. Simple Mixing and Exporting Finished Product](#6-simple-mixing-and-exporting-finished-product)
    - [Use Audio Host Software to Process Inferred Audio](#use-audio-host-software-to-process-inferred-audio)
- [Appendix: Common Errors and Solutions](#appendix-common-errors-and-solutions)
  - [About Out of Memory (OOM)](#about-out-of-memory-oom)
  - [Common Errors and Solutions When Installing Dependencies](#common-errors-and-solutions-when-installing-dependencies)
  - [Common Errors During Dataset Preprocessing and Model Training](#common-errors-during-dataset-preprocessing-and-model-training)
  - [Errors When Using WebUI](#errors-when-using-webui)
- [Acknowledgements](#acknowledgements)

# 0. Before You Use

### Any country, region, organization, or individual using this project must comply with the following laws:

#### 《[民法典](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)》

#### 第一千零一十九条

任何组织或者个人**不得**以丑化、污损，或者利用信息技术手段伪造等方式侵害他人的肖像权。**未经**肖像权人同意，**不得**制作、使用、公开肖像权人的肖像，但是法律另有规定的除外。**未经**肖像权人同意，肖像作品权利人不得以发表、复制、发行、出租、展览等方式使用或者公开肖像权人的肖像。对自然人声音的保护，参照适用肖像权保护的有关规定。
**对自然人声音的保护，参照适用肖像权保护的有关规定**

#### 第一千零二十四条

【名誉权】民事主体享有名誉权。任何组织或者个人**不得**以侮辱、诽谤等方式侵害他人的名誉权。

#### 第一千零二十七条

【作品侵害名誉权】行为人发表的文学、艺术作品以真人真事或者特定人为描述对象，含有侮辱、诽谤内容，侵害他人名誉权的，受害人有权依法请求该行为人承担民事责任。行为人发表的文学、艺术作品不以特定人为描述对象，仅其中的情节与该特定人的情况相似的，不承担民事责任。

#### 《[中华人民共和国宪法](http://www.gov.cn/guoqing/2018-03/22/content_5276318.htm)》|《[中华人民共和国刑法](http://gongbao.court.gov.cn/Details/f8e30d0689b23f57bfc782d21035c3.html?sw=中华人民共和国刑法)》|《[中华人民共和国民法典](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)》|《[中华人民共和国合同法](http://www.npc.gov.cn/zgrdw/npc/lfzt/rlyw/2016-07/01/content_1992739.htm)》

## 0.1 Usage Regulations

> [!WARNING]
>
> 1. **This tutorial is for communication and learning purposes only. Do not use it for illegal activities, violations of public order, or other unethical purposes. Out of respect for the providers of audio sources, do not use this for inappropriate purposes.**
> 2. **Continuing to use this tutorial implies agreement with the related regulations described herein. This tutorial fulfills its obligation to provide guidance and is not responsible for any subsequent issues that may arise.**
> 3. **Please resolve dataset authorization issues yourself. Do not use unauthorized datasets for training! Any issues arising from the use of unauthorized datasets are your own responsibility and have no connection to the repository, the repository maintainers, the svc develop team, or the tutorial authors.**

Specific usage regulations are as follows:

- The content of this tutorial represents personal views only and does not represent the views of the so-vits-svc team or the original authors.
- This tutorial assumes the use of the repository maintained by the so-vits-svc team. Please comply with the open-source licenses of any open-source code involved.
- Any videos based on sovits made and posted on video platforms must clearly indicate in the description the source of the input vocals or audio used for the voice converter. For example, if using someone else's video or audio as the input source after vocal separation, a clear link to the original video or music must be provided. If using your own voice or audio synthesized by other vocal synthesis engines, this must also be indicated in the description.
- Ensure the data sources used to create datasets are legal and compliant, and that data providers are aware of what you are creating and the potential consequences. You are solely responsible for any infringement issues arising from the input sources. When using other commercial vocal synthesis software as input sources, ensure you comply with the software's usage terms. Note that many vocal synthesis engine usage terms explicitly prohibit using them as input sources for conversion!
- Cloud training and inference may involve financial costs. If you are a minor, please obtain permission and understanding from your guardian before proceeding.
This tutorial is not responsible for any subsequent issues arising from unauthorized use.
- Local training (especially on less powerful hardware) may require prolonged high-load operation of the device. Ensure proper maintenance and cooling measures for your device.
- Due to equipment reasons, this tutorial has only been tested on Windows systems. For Mac and Linux, ensure you have some problem-solving capability.
- Continuing to use this repository implies agreement with the related regulations described in the README. This README fulfills its obligation to provide guidance and is not responsible for any subsequent issues that may arise.

## 0.2 Hardware Requirements

1. Training **must** be conducted using a GPU! For inference, which can be done via **command-line inference** or **WebUI inference**, either a CPU or GPU can be used if speed is not a primary concern.
2. If you plan to train your own model, prepare an **NVIDIA graphics card with at least 6GB of dedicated memory**.
3. Ensure your computer's virtual memory is set to **at least 30GB**, and it is best if it is set on an SSD, otherwise it will be very slow.
4. For cloud training, it is recommended to use [Google Colab](https://colab.google/); you can configure it according to our provided [sovits4_for_colab.ipynb](./sovits4_for_colab.ipynb).

## 0.3 Preparation

1. Prepare at least 30 minutes (the more, the better!) of **clean vocals** as your training set, with **no background noise and no reverb**. It is best to maintain a **consistent timbre** while singing, ensure a **wide vocal range (the vocal range of the training set determines the range of the trained model!)**, and have an **appropriate loudness**. If possible, perform **loudness matching** using audio processing software such as Audition. (See the sketch at the end of this section for a quick way to total up your material.)
2. **Important!** Download the necessary **base model** for training in advance. Refer to [2.2.2 Pre-trained Base Model](#222-pre-trained-base-model-strongly-recommended).
3. For inference: Prepare **dry vocals** with **background noise <30dB** and preferably **without reverb or harmonies**.

> [!NOTE]
>
> **Note 1**: Both singing and speaking can be used as training sets, but using speech may lead to **issues with high and low notes during inference (commonly known as range issues/muted sound)**, as the vocal range of the training set largely determines the vocal range of the trained model. Therefore, if your final goal is singing, it is recommended to use singing vocals as your training set.
>
> **Note 2**: When using a male voice model to infer songs sung by female singers, if there is noticeable muting, try lowering the pitch (usually by 12 semitones, or one octave). Similarly, when using a female voice model to infer songs sung by male singers, you can try raising the pitch.
>
> **✨ Latest Recommendation as of 2024.3.8 ✨**: Currently, the [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS) project's TTS (Text-to-Speech), compared to so-vits-svc's TTS, requires a smaller training set, trains faster, and yields better results. Therefore, if you want to use the speech synthesis function, please switch to [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS). Consequently, it is recommended to use singing vocals as the training set for this project.
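
As a convenience, the following throwaway sketch (not part of so-vits-svc; the script name and the `dataset_raw` location are assumptions) totals the duration of PCM `.wav` files in a folder so you can confirm you have roughly 30+ minutes of material before training. The standard-library `wave` module only reads uncompressed PCM WAV files.

```python
# check_dataset_duration.py -- illustrative helper, not part of so-vits-svc.
# Sums the duration of all PCM .wav files under a folder to check that you
# have roughly 30+ minutes of clean vocals before training.
import wave
from pathlib import Path

def total_minutes(folder: str) -> float:
    seconds = 0.0
    for path in Path(folder).rglob("*.wav"):
        with wave.open(str(path), "rb") as wav:
            seconds += wav.getnframes() / wav.getframerate()
    return seconds / 60

if __name__ == "__main__":
    minutes = total_minutes("dataset_raw")  # assumed location of your raw clips
    print(f"Total vocal material: {minutes:.1f} min")
    if minutes < 30:
        print("Consider collecting more material (30+ minutes is recommended).")
```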

# 1. Environment Dependencies

✨ **Required environment for this project**: [NVIDIA-CUDA](https://developer.nvidia.com/cuda-toolkit) | [Python](https://www.python.org/) = 3.8.9 (this version is recommended) | [Pytorch](https://pytorch.org/get-started/locally/) | [FFmpeg](https://ffmpeg.org/)

✨ **You can also try using my script for one-click environment setup and webUI launch: [so-vits-svc-webUI-QuickStart-bat](https://github.com/SUC-DriverOld/so-vits-svc-webUI-QuickStart-bat)**

## 1.1 so-vits-svc4.1 Source Code

You can download or clone the source code using one of the following methods:

1. **Download the source code ZIP file from the Github project page**: Go to the [so-vits-svc official repository](https://github.com/svc-develop-team/so-vits-svc), click the green `Code` button at the top right, and select `Download ZIP` to download the compressed file. If you need the code from another branch, switch to that branch first. After downloading, extract the ZIP file to any directory, which will serve as your working directory.

2. **Clone the source code using git**: Use the following command:

   ```bash
   git clone https://github.com/svc-develop-team/so-vits-svc.git
   ```

## 1.2 Python

- Go to the [Python official website](https://www.python.org/) to download Python 3.8 and **add it to the system environment PATH**. Detailed installation steps and adding it to Path are omitted here, as they can easily be found online.

```bash
# Conda configuration method, replace YOUR_ENV_NAME with the name of the virtual environment you want to create.
conda create -n YOUR_ENV_NAME python=3.8 -y
conda activate YOUR_ENV_NAME
# Ensure you are in this virtual environment before executing any commands!
```

- After installation, enter `python` in the command prompt. If the output is similar to the following, the installation was successful:

  ```bash
  Python 3.8.9 (tags/v3.8.9:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)] on win32
  Type "help", "copyright", "credits" or "license" for more information.
  >>>
  ```

**Regarding the Python version**: After testing, we found that Python 3.8.9 can stably run this project (though higher versions may also work).

## 1.3 Pytorch

> [!IMPORTANT]
>
> ✨ We highly recommend installing a PyTorch build for CUDA 11.7 or 11.8. Builds for CUDA 12.0 and above may not be compatible with the current project.

- We need to **separately install** the `torch`, `torchaudio`, and `torchvision` libraries. Go directly to the [Pytorch official website](https://pytorch.org/get-started/locally/), choose the desired version, and copy the command displayed in the "Run this Command" section into the console to install. You can download older versions of Pytorch from [here](https://pytorch.org/get-started/previous-versions/).

- After installing `torch`, `torchaudio`, and `torchvision`, use the following command in the cmd console to check whether torch can successfully call CUDA. If the last line shows `True`, it's successful; if it shows `False`, it's unsuccessful and you need to reinstall the correct version.

```bash
python
# Press Enter to run
import torch
# Press Enter to run
print(torch.cuda.is_available())
# Press Enter to run
```
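
If you prefer a non-interactive check, the same test can be run as a one-liner; this is only a convenience variant of the interactive session above and also prints which CUDA version your torch build targets (it prints `None` for CPU-only builds).

```bash
# Prints the installed torch version, the CUDA version it was built against, and whether a GPU is visible
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```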

> [!NOTE]
>
> 1. If you need to specify the version of `torch` manually, simply add the version number afterward. For example, `pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117`.
> 2. When installing PyTorch for CUDA 11.7, you may encounter the error `ERROR: Package 'networkx' requires a different Python: 3.8.9 not in '>=3.9'`. In this case, first execute `pip install networkx==3.0`, then proceed with the PyTorch installation to avoid similar errors.
> 3. Due to version updates, you may no longer be able to copy a download command for the CUDA 11.7 build of PyTorch from the official page. In this case, you can directly copy the installation command below, or download older versions from [here](https://pytorch.org/get-started/previous-versions/).

```bash
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
```

## 1.4 Installation of Other Dependencies

> [!IMPORTANT]
> ✨ Before starting the installation of other dependencies, **make sure to download and install** [Visual Studio 2022](https://visualstudio.microsoft.com/) or [Microsoft C++ Build Tools](https://visualstudio.microsoft.com/zh-hans/visual-cpp-build-tools/) (the latter has a smaller size). **Select and install the component package "Desktop development with C++"**, then execute the modification and wait for the installation to complete.

- Right-click in the folder obtained from [1.1](#11-so-vits-svc41-source-code) and select **Open in Terminal**. Use the following command to first update `pip`, `wheel`, and `setuptools`.

```bash
pip install --upgrade pip==23.3.2 wheel setuptools
```

- Execute the following command to install the libraries (**if errors occur, please retry until everything installs without errors**). Note that there are three `requirements` txt files in the project folder; here, select `requirements_win.txt`.

```bash
pip install -r requirements_win.txt
```

- After ensuring the installation is **correct and error-free**, use the following commands to update `fastapi`, `pydantic`, and `gradio` to the required versions:

```bash
pip install --upgrade fastapi==0.84.0
pip install --upgrade pydantic==1.10.12
pip install --upgrade gradio==3.41.2
```

## 1.5 FFmpeg

- Go to the [FFmpeg official website](https://ffmpeg.org/) to download FFmpeg. Unzip it to any location, then add the path to its `bin` folder (e.g., `.\ffmpeg\bin`) to the system environment variable Path (detailed installation steps and adding it to Path are omitted here, as they can easily be found online).

- After installation, enter `ffmpeg -version` in the cmd console. If the output is similar to the following, the installation was successful:

```bash
ffmpeg version git-2020-08-12-bb59bdb Copyright (c) 2000-2020 the FFmpeg developers
built with gcc 10.2.1 (GCC) 20200805
configuration: a bunch of configuration details here
libavutil      56. 58.100 / 56. 58.100
libavcodec     58.100.100 / 58.100.100
...
```
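
A quick way to confirm that FFmpeg works (and to convert stray source material into a usable format) is a one-off conversion like the one below; the file names are placeholders, and this does not replace the project's own `resample.py` preprocessing step described later.

```bash
# Example only: convert an arbitrary input file to 44.1 kHz mono WAV with FFmpeg
ffmpeg -i input_song.flac -ar 44100 -ac 1 output_vocals.wav
```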

# 2. Configuration and Training

✨ This section is the most crucial part of the entire tutorial document. It references the [official documentation](https://github.com/svc-develop-team/so-vits-svc#readme) and includes some explanations and clarifications for better understanding.

✨ Before diving into the content of the second section, please ensure that your computer's virtual memory is set to **30GB or above**, preferably on a solid-state drive (SSD). You can search online for specific instructions on how to do this.

## 2.1 Issues Regarding Compatibility with the 4.0 Model

- You can ensure support for the 4.0 model by modifying its `config.json`. You need to add the `speech_encoder` field under the `model` section in the `config.json`, as shown below:

```bash
  "model":
  {
    # Other contents omitted

    # "ssl_dim", fill in either 256 or 768, which should match the value below "speech_encoder"
    "ssl_dim": 256,
    # Number of speakers
    "n_speakers": 200,
    # or "vec768l12", but please note that the value here should match "ssl_dim" above. That is, 256 corresponds to vec256l9, and 768 corresponds to vec768l12.
    "speech_encoder":"vec256l9"
    # If you're unsure whether your model is vec768l12 or vec256l9, you can confirm by checking the value of the "gin_channels" field.

    # Other contents omitted
  }
```
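
For reference, a small sketch of what that edit can look like in code, assuming a 4.0-style config at `configs/config.json` (the script name and path are assumptions; the `vec256l9`/`vec768l12` choice follows the `ssl_dim` rule described above):

```python
# patch_40_config.py -- illustrative sketch, not an official so-vits-svc script.
# Adds the "speech_encoder" field to a 4.0-era config.json following the rule above:
# ssl_dim 256 -> vec256l9, ssl_dim 768 -> vec768l12.
import json

CONFIG = "configs/config.json"  # assumed path to the 4.0 model's config

with open(CONFIG, "r", encoding="utf-8") as f:
    cfg = json.load(f)

model = cfg["model"]
if "speech_encoder" not in model:
    model["speech_encoder"] = "vec256l9" if model.get("ssl_dim") == 256 else "vec768l12"

with open(CONFIG, "w", encoding="utf-8") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)

print("speech_encoder set to", model["speech_encoder"])
```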

## 2.2 Pre-downloaded Model Files

### 2.2.1 Mandatory Items

> [!WARNING]
>
> **You must select one of the following encoders to use:**
>
> - "vec768l12"
> - "vec256l9"
> - "vec256l9-onnx"
> - "vec256l12-onnx"
> - "vec768l9-onnx"
> - "vec768l12-onnx"
> - "hubertsoft-onnx"
> - "hubertsoft"
> - "whisper-ppg"
> - "cnhubertlarge"
> - "dphubert"
> - "whisper-ppg-large"
> - "wavlmbase+"

| Encoder | Download Link | Location | Description |
| --- | --- | --- | --- |
| contentvec (Recommended) | [checkpoint_best_legacy_500.pt](https://ibm.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr) | Place in `pretrain` directory | `vec768l12` and `vec256l9` require this encoder |
| | [hubert_base.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt) | Rename to checkpoint_best_legacy_500.pt, then place in `pretrain` directory | Same effect as the above `checkpoint_best_legacy_500.pt` but only 199MB |
| hubertsoft | [hubert-soft-0d54a1f4.pt](https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt) | Place in `pretrain` directory | Used by so-vits-svc3.0 |
| Whisper-ppg | [medium.pt](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt) | Place in `pretrain` directory | Compatible with `whisper-ppg` |
| whisper-ppg-large | [large-v2.pt](https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt) | Place in `pretrain` directory | Compatible with `whisper-ppg-large` |
| cnhubertlarge | [chinese-hubert-large-fairseq-ckpt.pt](https://huggingface.co/TencentGameMate/chinese-hubert-large/resolve/main/chinese-hubert-large-fairseq-ckpt.pt) | Place in `pretrain` directory | - |
| dphubert | [DPHuBERT-sp0.75.pth](https://huggingface.co/pyf98/DPHuBERT/resolve/main/DPHuBERT-sp0.75.pth) | Place in `pretrain` directory | - |
| WavLM | [WavLM-Base+.pt](https://valle.blob.core.windows.net/share/wavlm/WavLM-Base+.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D) | Place in `pretrain` directory | Download link might be problematic, unable to download |
| OnnxHubert/ContentVec | [MoeSS-SUBModel](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel/tree/main) | Place in `pretrain` directory | - |
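
For instance, the smaller `hubert_base.pt` can be fetched and renamed from the command line; the URL is the one given in the table above, and `wget` is assumed to be available (use `curl -L -o` on systems without it).

```bash
# Download the 199 MB ContentVec checkpoint and rename it as the tutorial requires
mkdir -p pretrain
wget -O pretrain/checkpoint_best_legacy_500.pt \
  https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt
```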

#### Detailed Explanation of Each Encoder

| Encoder Name | Advantages | Disadvantages |
| --- | --- | --- |
| `vec768l12` (Most Recommended) | Best voice fidelity, large base model, supports loudness embedding | Weak articulation |
| `vec256l9` | No particular advantages | Does not support diffusion models |
| `hubertsoft` | Strong articulation | Voice leakage |
| `whisper-ppg` | Strongest articulation | Voice leakage, high VRAM usage |

### 2.2.2 Pre-trained Base Model (Strongly Recommended)

- Pre-trained base model files: `G_0.pth`, `D_0.pth`. Place in the `logs/44k` directory.

- Diffusion model pre-trained base model file: `model_0.pt`. Place in the `logs/44k/diffusion` directory.

The diffusion model references the Diffusion Model from [DDSP-SVC](https://github.com/yxlllc/DDSP-SVC), and the base model is compatible with the diffusion model from [DDSP-SVC](https://github.com/yxlllc/DDSP-SVC). Some of the provided base model files are from the integration package of “[羽毛布団](https://space.bilibili.com/3493141443250876)”, to whom we express our gratitude.

**4.1 training base models are provided below; please download them yourself (a network connection outside mainland China may be required).**

| Encoder Type | Main Model Base | Diffusion Model Base | Description |
| --- | --- | --- | --- |
| vec768l12 | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec768l12/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec768l12/D_0.pth) | [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/768l12/model_0.pt) | If only training 100 steps of diffusion, i.e., `k_step_max = 100`, use [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/768l12/max100/model_0.pt) for the diffusion model |
| vec768l12 (with loudness embedding) | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec768l12/vol_emb/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec768l12/vol_emb/D_0.pth) | [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/768l12/model_0.pt) | If only training 100 steps of diffusion, i.e., `k_step_max = 100`, use [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/768l12/max100/model_0.pt) for the diffusion model |
| vec256l9 | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec256l9/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec256l9/D_0.pth) | Not supported | - |
| hubertsoft | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/hubertsoft/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/hubertsoft/D_0.pth) | [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/hubertsoft/model_0.pt) | - |
| whisper-ppg | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/whisper-ppg/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/whisper-ppg/D_0.pth) | [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/whisper-ppg/model_0.pt) | - |
| tiny (vec768l12_vol_emb) | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/tiny/vec768l12_vol_emb/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/tiny/vec768l12_vol_emb/D_0.pth) | - | TINY is based on the original So-VITS model with reduced network parameters, using depthwise separable convolutions and FLOW parameter sharing, which significantly reduces model size and improves inference speed. TINY is designed for real-time conversion; the reduced parameter count means its conversion quality is theoretically inferior to the original model. A real-time conversion GUI for So-VITS is under development; until then, training a TINY model is not recommended unless you have a specific need. |

> [!WARNING]
>
> Pre-trained models for other encoders not mentioned are not provided. Please train without base models, which may significantly increase training difficulty!

**Base Model and Support**

| Standard Base | Loudness Embedding | Loudness Embedding + TINY | Full Diffusion | 100-Step Shallow Diffusion |
| --- | --- | --- | --- | --- |
| Vec768L12 | Supported | Supported | Supported | Supported |
| Vec256L9 | Supported | Not Supported | Not Supported | Not Supported |
| hubertsoft | Supported | Not Supported | Supported | Not Supported |
| whisper-ppg | Supported | Not Supported | Supported | Not Supported |
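
Pulling the above together, a typical working directory looks roughly like this once the mandatory encoder and the recommended base models are in place (shown for the vec768l12 setup; file names and locations follow the tables above, everything else in the repository is omitted):

```
so-vits-svc/
├── pretrain/
│   └── checkpoint_best_legacy_500.pt      # ContentVec encoder
└── logs/
    └── 44k/
        ├── G_0.pth                        # main model base
        ├── D_0.pth
        └── diffusion/
            └── model_0.pt                 # diffusion base (optional)
```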

### 2.2.3 Optional Items (Choose as Needed)

**1. NSF-HIFIGAN**

If using the `NSF-HIFIGAN enhancer` or `shallow diffusion`, you need to download the pre-trained NSF-HIFIGAN model provided by [OpenVPI](https://github.com/openvpi/vocoders). If not needed, you can skip this.

- Pre-trained NSF-HIFIGAN vocoder:
  - Version 2022.12: [nsf_hifigan_20221211.zip](https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip);
  - Version 2024.02: [nsf_hifigan_44.1k_hop512_128bin_2024.02.zip](https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-44.1k-hop512-128bin-2024.02/nsf_hifigan_44.1k_hop512_128bin_2024.02.zip)
- After extracting, place the four files in the `pretrain/nsf_hifigan` directory.
- If you download version 2024.02 of the vocoder, rename `model.ckpt` to `model`, i.e., remove the file extension.

**2. RMVPE**

If using the `rmvpe` F0 predictor, you need to download the pre-trained RMVPE model.

- Download the model [rmvpe.zip](https://github.com/yxlllc/RMVPE/releases/download/230917/rmvpe.zip), which is currently recommended.
- Extract `rmvpe.zip`, rename the `model.pt` file to `rmvpe.pt`, and place it in the `pretrain` directory.

**3. FCPE (Preview Version)**

[FCPE](https://github.com/CNChTu/MelPE) (Fast Context-based Pitch Estimator) is a new F0 predictor developed independently by svc-develop-team, designed specifically for real-time voice conversion. It will become the preferred F0 predictor for Sovits real-time voice conversion in the future.

If using the `fcpe` F0 predictor, you need to download the pre-trained FCPE model.

- Download the model [fcpe.pt](https://huggingface.co/datasets/ylzz1997/rmvpe_pretrain_model/resolve/main/fcpe.pt).
- Place it in the `pretrain` directory.

## 2.3 Data Preparation

1. Organize the dataset into the `dataset_raw` directory according to the following file structure.

```
dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav
```

2. You can customize the names of the speakers.

```
dataset_raw
└───suijiSUI
    ├───1.wav
    ├───...
    └───25788785-20221210-200143-856_01_(Vocals)_0_0.wav
```

3. Additionally, you need to create and edit `config.json` in `dataset_raw`:

```json
{
  "n_speakers": 10,

  "spk": {
    "speaker0": 0,
    "speaker1": 1
  }
}
```

- `"n_speakers": 10`: the total number of speakers (counted from 1); it must match the speaker entries below.
- `"speaker0": 0`: `speaker0` is the speaker's name and can be changed; the numbers 0, 1, 2, ... are the speaker IDs, numbered from 0.

## 2.4 Data Preprocessing

### 2.4.0 Audio Slicing

- Slice the audio into `5s - 15s` segments. Slightly longer segments are acceptable, but excessively long segments may lead to out-of-memory errors during training or preprocessing.

- You can use [audio-slicer-GUI](https://github.com/flutydeer/audio-slicer) or [audio-slicer-CLI](https://github.com/openvpi/audio-slicer) for assistance with slicing. Generally, you only need to adjust the `Minimum Interval`. For regular speech material, the default value is usually sufficient, while for singing material, you may adjust it to `100` or even `50`.

- After slicing, manually handle audio that is too long (over 15 seconds) or too short (under 4 seconds): several short clips can be concatenated into one, and overly long clips can be split manually. (A small helper sketch for spotting such clips follows below.)

> [!WARNING]
>
> **If you are training with the Whisper-ppg speech encoder, all slices must be less than 30s in length.**
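
The following throwaway sketch (not part of so-vits-svc; the script name and `dataset_raw` location are assumptions) flags sliced PCM `.wav` clips that fall outside the 4–15 second guidance above so you can fix them by hand.

```python
# find_bad_slices.py -- illustrative helper, not part of so-vits-svc.
# Flags sliced clips that fall outside the 4-15 s guidance above.
import wave
from pathlib import Path

MIN_SEC, MAX_SEC = 4.0, 15.0

for path in sorted(Path("dataset_raw").rglob("*.wav")):
    with wave.open(str(path), "rb") as wav:
        seconds = wav.getnframes() / wav.getframerate()
    if seconds < MIN_SEC or seconds > MAX_SEC:
        print(f"{path}  {seconds:.1f}s")
```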

### 2.4.1 Resampling to 44100Hz Mono

Use the following command (if you have already performed loudness matching with other software, see the note below about `--skip_loudnorm`):

```bash
python resample.py
```

> [!NOTE]
>
> Although this project provides the script `resample.py` for resampling, converting to mono, and loudness matching, the default loudness matching normalizes to 0 dB, which may degrade audio quality. Additionally, the loudness normalization package `pyloudnorm` in Python cannot apply level limiting, which may lead to clipping. It is recommended to use professional audio processing software such as `Adobe Audition` for loudness matching. You can also use a loudness matching tool I developed, [Loudness Matching Tool](https://github.com/AI-Hobbyist/Loudness-Matching-Tool). If you have already performed loudness matching with other software, add `--skip_loudnorm` when running the above command to skip the loudness matching step. For example:

```bash
python resample.py --skip_loudnorm
```

### 2.4.2 Automatic Dataset Splitting and Configuration File Generation

Use the following command (if you want loudness embedding, see [Using Loudness Embedding](#using-loudness-embedding) below):

```bash
python preprocess_flist_config.py --speech_encoder vec768l12
```

The `speech_encoder` parameter has seven options, as explained in **[2.2.1 Mandatory Items and the Explanation of Each Encoder](#detailed-explanation-of-each-encoder)**. If you omit the `speech_encoder` parameter, the default value is `vec768l12`.

```
vec768l12
vec256l9
hubertsoft
whisper-ppg
whisper-ppg-large
cnhubertlarge
dphubert
```

#### Using Loudness Embedding

- When using loudness embedding, the trained model will match the loudness of the input source. Otherwise, it will match the loudness of the training set.
- If using loudness embedding, you need to add the `--vol_aug` parameter, for example:

```bash
python preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug
```

### 2.4.3 Modify Configuration Files as Needed

#### config.json

- `vocoder_name`: Select a vocoder, default is `nsf-hifigan`.
- `log_interval`: How often to output logs, default is `200`.
- `eval_interval`: How often to perform validation and save the model, default is `800`.
- `epochs`: Total number of training epochs, default is `10000`. Training will automatically stop after reaching this number of epochs.
- `learning_rate`: Learning rate; it is recommended to keep the default value.
- `batch_size`: The amount of data loaded onto the GPU for each training step; adjust it to a value your GPU memory can hold.
- `all_in_mem`: Load the entire dataset into memory. Enable this if disk IO is too slow on some platforms and the memory capacity is much larger than the dataset size.
- `keep_ckpts`: Number of recent models to keep during training, `0` to keep all. Default is to keep only the last `3` models.

**Vocoder Options**

```
nsf-hifigan
nsf-snake-hifigan
```
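
If you want to review these settings without hunting through the generated file, the following throwaway sketch (not part of the project; the script name is an assumption) prints the fields listed above from `configs/config.json` wherever they are nested.

```python
# show_train_settings.py -- illustrative sketch, not part of so-vits-svc.
# Prints the configuration fields discussed above from configs/config.json,
# without assuming where in the file they are nested.
import json

FIELDS = {"vocoder_name", "log_interval", "eval_interval", "epochs",
          "learning_rate", "batch_size", "all_in_mem", "keep_ckpts"}

def walk(node, prefix=""):
    if isinstance(node, dict):
        for key, value in node.items():
            if key in FIELDS:
                print(f"{prefix}{key} = {value}")
            walk(value, f"{prefix}{key}.")

with open("configs/config.json", encoding="utf-8") as f:
    walk(json.load(f))
```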

#### diffusion.yaml

- `cache_all_data`: Load the entire dataset into memory. Enable this if disk IO is too slow on some platforms and the memory capacity is much larger than the dataset size.
- `duration`: Duration of audio slices during training. Adjust according to GPU memory size. **Note: This value must be less than the shortest audio duration in the training set!**
- `batch_size`: The amount of data loaded onto the GPU for each training step; adjust it to a value your GPU memory can hold.
- `timesteps`: Total steps of the diffusion model, default is 1000. A complete Gaussian diffusion has a total of 1000 steps.
- `k_step_max`: During training, only `k_step_max` steps of diffusion are trained, to save training time. Note that this value must be less than `timesteps`. `0` means training the entire diffusion model. **Note: If you do not train the entire diffusion model, it cannot be used for diffusion-only inference!**

### 2.4.3 Generating Hubert and F0

Use the following command (if you plan to train shallow diffusion, see the `--use_diff` variant below):

```bash
# The following command uses rmvpe as the f0 predictor, you can manually modify it
python preprocess_hubert_f0.py --f0_predictor rmvpe
```

The `f0_predictor` parameter has six options, and some F0 predictors require downloading additional preprocessing models. Please refer to **[2.2.3 Optional Items (Choose as Needed)](#223-optional-items-choose-as-needed)** for details.

```
crepe
dio
pm
harvest
rmvpe (recommended!)
fcpe
```

#### Pros and Cons of Each F0 Predictor

| Predictor | Pros | Cons |
| --- | --- | --- |
| pm | Fast, low resource consumption | Prone to producing breathy voice |
| crepe | Rarely produces breathy voice | High memory consumption, may produce out-of-tune voice |
| dio | - | May produce out-of-tune voice |
| harvest | Better performance in the lower pitch range | Inferior performance in other pitch ranges |
| rmvpe | Almost flawless, currently the most accurate predictor | Virtually no drawbacks (extremely low-pitched sounds may be problematic) |
| fcpe | Developed by the SVC team, currently the fastest predictor, with accuracy comparable to crepe | - |

> [!NOTE]
>
> 1. If the training set is too noisy, use crepe to process f0.
> 2. If you omit the `f0_predictor` parameter, the default value is rmvpe.

**If shallow diffusion functionality is required (optional), add the `--use_diff` parameter, for example:**

```bash
# The following command uses rmvpe as the f0 predictor, you can manually modify it
python preprocess_hubert_f0.py --f0_predictor rmvpe --use_diff
```

**If the processing speed is slow, or if your dataset is large, you can add the `--num_processes` parameter:**

```bash
# The following command uses rmvpe as the f0 predictor, you can manually change it
python preprocess_hubert_f0.py --f0_predictor rmvpe --num_processes 8
# All workers will be automatically assigned to multiple threads
```

After completing the above steps, the generated `dataset` directory contains the preprocessed data, and you can delete the `dataset_raw` folder if you wish.

## 2.5 Training

### 2.5.1 Main Model Training (Required)

Use the following command to train the main model. If training is interrupted, the same command resumes it from the most recent checkpoint.

```bash
python train.py -c configs/config.json -m 44k
```

### 2.5.2 Diffusion Model (Optional)

- A major update in So-VITS-SVC 4.1 is the introduction of the shallow diffusion mechanism, which converts the original SoVITS output audio into a Mel spectrogram, adds noise, performs shallow diffusion processing, and then outputs the audio through the vocoder. Testing has shown that **the quality of the output is significantly enhanced after the original output audio undergoes shallow diffusion processing, addressing issues such as electronic noise and background noise**.
- If shallow diffusion functionality is required, you need to train the diffusion model. Before training, ensure that you have downloaded and correctly placed `NSF-HIFIGAN` (**refer to [2.2.3 Optional Items](#223-optional-items-choose-as-needed)**), and that you added the `--use_diff` parameter when preprocessing to generate Hubert and F0 (**refer to [2.4.3 Generating Hubert and F0](#243-generating-hubert-and-f0)**).

To train the diffusion model, use the following command:

```bash
python train_diff.py -c configs/diffusion.yaml
```

After training is complete, the model files are saved in the `logs/44k` directory, and the diffusion model is saved in `logs/44k/diffusion`.
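
For orientation, the default pipeline from the sections above (vec768l12 encoder, rmvpe F0 predictor, with shallow diffusion) condenses to the following sequence. Every command is described in detail earlier; drop the two diffusion-related parts if you do not need shallow diffusion.

```bash
# One possible end-to-end run with the defaults used in this tutorial
python resample.py
python preprocess_flist_config.py --speech_encoder vec768l12
python preprocess_hubert_f0.py --f0_predictor rmvpe --use_diff
python train.py -c configs/config.json -m 44k          # main model
python train_diff.py -c configs/diffusion.yaml         # diffusion model (optional)
```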

> [!IMPORTANT]
>
> **How do you know when the model is trained well?**
>
> 1. This is a very boring and meaningless question. It's like asking a teacher how to make your child study well. Except for yourself, no one can answer this question.
> 2. The model training is related to the quality and duration of your dataset, the selected encoder, the f0 algorithm, and even some supernatural mystical factors. Even if you have a finished model, the final conversion effect depends on your input source and inference parameters. This is not a linear process, and there are too many variables involved. So, if you have to ask questions like "Why doesn't my model sound like the target?" or "How do you know when the model is trained well?", I can only say WHO F\*\*KING KNOWS?
> 3. But that doesn't mean there's no way. You just have to pray and worship. I don't deny that praying and worshiping is an effective method, but you can also use some scientific tools, such as Tensorboard. [2.5.3 Tensorboard](#253-tensorboard) below will teach you how to use Tensorboard to help assess the training status. **Of course, the most powerful tool is actually within yourself. How do you know when an acoustic model is trained well? Put on your headphones and let your ears tell you.**

**Relationship between Epoch and Step**:

During training, a model will be saved every specified number of steps (default is 800 steps, corresponding to the `eval_interval` value) based on the setting in your `config.json`.
It's important to distinguish between epochs and steps: 1 Epoch means all samples in the training set have been involved in one learning pass, while 1 Step means one learning step has been taken. Due to `batch_size`, each learning step can contain several samples. Therefore, the conversion between Epoch and Step is as follows:

$$
\text{Epoch} = \frac{\text{Step}}{\text{Number of samples in the dataset} \div \text{batch\_size}}
$$

For example, with 600 samples and a `batch_size` of 6, one epoch is 100 steps, so a checkpoint saved every 800 steps corresponds to one every 8 epochs.

The training will end after 10,000 epochs by default (you can increase or decrease this upper limit by modifying the value of the `epochs` field in `config.json`), but typically good results can be achieved after a few hundred epochs. When you feel that training is almost complete, you can interrupt it by pressing `Ctrl + C` in the training terminal. After interruption, as long as you haven't reprocessed the training set, you can **resume training from the most recent saved point**.

### 2.5.3 Tensorboard

You can use Tensorboard to visualize the trends of the loss values during training, listen to audio samples, and assist in judging the training status of the model. **However, for the So-VITS-SVC project, the loss values do not have practical reference significance (you don't need to compare or study the values themselves); the real reference is still listening to the inferred audio with your ears!**

- Use the following command to open Tensorboard:

```bash
tensorboard --logdir=./logs/44k
```

Tensorboard generates logs from the default evaluation every 200 steps during training. If training has not reached 200 steps, no images will appear in Tensorboard. The value of 200 can be modified by changing `log_interval` in `config.json`.

- Explanation of Losses

You don't need to understand the specific meaning of each loss. In general:

- `loss/g/f0`, `loss/g/mel`, and `loss/g/total` should oscillate and eventually converge to some value.
- `loss/g/kl` should oscillate at a low level.
- `loss/g/fm` should continue to rise in the middle of training, and in the later stages the upward trend should slow down or even start to decline.

> [!IMPORTANT]
>
> ✨ Observing the trends of the loss curves can help you judge the training status of the model. However, losses alone cannot be the sole criterion, **and in fact their reference value is not very significant. You still need to judge whether the model is trained well by listening to the inferred audio with your ears**.

> [!WARNING]
>
> 1. For small datasets (30 minutes or even less), it is not recommended to train for too long when loading the base model, so as to make the best use of the base model's advantages. The best results can be achieved within thousands or even hundreds of steps.
> 2. The audio samples in Tensorboard are generated from your validation set and **cannot represent the final performance of the model**.
Inference\n\n✨ Before inference, please prepare the dry audio you need for inference, ensuring it has no background noise\u002Freverb and is of good quality. You can use [UVR5](https:\u002F\u002Fgithub.com\u002FAnjok07\u002Fultimatevocalremovergui\u002Freleases\u002Ftag\u002Fv5.6) for processing to obtain the dry audio. Additionally, I've also created a [UVR5 vocal separation tutorial](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1F4421c7qU\u002F).\n\n## 3.1 Command-line Inference\n\nPerform inference using inference_main.py\n\n```bash\n# Example\npython inference_main.py -m \"logs\u002F44k\u002FG_30400.pth\" -c \"configs\u002Fconfig.json\" -n \"your_inference_audio.wav\" -t 0 -s \"speaker\"\n```\n\n**Required Parameters:**\n\n- `-m` | `--model_path`: Path to the model\n- `-c` | `--config_path`: Path to the configuration file\n- `-n` | `--clean_names`: List of wav file names, placed in the raw folder\n- `-t` | `--trans`: Pitch adjustment, supports positive and negative (in semitones)\n- `-s` | `--spk_list`: Name of the target speaker for synthesis\n- `-cl` | `--clip`: Audio forced clipping, default 0 for automatic clipping, unit in seconds\u002Fs.\n\n> [!NOTE]\n>\n> **Audio Clipping**\n>\n> - During inference, the clipping tool will split the uploaded audio into several small segments based on silence sections, and then combine them after inference to form the complete audio. This approach benefits from lower GPU memory usage for small audio segments, thus enabling the segmentation of long audio for inference to avoid GPU memory overflow. The clipping threshold parameter controls the minimum full-scale decibel value, and anything lower will be considered as silence and removed. Therefore, when the uploaded audio is noisy, you can set this parameter higher (e.g., -30), whereas for cleaner audio, a lower value (e.g., -50) can be set to avoid cutting off breath sounds and faint voices.\n>\n> - A recent test by the development team suggests that smaller clipping thresholds (e.g., -50) improve the clarity of the output, although the principle behind this is currently unclear.\n>\n> **Forced Clipping** `-cl` | `--clip`\n>\n> - During inference, the clipping tool may sometimes produce overly long audio segments when continuous vocal sections exist without silence for an extended period, potentially causing GPU memory overflow. The automatic audio clipping feature sets a maximum duration for audio segmentation. After the initial segmentation, if there are audio segments longer than this duration, they will be forcibly re-segmented at this duration to avoid memory overflow issues.\n> - Forced clipping may result in cutting off audio in the middle of a word, leading to discontinuity in the synthesized voice. 
You need to set the crossfade length for forced clipping in advanced settings to mitigate this issue.\n\n**Optional Parameters: See Next Section for Specifics**\n\n- `-lg` | `--linear_gradient`: Crossfade length of two audio clips, adjust this value if there are discontinuities in the voice after forced clipping, recommended to use default value 0, unit in seconds\n- `-f0p` | `--f0_predictor`: Choose F0 predictor, options are crepe, pm, dio, harvest, rmvpe, fcpe, default is pm (Note: crepe uses mean filter for original F0), refer to the advantages and disadvantages of different F0 predictors in [2.4.3 F0 Predictor Advantages and Disadvantages](#243-generating-hubert-and-f0)\n- `-a` | `--auto_predict_f0`: Automatically predict pitch during voice conversion, do not enable this when converting singing voices as it may severely mis-tune\n- `-cm` | `--cluster_model_path`: Path to clustering model or feature retrieval index, leave empty to automatically set to the default path of each solution model, fill in randomly if no clustering or feature retrieval is trained\n- `-cr` | `--cluster_infer_ratio`: Ratio of clustering solution or feature retrieval, range 0-1, defaults to 0 if no clustering model or feature retrieval is trained\n- `-eh` | `--enhance`: Whether to use the NSF_HIFIGAN enhancer, this option has a certain sound quality enhancement effect on models with a limited training set, but has a negative effect on well-trained models, default is off\n- `-shd` | `--shallow_diffusion`: Whether to use shallow diffusion, enabling this can solve some electronic sound problems, default is off, when this option is enabled, the NSF_HIFIGAN enhancer will be disabled\n- `-usm` | `--use_spk_mix`: Whether to use speaker blending\u002Fdynamic voice blending\n- `-lea` | `--loudness_envelope_adjustment`: Ratio of input source loudness envelope replacement to output loudness envelope fusion, the closer to 1, the more the output loudness envelope is used\n- `-fr` | `--feature_retrieval`: Whether to use feature retrieval, if a clustering model is used, it will be disabled, and the cm and cr parameters will become the index path and mixing ratio of feature retrieval\n\n> [!NOTE]\n>\n> **Clustering Model\u002FFeature Retrieval Mixing Ratio** `-cr` | `--cluster_infer_ratio`\n>\n> - This parameter controls the proportion of linear involvement when using clustering models\u002Ffeature retrieval models. Clustering models and feature retrieval models can both slightly improve timbre similarity, but at the cost of reducing accuracy in pronunciation (feature retrieval has slightly better pronunciation than clustering). The range of this parameter is 0-1, where 0 means it is not enabled, and the closer to 1, the more similar the timbre and the blurrier the pronunciation.\n> - Clustering models and feature retrieval share this parameter. 
When loading models, the model used will be controlled by this parameter.\n> - **Note that when clustering models or feature retrieval models are not loaded, please keep this parameter as 0, otherwise an error will occur.**\n\n**Shallow Diffusion Settings:**\n\n- `-dm` | `--diffusion_model_path`: Diffusion model path\n- `-dc` | `--diffusion_config_path`: Diffusion model configuration file path\n- `-ks` | `--k_step`: Number of diffusion steps, larger values are closer to the result of the diffusion model, default is 100\n- `-od` | `--only_diffusion`: Pure diffusion mode, this mode does not load the sovits model and performs inference based only on the diffusion model\n- `-se` | `--second_encoding`: Secondary encoding, the original audio will be encoded a second time before shallow diffusion, a mysterious option, sometimes it works well, sometimes it doesn't\n\n> [!NOTE]\n>\n> **About Shallow Diffusion Steps** `-ks` | `--k_step`\n>\n> The complete Gaussian diffusion takes 1000 steps. When the number of shallow diffusion steps reaches 1000, the output result at this point is entirely the output result of the diffusion model, and the So-VITS model will be suppressed. The higher the number of shallow diffusion steps, the closer it is to the output result of the diffusion model. **If you only want to use shallow diffusion to remove electronic noise while preserving the timbre of the So-VITS model as much as possible, the number of shallow diffusion steps can be set to 30-50**\n\n> [!WARNING]\n>\n> If using the `whisper-ppg` voice encoder for inference, `--clip` should be set to 25, `-lg` should be set to 1. Otherwise, inference will not work properly.\n\n## 3.2 webUI Inference\n\nUse the following command to open the webUI interface, **upload and load the model, fill in the inference as needed according to the instructions, upload the inference audio, and start the inference.**\n\nThe detailed explanation of the inference parameters is the same as the [3.1 Command-line Inference](#31-command-line-inference) parameters, but moved to the interactive interface with simple instructions.\n\n```bash\npython webUI.py\n```\n\n> [!WARNING]\n>\n> **Be sure to check [Command-line Inference](#31-command-line-inference) to understand the meanings of specific parameters. Pay special attention to the reminders in NOTE and WARNING!**\n\nThe webUI also has a built-in **text-to-speech (TTS)** function:\n\n- Text-to-speech uses Microsoft's edge_TTS service to generate a piece of original speech, and then converts the voice of this speech to the target voice using So-VITS. So-VITS can only achieve voice conversion (SVC) for singing voices, and does not have any **native** text-to-speech (TTS) function! Since the speech generated by Microsoft's edge_TTS is relatively stiff and lacks emotion, all converted audio will also reflect this. **If you need a TTS function with emotions, please visit the [GPT-SoVITS](https:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS) project.**\n- Currently, text-to-speech supports a total of 55 languages, covering most common languages. The program will automatically recognize the language based on the text entered in the text box and convert it.\n- Automatic recognition can only recognize the language, and certain languages may have different accents, speakers. If automatic recognition is used, the program will randomly select one from the speakers that fit the language and specified gender for conversion. 
If your target language has multiple accents or speakers (e.g., English), it is recommended to manually specify one speaker with a specific accent. If a speaker is manually specified, the previously manually selected gender will be suppressed.\n\n# 4. Optional Enhancements\n\n✨ If you are satisfied with the previous effects, or didn't quite understand what's being discussed below, you can ignore the following content without affecting model usage (these optional enhancements have relatively minor effects, and may only have some effect on specific data, but in most cases, the effect is not very noticeable).\n\n## 4.1 Automatic F0 Prediction\n\nDuring model training, an F0 predictor is trained, which is an automatic pitch shifting function that can match the model pitch to the source pitch during inference, useful for voice conversion where it can better match the pitch. **However, do not enable this feature when converting singing voices!!! It will severely mis-tune!!!**\n\n- Command-line Inference: Set `auto_predict_f0` to `true` in `inference_main`.\n- WebUI Inference: Check the corresponding option.\n\n## 4.2 Clustering Timbre Leakage Control\n\nClustering schemes can reduce timbre leakage, making the model output more similar to the target timbre (though not very obvious). However, using clustering alone can reduce the model's pronunciation clarity (making it unclear), this model adopts a fusion approach, allowing linear control of the proportion of clustering schemes and non-clustering schemes. In other words, you can manually adjust the ratio between \"similar to target timbre\" and \"clear pronunciation\" to find a suitable balance.\n\nUsing clustering does not require any changes to the existing steps mentioned earlier, just need to train an additional clustering model, although the effect is relatively limited, the training cost is also relatively low.\n\n- Training Method:\n\n```bash\n# Train using CPU:\npython cluster\u002Ftrain_cluster.py\n# Or train using GPU:\npython cluster\u002Ftrain_cluster.py --gpu\n```\n\nAfter training, the model output will be saved in `logs\u002F44k\u002Fkmeans_10000.pt`\n\n- During Command-line Inference:\n  - Specify `cluster_model_path` in `inference_main.py`\n  - Specify `cluster_infer_ratio` in `inference_main.py`, where `0` means not using clustering at all, `1` means only using clustering. Usually, setting `0.5` is sufficient.\n- During WebUI Inference:\n  - Upload and load the clustering model.\n  - Set the clustering model\u002Ffeature retrieval mixing ratio, between 0-1, where 0 means not using clustering\u002Ffeature retrieval at all. Using clustering\u002Ffeature retrieval can improve timbre similarity but may result in reduced pronunciation clarity (if used, it's recommended to set around 0.5).\n\n## 4.3 Feature Retrieval\n\nSimilar to clustering schemes, feature retrieval can also reduce timbre leakage, with slightly better pronunciation clarity than clustering, but it may reduce inference speed. 
It also adopts a fusion approach, allowing linear control of the ratio between feature retrieval and non-feature retrieval.\n\n- Training Process: After generating hubert and f0, execute:\n\n```bash\npython train_index.py -c configs\u002Fconfig.json\n```\n\nAfter training, the model output will be saved in `logs\u002F44k\u002Ffeature_and_index.pkl`.\n\n- During Command-line Inference:\n  - Specify `--feature_retrieval` first, and the clustering scheme will automatically switch to the feature retrieval scheme\n  - Specify `cluster_model_path` in `inference_main.py` as the model output file\n  - Specify `cluster_infer_ratio` in `inference_main.py`, where `0` means not using feature retrieval at all, `1` means only using feature retrieval. Usually, setting `0.5` is sufficient.\n- During WebUI Inference:\n  - Upload and load the clustering model\n  - Set the clustering model\u002Ffeature retrieval mixing ratio, between 0-1, where 0 means not using clustering\u002Ffeature retrieval at all. Using clustering\u002Ffeature retrieval can improve timbre similarity but may result in reduced pronunciation clarity (if used, it's recommended to set around 0.5)\n\n## 4.4 Vocoder Fine-tuning\n\nWhen using the diffusion model in So-VITS, the Mel spectrogram enhanced by the diffusion model is output as the final audio through the vocoder. The vocoder plays a decisive role in the sound quality of the output audio. So-VITS-SVC currently uses the [NSF-HiFiGAN community vocoder](https:\u002F\u002Fopenvpi.github.io\u002Fvocoders\u002F). In fact, you can also fine-tune this vocoder model with your own dataset to better suit your model task in the **diffusion process** of So-VITS.\n\nThe [SingingVocoders](https:\u002F\u002Fgithub.com\u002Fopenvpi\u002FSingingVocoders) project provides methods for fine-tuning the vocoder. In the Diffusion-SVC project, **using a fine-tuned vocoder can greatly enhance the output sound quality**. You can also train a fine-tuned vocoder with your own dataset and use it with this project.\n\n1. Train a fine-tuned vocoder using [SingingVocoders](https:\u002F\u002Fgithub.com\u002Fopenvpi\u002FSingingVocoders) and obtain its model and configuration files.\n2. Place the model and configuration files under `pretrain\u002F{fine-tuned vocoder name}\u002F`.\n3. Modify the diffusion model configuration file `diffusion.yaml` of the corresponding model as follows:\n\n```yaml\nvocoder:\n  ckpt: pretrain\u002Fnsf_hifigan\u002Fmodel.ckpt # This line is the path to your fine-tuned vocoder model\n  type: nsf-hifigan # This line is the type of your fine-tuned vocoder, do not modify if unsure\n```\n\n4. Following [3.2 webUI Inference](#32-webui-inference), upload the diffusion model and the **modified diffusion model configuration file** to use the fine-tuned vocoder.\n\n> [!WARNING]\n>\n> **Currently, only the NSF-HiFiGAN vocoder supports fine-tuning.**\n\n## 4.5 Directories for Saved Models\n\nUp to the previous section, a total of 4 types of models that can be trained have been covered. The following table summarizes these four types of models and their configuration files.\n\nIn the webUI, in addition to uploading and loading models, you can also read local model files. You just need to put these models into a folder first, and then put the folder into the `trained` folder. Click \"Refresh Local Model List\", and the webUI will recognize it. Then manually select the model you want to load. 
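\n\nFor illustration, a local model folder placed under the `trained` folder might be organized like the sketch below (the folder name `MySpeaker` and the step numbers in the filenames are hypothetical; only the So-VITS model and its `config.json` are required, and the expected filenames are summarized in the table that follows):\n\n```\ntrained\n└───MySpeaker\n    ├───G_30400.pth\n    ├───config.json\n    ├───model_10000.pt\n    ├───diffusion.yaml\n    ├───kmeans_10000.pt\n    └───feature_and_index.pkl\n```\n\n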
**Note**: Automatic loading of local models may not work properly for the (optional) models in the table below.\n\n| File                                          | Filename and Extension  | Location             |\n| --------------------------------------------- | ----------------------- | -------------------- |\n| So-VITS Model                                 | `G_xxxx.pth`            | `logs\u002F44k`           |\n| So-VITS Model Configuration File              | `config.json`           | `configs`            |\n| Diffusion Model (Optional)                    | `model_xxxx.pt`         | `logs\u002F44k\u002Fdiffusion` |\n| Diffusion Model Configuration File (Optional) | `diffusion.yaml`        | `configs`            |\n| Kmeans Clustering Model (Optional)            | `kmeans_10000.pt`       | `logs\u002F44k`           |\n| Feature Retrieval Model (Optional)            | `feature_and_index.pkl` | `logs\u002F44k`           |\n\n# 5. Other Optional Features\n\n✨ This part is less important compared to the previous sections. Except for [5.1 Model Compression](#51-model-compression), which is a more convenient feature, the probability of using the other optional features is relatively low. Therefore, only references to the official documentation and brief descriptions are provided here.\n\n## 5.1 Model Compression\n\nThe generated models contain information needed for further training. If you are **sure not to continue training**, you can remove this part of the information from the model to obtain a final model that is about 1\u002F3 of the size.\n\nUse `compress_model.py`\n\n```bash\n# For example, if I want to compress a model named G_30400.pth under the logs\u002F44k\u002F directory, and the configuration file is configs\u002Fconfig.json, I can run the following command\npython compress_model.py -c=\"configs\u002Fconfig.json\" -i=\"logs\u002F44k\u002FG_30400.pth\" -o=\"logs\u002F44k\u002Frelease.pth\"\n# The compressed model is saved in logs\u002F44k\u002Frelease.pth\n```\n\n> [!WARNING]\n>\n> **Note: Compressed models cannot be further trained!**\n\n## 5.2 Voice Mixing\n\n### 5.2.1 Static Voice Mixing\n\n**Refer to the static voice mixing feature in the `webUI.py` file under the Tools\u002FExperimental Features.**\n\nThis feature can combine multiple voice models into one voice model (convex combination or linear combination of multiple model parameters), thus creating voice characteristics that do not exist in reality.\n\n**Note:**\n\n1. This feature only supports single-speaker models.\n2. If you forcibly use multi-speaker models, you need to ensure that the number of speakers in multiple models is the same, so that voices under the same SpaekerID can be mixed.\n3. Ensure that the `model` field in the config.json of all models to be mixed is the same.\n4. The output mixed model can use any config.json of the models to be mixed, but the clustering model will not be available.\n5. When uploading models in batches, it is better to put the models in a folder and upload them together.\n6. It is recommended to adjust the mixing ratio between 0 and 100. Other numbers can also be adjusted, but unknown effects may occur in linear combination mode.\n7. After mixing, the file will be saved in the project root directory with the filename output.pth.\n8. 
Convex combination mode will execute Softmax on the mixing ratio to ensure that the sum of mixing ratios is 1, while linear combination mode will not.\n\n### 5.2.2 Dynamic Voice Mixing\n\n**Refer to the introduction of dynamic voice mixing in the `spkmix.py` file.**\n\nRules for mixing role tracks:\n\n- Speaker ID: \\[\\[Start Time 1, End Time 1, Start Value 1, End Value 1], [Start Time 2, End Time 2, Start Value 2, End Value 2]]\n- The start time must be the same as the end time of the previous one, and the first start time must be 0, and the last end time must be 1 (the time range is 0-1).\n- All roles must be filled in, and roles that are not used can be filled with \\[\\[0., 1., 0., 0.]].\n- The fusion value can be filled arbitrarily. Within the specified time range, it changes linearly from the start value to the end value. The internal will automatically ensure that the linear combination is 1 (convex combination condition), so you can use it with confidence.\n\nUse the `--use_spk_mix` parameter during command line inference to enable dynamic voice mixing. Check the \"Dynamic Voice Mixing\" option box during webUI inference.\n\n## 5.3 Onnx Export\n\nUse `onnx_export.py`. Currently, only [MoeVoiceStudio](https:\u002F\u002Fgithub.com\u002FNaruseMioShirakana\u002FMoeVoiceStudio) requires the use of onnx models. For more detailed operations and usage methods, please refer to the [MoeVoiceStudio](https:\u002F\u002Fgithub.com\u002FNaruseMioShirakana\u002FMoeVoiceStudio) repository instructions.\n\n- Create a new folder: `checkpoints` and open it\n- In the `checkpoints` folder, create a folder as the project folder, named after your project, such as `aziplayer`\n- Rename your model to `model.pth` and the configuration file to `config.json`, and place them in the `aziplayer` folder you just created\n- Change `\"NyaruTaffy\"` in `path = \"NyaruTaffy\"` in `onnx_export.py` to your project name, `path = \"aziplayer\" (onnx_export_speaker_mix, for onnx export supporting role mixing)`\n- Run `python onnx_export.py`\n- Wait for execution to complete. A `model.onnx` will be generated in your project folder, which is the exported model.\n\nNote: Use onnx models provided by [MoeVoiceStudio](https:\u002F\u002Fgithub.com\u002FNaruseMioShirakana\u002FMoeVoiceStudio) for Hubert Onnx models. Currently, it cannot be exported independently (Hubert in fairseq has many operators not supported by onnx and involves constants, which will cause errors or problems with the input and output shapes and results during export).\n\n# 6. Simple Mixing and Exporting Finished Product\n\n### Use Audio Host Software to Process Inferred Audio\n\nPlease refer to the [corresponding video tutorial](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Hr4y197Cy\u002F) or other professional mixing tutorials for details on how to handle and enhance the inferred audio using audio host software.\n\n# Appendix: Common Errors and Solutions\n\n✨ **Some error solutions are credited to [羽毛布団](https:\u002F\u002Fspace.bilibili.com\u002F3493141443250876)'s [related column](https:\u002F\u002Fwww.bilibili.com\u002Fread\u002Fcv22206231) | [related documentation](https:\u002F\u002Fwww.yuque.com\u002Fumoubuton\u002Fueupp5\u002Fieinf8qmpzswpsvr)**\n\n## About Out of Memory (OOM)\n\nIf you encounter an error like this in the terminal or WebUI:\n\n```bash\nOutOfMemoryError: CUDA out of memory. 
Tried to allocate XX GiB (GPU 0: XX GiB total capacity; XX GiB already allocated; XX MiB Free; XX GiB reserved in total by PyTorch)\n```\n\nDon't doubt it: your GPU memory or virtual memory is insufficient. The following steps will resolve the problem; follow them in order, and please avoid asking this question in various places, as the solution is well documented.\n\n1. In the error message, check whether `XX GiB already allocated` is followed by `0 bytes free`. If it shows `0 bytes free`, follow steps 2, 3, and 4. If it shows `XX MiB free` or `XX GiB free`, follow step 5.\n2. If the out of memory occurs during preprocessing:\n   - Use a GPU-friendly F0 predictor (from highest to lowest friendliness: pm >= harvest >= rmvpe ≈ fcpe >> crepe). It is recommended to use rmvpe or fcpe first.\n   - Set multi-process preprocessing to 1.\n3. If the out of memory occurs during training:\n   - Check if there are any excessively long clips in the dataset (more than 20 seconds).\n   - Reduce the batch size.\n   - Use a project with lower resource requirements.\n   - Rent a GPU with larger memory from platforms like Google Colab for training.\n4. If the out of memory occurs during inference:\n   - Ensure the source audio (dry vocal) is clean (no residual reverb, accompaniment, or harmony) as dirty sources can hinder automatic slicing. Refer to the [UVR5 vocal separation tutorial](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1F4421c7qU\u002F) for best practices.\n   - Increase the slicing threshold (e.g., change from -40 to -30; avoid going too high as it can cut the audio abruptly).\n   - Set forced slicing, starting from 60 seconds and decreasing by 10 seconds each time until inference succeeds.\n   - Use CPU for inference, which will be slower but won't encounter out of memory issues.\n5. If there is still available memory but the out of memory error persists, increase your virtual memory to at least 50 GB.\n\nThese steps should help you manage and resolve out of memory errors effectively, ensuring smooth operation during preprocessing, training, and inference.\n\n## Common Errors and Solutions When Installing Dependencies\n\n**1. Error When Installing PyTorch with CUDA=11.7**\n\n```\nERROR: Package 'networkx' requires a different Python: 3.8.9 not in '>=3.9'\n```\n\n**Solutions:**\n\n- **Upgrade Python to 3.10.**\n- **(Recommended) Keep the Python version the same:** First, install `networkx` version 3.0 before installing PyTorch.\n\n```bash\npip install networkx==3.0\n# Then proceed with the installation of PyTorch.\n```\n\n**2. Dependency Not Found**\n\nIf you encounter errors similar to:\n\n```bash\nERROR: Could not find a version that satisfies the requirement librosa==0.9.1 (from versions: none)\nERROR: No matching distribution found for librosa==0.9.1\n# Key characteristics of the error:\nNo matching distribution found for xxxxx\nCould not find a version that satisfies the requirement xxxx\n```\n\n**Solution:** Change the installation source. Add a download source when manually installing the dependency.\n\nUse the command `pip install [package_name] -i [source_url]`. For example, to download `librosa` version 0.9.1 from the Alibaba mirror, use the following command:\n\n```bash\npip install librosa==0.9.1 -i http:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple\n```\n\n**3. Certain dependencies cannot be installed due to a high pip version**\n\nOn June 21, 2024, pip was updated to version 24.1. 
Simply using `pip install --upgrade pip` will update pip to version 24.1. However, some dependencies require pip 23.0 to be installed, necessitating a manual downgrade of the pip version. It is currently known that hydra-core, omegaconf, and fastapi are affected by this. The specific error encountered during installation is as follows:\n\n```bash\nPlease use pip\u003C24.1 if you need to use this version.\nINFO: pip is looking at multiple versions of hydra-core to determine which version is compatible with other requirements. This could take a while.\nERROR: Cannot install -r requirements.txt (line 20) and fairseq because these package versions have conflicting dependencies.\n\nThe conflict is caused by:\n    fairseq 0.12.2 depends on omegaconf\u003C2.1\n    hydra-core 1.0.7 depends on omegaconf\u003C2.1 and >=2.0.5\n\nTo fix this you could try to:\n1. loosen the range of package versions you've specified\n2. remove package versions to allow pip to attempt to solve the dependency conflict\n```\n\nThe solution is to limit the pip version before installing dependencies as described in [1.5 Installation of Other Dependencies](#15-installation-of-other-dependencies). Use the following command to limit the pip version:\n\n```bash\npip install --upgrade pip==23.3.2 wheel setuptools\n```\n\nAfter running this command, proceed with the installation of other dependencies.\n\n## Common Errors During Dataset Preprocessing and Model Training\n\n**1. Error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position xx`**\n\n- Ensure that dataset filenames do not contain non-Western characters such as Chinese or Japanese, especially Chinese punctuation marks like brackets, commas, colons, semicolons, quotes, etc. After renaming, **reprocess the dataset** and then proceed with training.\n\n**2. Error: `The expand size of the tensor (768) must match the existing size (256) at non-singleton dimension 0.`**\n\n- Delete all contents under `dataset\u002F44k` and redo the preprocessing steps.\n\n**3.Error: `RuntimeError: DataLoader worker (pid(s) 13920) exited unexpectedly`**\n\n```bash\nraise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e\nRuntimeError: DataLoader worker (pid(s) 13920) exited unexpectedly\n```\n\n- Reduce the `batchsize` value, increase virtual memory, and restart the computer to clear GPU memory until the `batchsize` value and virtual memory are appropriate and do not cause errors.\n\n**4. Error: `torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 3221225477`**\n\n- Increase virtual memory and reduce the `batchsize` value until the `batchsize` value and virtual memory are appropriate and do not cause errors.\n\n**5. Error: `AssertionError: CPU training is not allowed.`**\n\n- **No solution:** Training without an NVIDIA GPU is not supported. For beginners, the straightforward answer is that training without an NVIDIA GPU is not feasible.\n\n**6. Error: `FileNotFoundError: No such file or directory: 'pretrain\u002Frmvpe.pt'`**\n\n- If you run `python preprocess_hubert_f0.py --f0_predictor rmvpe --use_diff` and encounter `FileNotFoundError: No such file or directory: 'pretrain\u002Frmvpe.pt'`, this is because the official documentation updated the rmvpe preprocessor for F0 processing. Refer to the tutorial documentation [#2.2.3](#223-optional-items-choose-as-needed) to download the preprocessing model `rmvpe.pt` and place it in the corresponding directory.\n\n**7. 
Error: \"Page file is too small to complete the operation.\"**\n\n- **Solution:** Increase the virtual memory. You can find detailed instructions online for your specific operating system.\n\n## Errors When Using WebUI\\*\\*\n\n**1. Errors When Starting or Loading Models in WebUI**\n\n- **Error When Starting WebUI:** `ImportError: cannot import name 'Schema' from 'pydantic'`\n- **Error When Loading Models in WebUI:** `AttributeError(\"'Dropdown' object has no attribute 'update'\")`\n- **Errors Related to Dependencies:** If the error involves `fastapi`, `gradio`, or `pydantic`.\n\n**Solution:**\n\n- Some dependencies need specific versions. After installing `requirements_win.txt`, enter the following commands in the command prompt to update the packages:\n\n```bash\npip install --upgrade fastapi==0.84.0\npip install --upgrade gradio==3.41.2\npip install --upgrade pydantic==1.10.12\n```\n\n**2. Error: `Given groups=1, weight of size [xxx, 256, xxx], expected input[xxx, 768, xxx] to have 256 channels, but got 768 channels instead`**\n\nor **Error: Encoder and model dimensions do not match in the configuration file**\n\n- **Cause:** A v1 branch model is using a `vec768` configuration file, or vice versa.\n- **Solution:** Check the `ssl_dim` setting in your configuration file. If `ssl_dim` is 256, the `speech_encoder` should be `vec256|9`. If it is 768, it should be `vec768|12`.\n- For detailed instructions, refer to [#2.1](#21-issues-regarding-compatibility-with-the-40-model).\n\n**3. Error: `'HParams' object has no attribute 'xxx'`**\n\n- **Cause:** Usually, this indicates that the timbre cannot be found and the configuration file does not match the model.\n- **Solution:** Open the configuration file and scroll to the bottom to check if it includes the timbre you trained.\n\n# Acknowledgements\n\nWe would like to extend our heartfelt thanks to the following individuals and organizations whose contributions and resources have made this project possible:\n\n- **so-vits-svc** | [so-vits-svc GitHub Repository](https:\u002F\u002Fgithub.com\u002Fsvc-develop-team\u002Fso-vits-svc)\n- **GPT-SoVITS** | [GPT-SoVITS GitHub Repository](https:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS)\n- **SingingVocoders** | [SingingVocoders GitHub Repository](https:\u002F\u002Fgithub.com\u002Fopenvpi\u002FSingingVocoders)\n- **MoeVoiceStudio** | [MoeVoiceStudio GitHub Repository](https:\u002F\u002Fgithub.com\u002FNaruseMioShirakana\u002FMoeVoiceStudio)\n- **OpenVPI** | [OpenVPI GitHub Organization](https:\u002F\u002Fgithub.com\u002Fopenvpi) | [Vocoders GitHub Repository](https:\u002F\u002Fgithub.com\u002Fopenvpi\u002Fvocoders)\n- **Up 主 [infinite_loop]** | [Bilibili Profile](https:\u002F\u002Fspace.bilibili.com\u002F286311429) | [Related Video](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Bd4y1W7BN) | [Related Column](https:\u002F\u002Fwww.bilibili.com\u002Fread\u002Fcv21425662)\n- **Up 主 [羽毛布団]** | [Bilibili Profile](https:\u002F\u002Fspace.bilibili.com\u002F3493141443250876) | [Error Resolution Guide](https:\u002F\u002Fwww.bilibili.com\u002Fread\u002Fcv22206231) | [Common Error Solutions](https:\u002F\u002Fwww.yuque.com\u002Fumoubuton\u002Fueupp5\u002Fieinf8qmpzswpsvr)\n- **All Contributors of Training Audio Samples**\n- **You** - For your interest, support, and contributions.\n","\u003Cdiv align=\"center\">\n\n# SoftVC VITS 歌唱语音转换本地部署教程\n[![在 Google Colab 
中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FSUC-DriverOld\u002Fso-vits-svc-Deployment-Documents\u002Fblob\u002F4.1\u002Fsovits4_for_colab.ipynb) \u003Cbr>\nEnglish | [简体中文](README_zh_CN.md) \u003Cbr>\n本帮助文档提供了项目 [so-vits-svc](https:\u002F\u002Fgithub.com\u002Fsvc-develop-team\u002Fso-vits-svc) 的详细安装、调试和推理教程。您也可以直接参考官方的 [README](https:\u002F\u002Fgithub.com\u002Fsvc-develop-team\u002Fso-vits-svc#readme) 文档。\u003Cbr>\n由 Sucial 撰写。[Bilibili](https:\u002F\u002Fspace.bilibili.com\u002F445022409) | [Github](https:\u002F\u002Fgithub.com\u002FSUC-DriverOld)\n\n\u003C\u002Fdiv>\n\n---\n\n✨ **点击查看：[配套视频教程](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Hr4y197Cy\u002F) | [UVR5 人声分离教程](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1F4421c7qU\u002F)（注：配套视频可能已过时，请以最新教程文档为准！）**\n\n✨ **相关资源：[官方 README 文档](https:\u002F\u002Fgithub.com\u002Fsvc-develop-team\u002Fso-vits-svc) | [常见错误解决方案](https:\u002F\u002Fwww.yuque.com\u002Fumoubuton\u002Fueupp5\u002Fieinf8qmpzswpsvr) | [羽毛布団](https:\u002F\u002Fspace.bilibili.com\u002F3493141443250876)**\n\n> [!IMPORTANT]\n>\n> 1. **重要！请先阅读！** 如果您不想手动配置环境或正在寻找集成包，请使用 [羽毛布団](https:\u002F\u002Fspace.bilibili.com\u002F3493141443250876) 提供的集成包。\n> 2. **关于旧版本教程**：如需 so-vits-svc3.0 版本的教程，请切换到 [3.0 分支](https:\u002F\u002Fgithub.com\u002FSUC-DriverOld\u002Fso-vits-svc-Chinese-Detaild-Documents\u002Ftree\u002F3.0)。该分支已不再更新！\n> 3. **持续改进文档**：如果您遇到本文未提及的错误，可在 issues 区域提问。对于项目本身的 bug，请向原项目提交 issue。如果您希望改进本教程，欢迎提交 PR！\n\n# 教程目录\n\n- [SoftVC VITS 歌唱语音转换本地部署教程](#softvc-vits-singing-voice-conversion-local-deployment-tutorial)\n- [教程目录](#tutorial-index)\n- [0. 使用前须知](#0-before-you-use)\n    - [任何国家、地区、组织或个人使用本项目必须遵守以下法律：](#any-country-region-organization-or-individual-using-this-project-must-comply-with-the-following-laws)\n      - [《民法典》](#民法典)\n      - [第一千零一十九条](#第一千零一十九条)\n      - [第一千零二十四条](#第一千零二十四条)\n      - [第一千零二十七条](#第一千零二十七条)\n      - [《中华人民共和国宪法》|《中华人民共和国刑法》|《中华人民共和国民法典》|《中华人民共和国合同法》](#中华人民共和国宪法中华人民共和国刑法中华人民共和国民法典中华人民共和国合同法)\n  - [0.1 使用规范](#01-usage-regulations)\n  - [0.2 硬件要求](#02-hardware-requirements)\n  - [0.3 准备工作](#03-preparation)\n- [1. 环境依赖](#1-environment-dependencies)\n  - [1.1 so-vits-svc4.1 源代码](#11-so-vits-svc41-source-code)\n  - [1.2 Python](#12-python)\n  - [1.3 Pytorch](#13-pytorch)\n  - [1.4 其他依赖的安装](#14-installation-of-other-dependencies)\n  - [1.5 FFmpeg](#15-ffmpeg)\n- [2. 
配置与训练](#2-configuration-and-training)\n  - [2.1 关于与 4.0 模型兼容性的注意事项](#21-issues-regarding-compatibility-with-the-40-model)\n  - [2.2 预下载的模型文件](#22-pre-downloaded-model-files)\n    - [2.2.1 必备项](#221-mandatory-items)\n      - [各编码器的详细说明](#detailed-explanation-of-each-encoder)\n    - [2.2.2 预训练的基础模型（强烈推荐）](#222-pre-trained-base-model-strongly-recommended)\n    - [2.2.3 可选项（根据需要选择）](#223-optional-items-choose-as-needed)\n  - [2.3 数据准备](#23-data-preparation)\n  - [2.4 数据预处理](#24-data-preprocessing)\n    - [2.4.0 音频切片](#240-audio-slicing)\n    - [2.4.1 重采样至 44100Hz 单声道](#241-resampling-to-44100hz-mono)\n    - [2.4.2 自动划分数据集并生成配置文件](#242-automatic-dataset-splitting-and-configuration-file-generation)\n      - [使用响度嵌入](#using-loudness-embedding)\n    - [2.4.3 根据需要修改配置文件](#243-modify-configuration-files-as-needed)\n      - [config.json](#configjson)\n      - [diffusion.yaml](#diffusionyaml)\n    - [2.4.3 生成 Hubert 和 F0](#243-generating-hubert-and-f0)\n      - [各 F0 预测器的优缺点](#pros-and-cons-of-each-f0-predictor)\n  - [2.5 训练](#25-training)\n    - [2.5.1 主模型训练（必选）](#251-main-model-training-required)\n    - [2.5.2 扩散模型（可选）](#252-diffusion-model-optional)\n    - [2.5.3 Tensorboard](#253-tensorboard)\n- [3. 推理](#3-inference)\n  - [3.1 命令行推理](#31-command-line-inference)\n  - [3.2 webUI 推理](#32-webui-inference)\n- [4. 可选增强功能](#4-optional-enhancements)\n  - [4.1 自动 F0 预测](#41-automatic-f0-prediction)\n  - [4.2 聚类音色泄漏控制](#42-clustering-timbre-leakage-control)\n  - [4.3 特征检索](#43-feature-retrieval)\n  - [4.4 语音合成器微调](#44-vocoder-fine-tuning)\n  - [4.5 保存模型的目录](#45-directories-for-saved-models)\n- [5. 其他可选功能](#5-other-optional-features)\n  - [5.1 模型压缩](#51-model-compression)\n  - [5.2 人声混合](#52-voice-mixing)\n    - [5.2.1 静态人声混合](#521-static-voice-mixing)\n    - [5.2.2 动态人声混合](#522-dynamic-voice-mixing)\n  - [5.3 Onnx 导出](#53-onnx-export)\n- [6. 简单混音及成品导出](#6-simple-mixing-and-exporting-finished-product)\n    - [使用音频宿主软件处理推理后的音频](#use-audio-host-software-to-process-inferred-audio)\n- [附录：常见错误及解决方案](#appendix-common-errors-and-solutions)\n  - [关于内存不足 (OOM)](#about-out-of-memory-oom)\n  - [安装依赖时常见的错误及解决方案](#common-errors-and-solutions-when-installing-dependencies)\n  - [数据预处理和模型训练中的常见错误](#common-errors-during-dataset-preprocessing-and-model-training)\n  - [使用 webUI 时的错误\\*\\*](#errors-when-using-webui)\n- [致谢](#acknowledgements)\n\n# 0. 使用前须知\n\n### 任何国家、地区、组织或个人使用本项目时，必须遵守以下法律：\n\n#### 《[民法典](http:\u002F\u002Fgongbao.court.gov.cn\u002FDetails\u002F51eb6750b8361f79be8f90d09bc202.html)》\n\n#### 第一千零一十九条\n\n任何组织或者个人**不得**以丑化、污损，或者利用信息技术手段伪造等方式侵害他人的肖像权。**未经**肖像权人同意，**不得**制作、使用、公开肖像权人的肖像，但是法律另有规定的除外。**未经**肖像权人同意，肖像作品权利人不得以发表、复制、发行、出租、展览等方式使用或者公开肖像权人的肖像。对自然人声音的保护，参照适用肖像权保护的有关规定。\n**对自然人声音的保护，参照适用肖像权保护的有关规定**\n\n#### 第一千零二十四条\n\n【名誉权】民事主体享有名誉权。任何组织或者个人**不得**以侮辱、诽谤等方式侵害他人的名誉权。\n\n#### 第一千零二十七条\n\n【作品侵害名誉权】行为人发表的文学、艺术作品以真人真事或者特定人为描述对象，含有侮辱、诽谤内容，侵害他人名誉权的，受害人有权依法请求该行为人承担民事责任。行为人发表的文学、艺术作品不以特定人为描述对象，仅其中的情节与该特定人的情况相似的，不承担民事责任。\n\n#### 《[中华人民共和国宪法](http:\u002F\u002Fwww.gov.cn\u002Fguoqing\u002F2018-03\u002F22\u002Fcontent_5276318.htm)》|《[中华人民共和国刑法](http:\u002F\u002Fgongbao.court.gov.cn\u002FDetails\u002Ff8e30d0689b23f57bfc782d21035c3.html?sw=中华人民共和国刑法)》|《[中华人民共和国民法典](http:\u002F\u002Fgongbao.court.gov.cn\u002FDetails\u002F51eb6750b8361f79be8f90d09bc202.html)》|《[中华人民共和国合同法](http:\u002F\u002Fwww.npc.gov.cn\u002Fzgrdw\u002Fnpc\u002Flfzt\u002Frlyw\u002F2016-07\u002F01\u002Fcontent_1992739.htm)》\n\n## 0.1 使用规定\n\n> [!WARNING]\n>\n> 1. **本教程仅用于交流和学习目的，请勿将其用于非法活动、违反公共秩序或其他不道德的目的。为尊重音频来源提供者，请勿将本教程用于不当用途。**\n> 2. 
**继续使用本教程即表示您同意此处所述的相关规定。本教程仅履行其指导义务，不对后续可能出现的问题承担责任。**\n> 3. **请自行解决数据集授权问题。请勿使用未经授权的数据集进行训练！因使用未经授权的数据集而产生的任何问题均由您自行负责，与本仓库、仓库维护者、svc开发团队或教程作者无关。**\n\n具体使用规定如下：\n\n- 本教程的内容仅代表个人观点，不代表so-vits-svc团队或原作者的观点。\n- 本教程假定使用so-vits-svc团队维护的仓库。请遵守所涉及的任何开源代码的开源许可证。\n- 基于sovits制作并发布在视频平台上的任何视频，都必须在说明中明确注明用于语音转换器的输入歌声或音频来源。例如，如果在分离人声后使用他人的视频或音频作为输入源，则必须提供原始视频或音乐的清晰链接。如果使用自己的声音或由其他语音合成引擎合成的声音，也必须在说明中予以注明。\n- 确保用于创建数据集的数据来源合法合规，并且数据提供者知晓您正在创作的内容及其可能带来的后果。因输入来源引发的任何侵权问题均由您自行负责。在使用其他商业语音合成软件作为输入来源时，务必遵守该软件的使用条款。请注意，许多语音合成引擎的使用条款明确禁止将其用作转换的输入源！\n- 云端训练和推理可能会产生费用。如果您是未成年人，请在继续操作前征得监护人的许可和理解。本教程不对未经授权使用可能引发的任何后续问题负责。\n- 本地训练（尤其是在性能较弱的硬件上）可能需要设备长时间高负载运行。请确保对设备采取适当的维护和散热措施。\n- 由于设备原因，本教程仅在Windows系统上进行了测试。对于Mac和Linux用户，请确保具备一定的问题解决能力。\n- 继续使用本仓库即表示您同意README中所述的相关规定。本README仅履行其指导义务，不对后续可能出现的问题承担责任。\n\n## 0.2 硬件要求\n\n1. 训练**必须**使用GPU进行！对于推理，可以通过**命令行推理**或**WebUI推理**完成，若对速度要求不高，可使用CPU或GPU。\n2. 如果您计划训练自己的模型，请准备一块**至少具有6GB显存的NVIDIA显卡**。\n3. 确保您的计算机虚拟内存设置为**至少30GB**，最好将其设置在SSD上，否则会非常缓慢。\n4. 对于云端训练，建议使用[Google Colab](https:\u002F\u002Fcolab.google\u002F)，您可以按照我们提供的[sovits4_for_colab.ipynb](.\u002Fsovits4_for_colab.ipynb)进行配置。\n\n## 0.3 准备工作\n\n1. 准备至少30分钟（越多越好！）的**干净人声**作为您的训练集，要求**无背景噪声且无混响**。演唱时最好保持**音色一致**，确保**音域宽广（训练集的音域决定了训练后模型的音域！）**，并使**音量适中**。如果可能，可使用Audition等音频处理软件进行**音量匹配**。\n2. **重要！** 提前下载训练所需的**基础模型**。请参考[2.2.2 预训练基础模型（强烈推荐）](#222-pre-trained-baseline-models-highly-recommended)。\n3. 对于推理：准备**干燥的人声**，要求**背景噪声\u003C30dB**，最好**无混响或和声**。\n\n> [!NOTE]\n>\n> **注1**：唱歌和说话都可以作为训练集，但使用语音作为训练集可能会导致**推理时高低音出现问题（通常称为音域问题或声音闷塞）**，因为训练集的音域很大程度上决定了训练后模型的音域。因此，如果您的最终目标是实现歌唱效果，建议使用唱歌的人声作为训练集。\n>\n> **注2**：当使用男声模型来推理女歌手演唱的歌曲时，若出现明显闷塞现象，可尝试降低音高（通常降低12个半音，即一个八度）。同样地，当使用女声模型来推理男歌手演唱的歌曲时，可以尝试提高音高。\n>\n> **✨ 2024年3月8日最新建议 ✨**：目前，[GPT-SoVITS](https:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS)项目的TTS（文本转语音）功能相比so-vits-svc的TTS，所需训练集更小、训练速度更快且效果更好。因此，如果您希望使用语音合成功能，请切换至[GPT-SoVITS](https:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS)。相应地，建议在此项目中使用唱歌的人声作为训练集。\n\n# 1. 环境依赖\n\n✨ **本项目所需环境**: [NVIDIA-CUDA](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit) | [Python](https:\u002F\u002Fwww.python.org\u002F) = 3.8.9（推荐此版本）| [Pytorch](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F) | [FFmpeg](https:\u002F\u002Fffmpeg.org\u002F)\n\n✨ **您也可以尝试使用我的脚本进行一键环境搭建和WebUI启动：[so-vits-svc-webUI-QuickStart-bat](https:\u002F\u002Fgithub.com\u002FSUC-DriverOld\u002Fso-vits-svc-webUI-QuickStart-bat)**\n\n## 1.1 so-vits-svc4.1 源代码\n\n您可以通过以下任一方式下载或克隆源代码：\n\n1. **从Github项目页面下载源代码ZIP文件**：前往[so-vits-svc官方仓库](https:\u002F\u002Fgithub.com\u002Fsvc-develop-team\u002Fso-vits-svc)，点击右上角的绿色`Code`按钮，选择`Download ZIP`即可下载压缩包。如果需要其他分支的代码，请先切换到相应分支。下载完成后，将ZIP文件解压到任意目录，该目录即为您的工作目录。\n\n2. 
**使用git克隆源代码**：运行以下命令：\n\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002Fsvc-develop-team\u002Fso-vits-svc.git\n   ```\n\n## 1.2 Python\n\n- 请访问[Python官方网站](https:\u002F\u002Fwww.python.org\u002F)下载Python 3.8，并将其**添加到系统环境变量PATH中**。详细的安装方法及Path配置在此不再赘述，网上均可轻松找到相关教程。\n\n```bash\n# Conda配置方法，将YOUR_ENV_NAME替换为您想要创建的虚拟环境名称。\nconda create -n YOUR_ENV_NAME python=3.8 -y\nconda activate YOUR_ENV_NAME\n# 在执行任何命令之前，请确保您已进入该虚拟环境！\n```\n\n- 安装完成后，在命令提示符中输入`python`。若输出类似如下内容，则表示安装成功：\n\n  ```bash\n  Python 3.8.9 (tags\u002Fv3.8.9:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)] on win32\n  Type \"help\", \"copyright\", \"credits\" or \"license\" for more information.\n  >>>\n  ```\n\n**关于Python版本**：经测试，Python 3.8.9可稳定运行该项目（尽管更高版本也可能适用）。\n\n## 1.3 Pytorch\n\n> [!IMPORTANT]\n>\n> ✨ 我们强烈建议您安装Pytorch 11.7或11.8。版本高于12.0可能与当前项目不兼容。\n\n- 我们需要**单独安装**`torch`、`torchaudio`、`torchvision`库。直接访问[Pytorch官网](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F)选择所需版本，然后将“Run this Command”部分显示的命令复制到控制台执行安装。您还可以从[这里](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Fprevious-versions\u002F)下载旧版本的Pytorch。\n\n- 安装完`torch`、`torchaudio`、`torchvision`后，在cmd控制台中使用以下命令检查torch是否能成功调用CUDA。若最后一行显示`True`，则表示成功；若显示`False`，则需重新安装正确版本。\n\n```bash\npython\n# 按Enter键执行\nimport torch\n# 按Enter键执行\nprint(torch.cuda.is_available())\n# 按Enter键执行\n```\n\n> [!NOTE]\n>\n> 1. 如果需要手动指定`torch`的版本，只需在后面加上版本号。例如：`pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu117`。\n> 2. 在安装适用于CUDA 11.7的Pytorch时，可能会遇到错误`ERROR: Package 'networkx' requires a different Python: 3.8.9 not in '>=3.9'`。此时，可先执行`pip install networkx==3.0`，再继续安装Pytorch，以避免类似错误。\n> 3. 由于版本更新，您可能无法直接复制Pytorch 11.7的下载链接。在这种情况下，您可以直接复制下方的安装命令来安装Pytorch 11.7。或者，您也可以从[这里](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Fprevious-versions\u002F)下载旧版本。\n\n```bash\npip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu117\n```\n\n## 1.4 其他依赖的安装\n\n> [!IMPORTANT]\n> ✨ 在开始安装其他依赖之前，请务必**下载并安装**[Visual Studio 2022](https:\u002F\u002Fvisualstudio.microsoft.com\u002F)或[Microsoft C++ Build Tools](https:\u002F\u002Fvisualstudio.microsoft.com\u002Fzh-hans\u002Fvisual-cpp-build-tools\u002F)（后者体积较小）。**选择并安装“使用C++的桌面开发”组件包**，然后执行修改操作，等待安装完成。\n\n- 在从[1.1](#11-so-vits-svc41-source-code)获取的文件夹中右键单击，选择**在终端中打开**。使用以下命令首先更新`pip`、`wheel`和`setuptools`。\n\n```bash\npip install --upgrade pip==23.3.2 wheel setuptools\n```\n\n- 执行以下命令安装库文件（**若出现错误，请多次尝试，直至无误且所有依赖均安装成功**）。请注意，项目文件夹中有三个`requirements`文本文件，此处请选择`requirements_win.txt`。\n\n```bash\npip install -r requirements_win.txt\n```\n\n- 确保安装**正确且无误**后，使用以下命令更新`fastapi`、`gradio`和`pydantic`依赖：\n\n```bash\npip install --upgrade fastapi==0.84.0\npip install --upgrade pydantic==1.10.12\npip install --upgrade gradio==3.41.2\n```\n\n## 1.5 FFmpeg\n\n- 请访问[FFmpeg官方网站](https:\u002F\u002Fffmpeg.org\u002F)下载FFmpeg。将其解压到任意位置，并将路径添加到环境变量中。导航至`.\\ffmpeg\\bin`（详细的安装方法及Path配置在此不再赘述，网上均可轻松找到相关教程）。\n\n- 安装完成后，在cmd控制台中输入`ffmpeg -version`。若输出类似如下内容，则表示安装成功：\n\n```bash\nffmpeg version git-2020-08-12-bb59bdb Copyright (c) 2000-2020 the FFmpeg developers\nbuilt with gcc 10.2.1 (GCC) 20200805\nconfiguration: 这里是一堆配置信息\nlibavutil      56. 58.100 \u002F 56. 58.100\nlibavcodec     58.100.100 \u002F 58.100.100\n...\n```\n\n# 2. 
配置与训练\n\n✨ 本节是整篇教程文档中最关键的部分。它参考了[官方文档](https:\u002F\u002Fgithub.com\u002Fsvc-develop-team\u002Fso-vits-svc#readme)，并加入了一些解释和说明，以便更好地理解。\n\n✨ 在进入第二部分内容之前，请确保您的计算机虚拟内存设置为**30GB或以上**，最好位于固态硬盘（SSD）上。具体操作方法可在网络上搜索获取。\n\n## 2.1 与4.0模型兼容性相关问题\n\n- 您可以通过修改4.0模型的`config.json`来确保对4.0模型的支持。需要在`config.json`中的`model`部分添加`speech_encoder`字段，如下所示：\n\n```bash\n  \"model\":\n  {\n    # 其他内容省略\n\n    # “ssl_dim”，填写256或768，需与下方“speech_encoder”的值一致\n    \"ssl_dim\": 256,\n    # 发音人数量\n    \"n_speakers\": 200,\n    # 或者“vec768l12”，但请注意，此处的值应与上方的“ssl_dim”匹配。即，256对应vec256l9，768对应vec768l12。\n    \"speech_encoder\":\"vec256l9\"\n    # 如果您不确定自己的模型是vec768l12还是vec256l9，可以通过查看“gin_channels”字段的值来确认。\n\n    # 其他内容省略\n  }\n```\n\n## 2.2 预下载模型文件\n\n### 2.2.1 必备项\n\n> [!WARNING]\n>\n> **您必须选择以下编码器之一使用：**\n>\n> - “vec768l12”\n> - “vec256l9”\n> - “vec256l9-onnx”\n> - “vec256l12-onnx”\n> - “vec768l9-onnx”\n> - “vec768l12-onnx”\n> - “hubertsoft-onnx”\n> - “hubertsoft”\n> - “whisper-ppg”\n> - “cnhubertlarge”\n> - “dphubert”\n> - “whisper-ppg-large”\n> - “wavlmbase+”\n\n| 编码器                  | 下载链接                                                                                                                                                                                                        | 存放位置                                                                    | 说明                                                             |\n| ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- | ----------------------------------------------------------------------- |\n| contentvec（推荐）       | [checkpoint_best_legacy_500.pt](https:\u002F\u002Fibm.box.com\u002Fs\u002Fz1wgl1stco8ffooyatzdwsqn2psd9lrr)                                                                                                                              | 放入`pretrain`目录                                                           | `vec768l12`和`vec256l9`需要此编码器                         |\n|                          | [hubert_base.pt](https:\u002F\u002Fhuggingface.co\u002Flj1995\u002FVoiceConversionWebUI\u002Fresolve\u002Fmain\u002Fhubert_base.pt)                                                                                                                     | 重命名为checkpoint_best_legacy_500.pt，然后放入`pretrain`目录              | 效果与上述`checkpoint_best_legacy_500.pt`相同，但仅199MB         |\n| hubertsoft               | [hubert-soft-0d54a1f4.pt](https:\u002F\u002Fgithub.com\u002Fbshall\u002Fhubert\u002Freleases\u002Fdownload\u002Fv0.1\u002Fhubert-soft-0d54a1f4.pt)                                                                                                           | 放入`pretrain`目录                                                          | 被so-vits-svc3.0使用                                                  |\n| Whisper-ppg              | [medium.pt](https:\u002F\u002Fopenaipublic.azureedge.net\u002Fmain\u002Fwhisper\u002Fmodels\u002F345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1\u002Fmedium.pt)                                                                       | 放入`pretrain`目录                                                          | 与`whisper-ppg`兼容                                                   |\n| whisper-ppg-large        | 
[large-v2.pt](https:\u002F\u002Fopenaipublic.azureedge.net\u002Fmain\u002Fwhisper\u002Fmodels\u002F81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524\u002Flarge-v2.pt)                                                                   | 放入`pretrain`目录                                                          | 与`whisper-ppg-large`兼容                                             |\n| cnhubertlarge            | [chinese-hubert-large-fairseq-ckpt.pt](https:\u002F\u002Fhuggingface.co\u002FTencentGameMate\u002Fchinese-hubert-large\u002Fresolve\u002Fmain\u002Fchinese-hubert-large-fairseq-ckpt.pt)                                                                | 放入`pretrain`目录                                                          | -                                                                       |\n| dphubert                 | [DPHuBERT-sp0.75.pth](https:\u002F\u002Fhuggingface.co\u002Fpyf98\u002FDPHuBERT\u002Fresolve\u002Fmain\u002FDPHuBERT-sp0.75.pth)                                                                                                                        | 放入`pretrain`目录                                                          | -                                                                       |\n| WavLM                    | [WavLM-Base+.pt](https:\u002F\u002Fvalle.blob.core.windows.net\u002Fshare\u002Fwavlm\u002FWavLM-Base+.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D) | 放入`pretrain`目录                                                          | 下载链接可能存在问题，无法下载                                  |\n| OnnxHubert\u002FContentVec    | [MoeSS-SUBModel](https:\u002F\u002Fhuggingface.co\u002FNaruseMioShirakana\u002FMoeSS-SUBModel\u002Ftree\u002Fmain)                                                                                                                                 | 放入`pretrain`目录                                                          | -                                                                       |\n\n#### 各编码器详细说明\n\n| 编码器名称                   | 优点                                                         | 缺点                     |\n| ------------------------------ | ------------------------------------------------------------------ | --------------------------------- |\n| `vec768l12`（最推荐）         | 声音保真度最佳，基础模型较大，支持响度嵌入                       | 发音清晰度较弱                 |\n| `vec256l9`                     | 无特别优势                                                       | 不支持扩散模型                 |\n| `hubertsoft`                   | 发音清晰度强                                                   | 容易出现声音泄漏               |\n| `whisper-ppg`                  | 发音清晰度最强                                                 | 容易出现声音泄漏，显存占用高   |\n\n### 2.2.2 预训练基础模型（强烈推荐）\n\n- 预训练基础模型文件：`G_0.pth`、`D_0.pth`。请将其放置于`logs\u002F44k`目录下。\n\n- 扩散模型预训练基础模型文件：`model_0.pt`。请将其放置于`logs\u002F44k\u002Fdiffusion`目录下。\n\n该扩散模型参考了[DDSP-SVC](https:\u002F\u002Fgithub.com\u002Fyxlllc\u002FDDSP-SVC)中的扩散模型，基础模型与[DDSP-SVC](https:\u002F\u002Fgithub.com\u002Fyxlllc\u002FDDSP-SVC)的扩散模型兼容。部分提供的基础模型文件来自“羽毛布団”（[bilibili空间](https:\u002F\u002Fspace.bilibili.com\u002F3493141443250876)）的集成包，在此向其表示感谢。\n\n**提供4.1训练用的基础模型，请自行下载（需具备外网条件）**\n\n| 编码器类型                        | 主模型基础                                                                                                                                                                                                                  | 扩散模型基础                      
                                                                           | 说明                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |\n| ----------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| vec768l12                           | [G_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fvec768l12\u002FG_0.pth), [D_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fvec768l12\u002FD_0.pth)                           | [model_0.pt](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fdiffusion\u002F768l12\u002Fmodel_0.pt)      | 如果仅训练100步扩散，即 `k_step_max = 100`，请使用 [model_0.pt](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fdiffusion\u002F768l12\u002Fmax100\u002Fmodel_0.pt) 作为扩散模型                                                                                                                                                                                                                                                                                         |\n| vec768l12 (带响度嵌入)              | [G_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fvec768l12\u002Fvol_emb\u002FG_0.pth), [D_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fvec768l12\u002Fvol_emb\u002FD_0.pth)           | [model_0.pt](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fdiffusion\u002F768l12\u002Fmodel_0.pt)      | 如果仅训练100步扩散，即 `k_step_max = 100`，请使用 [model_0.pt](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fdiffusion\u002F768l12\u002Fmax100\u002Fmodel_0.pt) 作为扩散模型                                                                                                                                                                                                                                                                                         |\n| vec256l9                            | 
[G_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fvec256l9\u002FG_0.pth), [D_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fvec256l9\u002FD_0.pth)                             | 不支持                                                                                                        | -                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |\n| hubertsoft                          | [G_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fhubertsoft\u002FG_0.pth), [D_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fhubertsoft\u002FD_0.pth)                         | [model_0.pt](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fdiffusion\u002Fhubertsoft\u002Fmodel_0.pt)  | -                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |\n| whisper-ppg                         | [G_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fwhisper-ppg\u002FG_0.pth), [D_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fwhisper-ppg\u002FD_0.pth)                       | [model_0.pt](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fdiffusion\u002Fwhisper-ppg\u002Fmodel_0.pt) | -                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |\n| tiny (vec768l12_vol_emb)            | [G_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Ftiny\u002Fvec768l12_vol_emb\u002FG_0.pth), [D_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Ftiny\u002Fvec768l12_vol_emb\u002FD_0.pth) | -                                                                                                                    | TINY 基于原始 So-VITS 模型，通过减少网络参数、采用深度可分离卷积和共享参数的 FLOW 技术，显著缩小了模型体积并提升了推理速度。TINY 专为实时转换设计；由于参数减少，其转换效果理论上不如原模型。So-VITS 的实时转换 GUI 
正在开发中。在此之前，如无特殊需求，不建议训练 TINY 模型。 |\n\n> [!WARNING]\n>\n> 不提供未提及的其他编码器的预训练模型。请在没有基础模型的情况下进行训练，这可能会显著增加训练难度！\n\n**基础模型与支持情况**\n\n| 标准基础 | 响度嵌入 | 响度嵌入 + TINY | 全扩散 | 100 步浅扩散 |\n| ------------- | ------------------ | ------------------------- | -------------- | -------------------------- |\n| Vec768L12     | 支持          | 支持                 | 支持      | 支持                  |\n| Vec256L9      | 支持          | 不支持             | 不支持  | 不支持              |\n| hubertsoft    | 支持          | 不支持             | 支持      | 不支持              |\n| whisper-ppg   | 支持          | 不支持             | 支持      | 不支持              |\n\n\n\n### 2.2.3 可选项目（根据需要选择）\n\n**1. NSF-HIFIGAN**\n\n如果使用 `NSF-HIFIGAN 增强器` 或 `浅扩散`，则需要下载由 [OpenVPI] 提供的预训练 NSF-HIFIGAN 模型。如果不需使用，可跳过此步骤。\n\n- 预训练 NSF-HIFIGAN 语音合成器：\n  - 2022年12月版本：[nsf_hifigan_20221211.zip](https:\u002F\u002Fgithub.com\u002Fopenvpi\u002Fvocoders\u002Freleases\u002Fdownload\u002Fnsf-hifigan-v1\u002Fnsf_hifigan_20221211.zip)；\n  - 2024年2月版本：[nsf_hifigan_44.1k_hop512_128bin_2024.02.zip](https:\u002F\u002Fgithub.com\u002Fopenvpi\u002Fvocoders\u002Freleases\u002Fdownload\u002Fnsf-hifigan-44.1k-hop512-128bin-2024.02\u002Fnsf_hifigan_44.1k_hop512_128bin_2024.02.zip)\n- 解压后，将四个文件放入 `pretrain\u002Fnsf_hifigan` 目录中。\n- 如果下载的是2024年2月版本的语音合成器，请将 `model.ckpt` 重命名为 `model`，即去掉文件扩展名。\n\n**2. RMVPE**\n\n如果使用 `rmvpe` F0 预测器，则需要下载预训练的 RMVPE 模型。\n\n- 下载模型 [rmvpe.zip](https:\u002F\u002Fgithub.com\u002Fyxlllc\u002FRMVPE\u002Freleases\u002Fdownload\u002F230917\u002Frmvpe.zip)，这是目前推荐的版本。\n- 解压 `rmvpe.zip`，将 `model.pt` 文件重命名为 `rmvpe.pt`，然后将其放入 `pretrain` 目录中。\n\n**3. FCPE（预览版）**\n\n[FCPE](https:\u002F\u002Fgithub.com\u002FCNChTu\u002FMelPE)（快速上下文音高估计器）是由 svc-develop-team 独立开发的新一代 F0 预测器，专为实时语音转换设计。未来它将成为 Sovits 实时语音转换的首选 F0 预测器。\n\n如果使用 `fcpe` F0 预测器，则需要下载预训练的 FCPE 模型。\n\n- 下载模型 [fcpe.pt](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fylzz1997\u002Frmvpe_pretrain_model\u002Fresolve\u002Fmain\u002Ffcpe.pt)。\n- 将其放入 `pretrain` 目录中。\n\n## 2.3 数据准备\n\n1. 按照以下文件结构将数据集整理到 `dataset_raw` 目录中。\n\n```\ndataset_raw\n├───speaker0\n│   ├───xxx1-xxx1.wav\n│   ├───...\n│   └───Lxx-0xx8.wav\n└───speaker1\n    ├───xx2-0xxx2.wav\n    ├───...\n    └───xxx7-xxx007.wav\n```\n\n2. 您可以自定义说话人的名称。\n\n```\ndataset_raw\n└───suijiSUI\n    ├───1.wav\n    ├───...\n    └───25788785-20221210-200143-856_01_(Vocals)_0_0.wav\n```\n\n3. 
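整理文件时，可以用几条简单的 shell 命令批量建立上述目录结构（以下仅为示意，`speaker0`、`speaker1` 以及源文件路径均为假设示例，请替换为你自己的说话人名称和音频位置）：

```bash
# 为每位说话人建立独立子目录（目录名即说话人名称，可自定义）
mkdir -p dataset_raw/speaker0 dataset_raw/speaker1

# 将各自切片好的 wav 文件复制到对应目录（源路径仅为示例）
cp /path/to/speaker0_wavs/*.wav dataset_raw/speaker0/
cp /path/to/speaker1_wavs/*.wav dataset_raw/speaker1/
```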
此外，您还需要在 `dataset_raw` 中创建并编辑 `config.json` 文件。\n\n```json\n{\n  \"n_speakers\": 10,\n\n  \"spk\": {\n    \"speaker0\": 0,\n    \"speaker1\": 1\n  }\n}\n```\n\n- `\"n_speakers\": 10\"`：该数字表示说话人的数量，从1开始计数，需与下方的数字对应。\n- `\"speaker0\": 0\"`：“speaker0”指的是说话人的名字，可以更改。数字0、1、2…代表说话人编号，从0开始。\n\n## 2.4 数据预处理\n\n### 2.4.0 音频切片\n\n- 将音频切分为 `5秒 - 15秒` 的片段。稍长的片段也可以接受，但过长的片段可能导致训练或预处理过程中出现内存不足的错误。\n  \n- 您可以使用 [audio-slicer-GUI](https:\u002F\u002Fgithub.com\u002Fflutydeer\u002Faudio-slicer) 或 [audio-slicer-CLI](https:\u002F\u002Fgithub.com\u002Fopenvpi\u002Faudio-slicer) 来辅助切片。通常只需调整 `Minimum Interval` 参数即可。对于普通语音材料，使用默认值通常就足够了；而对于歌唱材料，则可以将其调整为 `100` 甚至 `50`。\n\n- 切片完成后，手动处理过长（超过15秒）或过短（低于4秒）的音频。较短的音频可以拼接成多个片段，而较长的音频则可以手动分割。\n\n> [!WARNING]\n>\n> **如果您使用 Whisper-ppg 声学编码器进行训练，所有切片的长度都必须小于30秒。**\n\n### 2.4.1 重采样至44100Hz单声道\n\n使用以下命令（如果已进行响度匹配，则跳过此步骤）：\n\n```bash\npython resample.py\n```\n\n> [!NOTE]\n>\n> 虽然该项目提供了用于重采样、转单声道和响度匹配的脚本 `resample.py`，但默认的响度匹配会将音量调整至0dB，这可能会降低音频质量。此外，Python中的响度归一化库 `pyloudnorm` 无法应用电平限制，可能导致削波现象。建议考虑使用专业的音频处理软件，如 `Adobe Audition` 进行响度匹配。您也可以使用我开发的响度匹配工具 [Loudness Matching Tool](https:\u002F\u002Fgithub.com\u002FAI-Hobbyist\u002FLoudness-Matching-Tool)。如果您已经使用其他软件进行了响度匹配，可以在运行上述命令时添加 `--skip_loudnorm` 参数来跳过响度匹配步骤。例如：\n\n```bash\npython resample.py --skip_loudnorm\n```\n\n### 2.4.2 自动数据集划分及配置文件生成\n\n使用以下命令（如果需要响度嵌入，则跳过此步骤）：\n\n```bash\npython preprocess_flist_config.py --speech_encoder vec768l12\n```\n\n`speech_encoder` 参数有七种选项，具体说明见 **[2.2.1 必需项及各编码器详解](#detailed-explanation-of-each-encoder)**。若省略 `speech_encoder` 参数，则默认值为 `vec768l12`。\n\n```\nvec768l12\nvec256l9\nhubertsoft\nwhisper-ppg\nwhisper-ppg-large\ncnhubertlarge\ndphubert\n```\n\n#### 使用响度嵌入\n\n- 使用响度嵌入时，训练好的模型会匹配输入源的响度；否则，它将匹配训练集的响度。\n- 如果使用响度嵌入，需要添加 `--vol_aug` 参数，例如：\n\n```bash\npython preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug\n```\n\n### 2.4.3 根据需要修改配置文件\n\n#### config.json\n\n- `vocoder_name`: 选择声码器，默认为 `nsf-hifigan`。\n- `log_interval`: 日志输出频率，默认为 `200`。\n- `eval_interval`: 验证和保存模型的频率，默认为 `800`。\n- `epochs`: 总训练轮数，默认为 `10000`。达到该轮数后训练将自动停止。\n- `learning_rate`: 学习率，建议保持默认值。\n- `batch_size`: 每次训练步骤加载到 GPU 上的数据量，应调整为小于 GPU 显存容量的值。\n- `all_in_mem`: 将整个数据集加载到内存中。如果某些平台的磁盘 IO 过慢，且内存容量远大于数据集大小时，可启用此选项。\n- `keep_ckpts`: 训练过程中保留的最近模型数量，`0` 表示保留所有模型。默认仅保留最后 `3` 个模型。\n\n**声码器选项**\n\n```\nnsf-hifigan\nnsf-snake-hifigan\n```\n\n#### diffusion.yaml\n\n- `cache_all_data`: 将整个数据集加载到内存中。如果某些平台的磁盘 IO 过慢，且内存容量远大于数据集大小时，可启用此选项。\n- `duration`: 训练时音频片段的持续时间。请根据 GPU 显存大小进行调整。**注意：该值必须小于训练集中最短音频的持续时间！**\n- `batch_size`: 每次训练步骤加载到 GPU 上的数据量，应调整为小于 GPU 显存容量的值。\n- `timesteps`: 扩散模型的总步数，默认为 1000。完整的高斯扩散过程共有 1000 步。\n- `k_step_max`: 训练时仅训练前 `k_step_max` 步以节省训练时间。请注意，该值必须小于 `timesteps`。`0` 表示训练整个扩散模型。**注意：如果不训练整个扩散模型，则无法仅使用扩散模型进行推理！**\n\n### 2.4.3 生成 Hubert 和 F0\n\n使用以下命令（若训练浅层扩散则跳过此步骤）：\n\n```bash\n# 下述命令使用 rmvpe 作为 F0 预测器，您可手动修改\npython preprocess_hubert_f0.py --f0_predictor rmvpe\n```\n\n`f0_predictor` 参数有六种选项，部分 F0 预测器需要下载额外的预处理模型。详情请参阅 **[2.2.3 可选项目（按需选择）](#223-optional-items-choose-as-needed)**。\n\n```\ncrepe\ndio\npm\nharvest\nrmvpe（推荐！）\nfcpe\n```\n\n#### 各 F0 预测器的优缺点\n\n| 预测器 | 优点                                                                                          | 缺点                                                                     |\n| -------- | --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |\n| pm       | 速度快、资源消耗低                                                                     
       | 容易产生气声                                                             |\n| crepe    | 很少产生气声                                                                                  | 内存消耗大，可能产生音准偏差                                             |\n| dio      | -                                                                                             | 可能产生音准偏差                                                         |\n| harvest  | 在低音域表现更好                                                                              | 在其他音域表现较差                                                       |\n| rmvpe    | 几乎无瑕疵，目前最为准确的预测器                                                              | 几乎没有缺点（极低音可能存在问题）                                     |\n| fcpe     | 由 SVC 团队开发，目前速度最快的预测器，准确度与 crepe 相当                                   | -                                                                        |\n\n> [!NOTE]\n>\n> 1. 如果训练集噪声过大，请使用 crepe 处理 F0。\n> 2. 如果省略 `f0_predictor` 参数，则默认使用 rmvpe。\n\n**若需浅层扩散功能（可选），请添加 `--use_diff` 参数，例如：**\n\n```bash\n# 下述命令使用 rmvpe 作为 F0 预测器，您可手动修改\npython preprocess_hubert_f0.py --f0_predictor rmvpe --use_diff\n```\n\n**若处理速度较慢，或您的数据集较大，可添加 `--num_processes` 参数：**\n\n```bash\n# 下述命令使用 rmvpe 作为 F0 预测器，您可手动更改\npython preprocess_hubert_f0.py --f0_predictor rmvpe --num_processes 8\n# 所有工作进程将自动分配到多个线程\n```\n\n完成上述步骤后，生成的 `dataset` 目录即为预处理后的数据，您可以根据需要删除 `dataset_raw` 文件夹。\n\n## 2.5 训练\n\n### 2.5.1 主模型训练（必选）\n\n使用以下命令训练主模型。若训练中断，也可使用相同命令继续训练。\n\n```bash\npython train.py -c configs\u002Fconfig.json -m 44k\n```\n\n### 2.5.2 扩散模型（可选）\n\n- So-VITS-SVC 4.1 的一项重大更新是引入了浅层扩散机制，该机制会将 SoVITS 原始输出的音频转换为梅尔频谱图，添加噪声，并进行浅层扩散处理，然后再通过声码器输出音频。测试表明，**原始输出音频经过浅层扩散处理后，音质显著提升，能够有效解决电子噪声和背景噪声等问题**。\n- 如果需要使用浅层扩散功能，则需训练扩散模型。在训练之前，请确保已下载并正确放置 `NSF-HIFIGAN`（**请参阅 [2.2.3 可选项目](#223-optional-items-choose-as-needed）**），并在预处理生成 Hubert 和 F0 时添加 `--use_diff` 参数（**请参阅 [2.4.3 生成 Hubert 和 F0](#243-generating-hubert-and-f0)**）。\n\n要训练扩散模型，可以使用以下命令：\n\n```bash\npython train_diff.py -c configs\u002Fdiffusion.yaml\n```\n\n模型训练完成后，模型文件会保存在 `logs\u002F44k` 目录下，其中扩散模型会保存在 `logs\u002F44k\u002Fdiffusion` 中。\n\n> [!IMPORTANT]\n>\n> **如何判断模型是否训练到位？**\n>\n> 1. 这其实是一个非常无聊且毫无意义的问题。就好比你问老师如何让孩子学得好一样，除了你自己之外，没有人能回答这个问题。\n> 2. 模型训练的效果与你的数据集质量、持续时间、所选编码器、F0 算法，甚至一些玄学因素都有关。即便你已经得到了一个训练好的模型，最终的合成效果仍然取决于你的输入源以及推理参数。这并不是一个线性过程，涉及的变量太多。所以，如果你还在纠结“为什么我的模型听起来不像？”或者“怎么知道模型训练得怎么样呢？”，那我只能说：谁知道呢？\n> 3. 
不过这也并不意味着完全没有方法。你可以多祈祷、多拜一拜。我并不否认这种方式的有效性，但也可以借助一些科学工具，比如 TensorBoard 等。下面的 [2.5.3 TensorBoard](#253-tensorboard) 就会教你如何利用 TensorBoard 来辅助判断训练状态。**当然，最强大的工具其实在于你自己的耳朵。如何判断声学模型是否训练到位？戴上耳机，用耳朵去听就知道了。**\n\n**Epoch 与 Step 的关系**：\n\n在训练过程中，根据 `config.json` 中的设置，每隔指定步数（默认为 800 步，对应 `eval_interval` 值）就会保存一次模型。\n需要注意的是，Epoch 和 Step 是不同的概念：1 Epoch 表示训练集中所有样本都参与了一次完整的学习过程；而 1 Step 则表示完成了一次学习步骤。由于存在 `batch_size`，每次学习步骤可能包含多个样本。因此，Epoch 和 Step 之间的换算关系如下：\n\n$$\nEpoch = \\frac{Step}{(\\text{数据集中的样本总数} \\div batch\\_size)}\n$$\n\n默认情况下，训练会在达到 10,000 个 Epoch 后结束（可以通过修改 `config.json` 中的 `epoch` 字段来调整上限），但通常几百个 Epoch 就能达到不错的效果。当你觉得训练差不多完成时，可以在训练终端按下 `Ctrl + C` 来中断训练。中断后，只要没有重新处理训练集，就可以从最近保存的检查点继续训练。\n\n### 2.5.3 TensorBoard\n\n你可以使用 TensorBoard 来可视化训练过程中损失函数值的变化趋势、聆听音频样本，并帮助判断模型的训练状态。**然而，对于 So-VITS-SVC 项目而言，损失函数值本身并没有太大的参考意义（无需比较或研究其具体数值），真正有参考价值的还是通过耳朵聆听推理后的音频输出！**\n\n- 使用以下命令打开 TensorBoard：\n\n```bash\ntensorboard --logdir=.\u002Flogs\u002F44k\n```\n\nTensorBoard 默认每 200 步会生成一次评估日志。如果训练尚未达到 200 步，TensorBoard 中将不会显示任何图像。这个 200 步的间隔可以通过修改 `config.json` 中的 `log_interval` 值来调整。\n\n- 关于损失的说明\n\n你无需深入理解每个损失的具体含义。一般来说：\n\n- `loss\u002Fg\u002Ff0`、`loss\u002Fg\u002Fmel` 和 `loss\u002Fg\u002Ftotal` 应该呈现波动，并最终收敛到某个稳定值。\n- `loss\u002Fg\u002Fkl` 应保持较低水平的波动。\n- `loss\u002Fg\u002Ffm` 在训练中期应持续上升，而在后期则应逐渐放缓甚至开始下降。\n\n> [!IMPORTANT]\n>\n> ✨ 观察损失曲线的变化趋势可以帮助你判断模型的训练状态。然而，仅凭损失值并不能作为判断模型训练状态的唯一依据，**事实上，这些损失值的参考意义并不大。判断模型是否训练到位，仍然需要依靠耳朵聆听音频输出**。\n\n> [!WARNING]\n>\n> 1. 对于小型数据集（30 分钟甚至更小），在加载基础模型时，不建议进行过长时间的训练，这样可以更好地发挥基础模型的优势。通常几千步甚至几百步就能取得较好的效果。\n> 2. TensorBoard 中的音频样本是基于你的验证集生成的，**并不能代表模型的最终表现**。\n\n# 3. 推理\n\n✨ 在进行推理之前，请准备好你需要推理的干声文件，确保其无背景噪声\u002F混响且质量良好。你可以使用 [UVR5](https:\u002F\u002Fgithub.com\u002FAnjok07\u002Fultimatevocalremovergui\u002Freleases\u002Ftag\u002Fv5.6) 进行处理以获得干声。此外，我还制作了一个 [UVR5 人声分离教程](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1F4421c7qU\u002F)。\n\n## 3.1 命令行推理\n\n使用 inference_main.py 进行推理\n\n```bash\n\n# 示例\npython inference_main.py -m \"logs\u002F44k\u002FG_30400.pth\" -c \"configs\u002Fconfig.json\" -n \"your_inference_audio.wav\" -t 0 -s \"speaker\"\n```\n\n**必填参数：**\n\n- `-m` | `--model_path`: 模型路径\n- `-c` | `--config_path`: 配置文件路径\n- `-n` | `--clean_names`: wav文件名列表，位于raw文件夹中\n- `-t` | `--trans`: 音高调整，支持正负值（以半音为单位）\n- `-s` | `--spk_list`: 合成目标说话人名称\n- `-cl` | `--clip`: 音频强制裁剪，默认0为自动裁剪，单位为秒。\n\n> [!NOTE]\n>\n> **音频裁剪**\n>\n> - 在推理过程中，裁剪工具会根据静音段将上传的音频分割成若干小片段，推理完成后重新拼接成完整音频。这种方式有利于降低小片段音频对显存的占用，从而实现对长音频的分段推理，避免显存溢出。裁剪阈值参数控制最小满刻度分贝值，低于该值的部分会被视为静音并移除。因此，当上传的音频较为嘈杂时，可以将此参数调高（如-30）；而对于较干净的音频，则可调低至-50，以避免切掉呼吸声和微弱的人声。\n>\n> - 开发团队近期测试表明，较小的裁剪阈值（如-50）能够提升输出音频的清晰度，尽管其原理目前尚不明确。\n>\n> **强制裁剪** `-cl` | `--clip`\n>\n> - 在推理过程中，若连续的语音部分长时间无静音间隔，裁剪工具有时会产生过长的音频片段，可能导致显存溢出。自动音频裁剪功能会设定一个最大音频分割时长。在初次分割后，如果仍有超过该时长的音频片段，系统会按照此时长对其进行强制再次分割，以避免显存溢出问题。\n> - 强制裁剪可能会导致音频在单词中间被切断，从而造成合成语音的不连贯。您需要在高级设置中调整强制裁剪的交叉淡化长度来缓解这一问题。\n\n**可选参数：具体说明见下节**\n\n- `-lg` | `--linear_gradient`: 两个音频片段之间的交叉淡化长度。若强制裁剪后出现语音不连贯的情况，请调整该值；建议使用默认值0，单位为秒。\n- `-f0p` | `--f0_predictor`: 选择F0预测器，选项包括crepe、pm、dio、harvest、rmvpe、fcpe，默认为pm（注：crepe会对原始F0使用均值滤波）。不同F0预测器的优缺点参见[2.4.3 F0预测器的优缺点](#243-generating-hubert-and-f0)。\n- `-a` | `--auto_predict_f0`: 在语音转换过程中自动预测音高。请勿在转换歌声时启用此选项，否则可能导致严重跑调。\n- `-cm` | `--cluster_model_path`: 聚类模型或特征检索索引的路径。留空则自动设置为各解决方案模型的默认路径；若未训练聚类或特征检索模型，则随机填写。\n- `-cr` | `--cluster_infer_ratio`: 聚类方案或特征检索的比例，范围为0-1。若未训练聚类模型或特征检索模型，则默认为0。\n- `-eh` | `--enhance`: 是否使用NSF_HIFIGAN增强器。该选项对训练数据较少的模型有一定音质提升效果，但对训练充分的模型则可能产生负面影响，默认关闭。\n- `-shd` | `--shallow_diffusion`: 是否使用浅层扩散。启用此选项可解决部分电子音问题，默认关闭。当启用此选项时，NSF_HIFIGAN增强器将被禁用。\n- 
`-usm` | `--use_spk_mix`: 是否使用说话人混合\u002F动态语音混合。\n- `-lea` | `--loudness_envelope_adjustment`: 输入源响度包络替换与输出响度包络融合的比例。越接近1，越倾向于使用输出响度包络。\n- `-fr` | `--feature_retrieval`: 是否使用特征检索。若已使用聚类模型，则该选项将被禁用，此时cm和cr参数将分别变为特征检索的索引路径和混合比例。\n\n> [!NOTE]\n>\n> **聚类模型\u002F特征检索混合比例** `-cr` | `--cluster_infer_ratio`\n>\n> - 本参数控制在使用聚类模型或特征检索模型时线性参与的比例。聚类模型和特征检索模型均可略微提升音色相似度，但代价是降低发音准确性（特征检索的发音略优于聚类）。该参数范围为0-1，其中0表示未启用；数值越接近1，音色越相似，发音则越模糊。\n> - 聚类模型和特征检索共用此参数。加载模型时，实际使用的模型将由该参数决定。\n> - **请注意，当未加载聚类模型或特征检索模型时，请将此参数保持为0，否则会导致错误。**\n\n**浅层扩散设置：**\n\n- `-dm` | `--diffusion_model_path`: 扩散模型路径。\n- `-dc` | `--diffusion_config_path`: 扩散模型配置文件路径。\n- `-ks` | `--k_step`: 扩散步骤数。数值越大，结果越接近扩散模型的效果，默认为100。\n- `-od` | `--only_diffusion`: 纯扩散模式。该模式不加载sovits模型，仅基于扩散模型进行推理。\n- `-se` | `--second_encoding`: 二次编码。原始音频将在浅层扩散前再次编码，这是一个神秘的选项，有时效果很好，有时则不然。\n\n> [!NOTE]\n>\n> **关于浅层扩散步骤** `-ks` | `--k_step`\n>\n> 完整的高斯扩散过程需1000步。当浅层扩散步骤达到1000步时，最终输出结果将完全由扩散模型决定，而So-VITS模型的作用将被抑制。浅层扩散步骤越多，结果就越接近扩散模型的输出。**如果您希望仅利用浅层扩散去除电子噪声，同时尽可能保留So-VITS模型的音色，则可将浅层扩散步骤设置为30-50步。**\n\n> [!WARNING]\n>\n> 若使用`whisper-ppg`语音编码器进行推理，应将`--clip`设置为25，`-lg`设置为1。否则，推理将无法正常工作。\n\n## 3.2 WebUI 推理\n\n使用以下命令打开 WebUI 界面，**上传并加载模型，根据说明填写推理参数，上传推理音频，然后开始推理。**\n\n推理参数的详细说明与 [3.1 命令行推理](#31-command-line-inference) 中的参数相同，只是移至交互式界面，并配有简单说明。\n\n```bash\npython webUI.py\n```\n\n> [!WARNING]\n>\n> **请务必查看 [命令行推理](#31-command-line-inference)，以了解各参数的具体含义。特别注意 NOTE 和 WARNING 中的提醒！**\n\nWebUI 还内置了**文本到语音（TTS）**功能：\n\n- 文本到语音功能使用 Microsoft 的 edge_TTS 服务生成一段原始语音，再通过 So-VITS 将该语音转换为目标音色。需要注意的是，So-VITS 目前仅支持歌声的语音转换（SVC），并不具备**原生**的文本到语音功能！由于 Microsoft 的 edge_TTS 生成的语音较为生硬、缺乏情感，因此所有经过转换的音频也会呈现出类似的特点。**如果您需要带有情感的 TTS 功能，请访问 [GPT-SoVITS](https:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS) 项目。**\n- 目前，文本到语音功能共支持 55 种语言，覆盖了大多数常用语言。程序会根据文本框中输入的内容自动识别语言并进行转换。\n- 自动识别仅能确定语言，而某些语言可能有不同的口音或发音者。如果使用自动识别功能，程序将随机选择符合该语言和指定性别的一位发音者进行转换。若您的目标语言存在多种口音或发音者（例如英语），建议您手动指定一位特定口音的发音者。一旦手动指定了发音者，之前手动选择的性别选项将被忽略。\n\n# 4. 
可选增强功能\n\n✨ 如果您对之前的处理效果已经满意，或者对下文内容不太理解，可以跳过以下内容，不会影响模型的正常使用（这些可选增强功能的效果相对较小，通常只在特定数据上有所体现，但在大多数情况下并不明显）。\n\n## 4.1 自动 F0 预测\n\n在模型训练过程中，会同时训练一个 F0 预测器，它是一种自动音高调整功能，能够在推理时使模型音高与源音高匹配，对于语音转换任务尤其有用，能够更好地匹配音高。**然而，在转换歌声时切勿启用此功能！！！否则会导致严重的跑调！！！**\n\n- 命令行推理：在 `inference_main` 中将 `auto_predict_f0` 设置为 `true`。\n- WebUI 推理：勾选相应的选项。\n\n## 4.2 聚类音色泄漏控制\n\n聚类方案可以减少音色泄漏，使模型输出更接近目标音色（尽管效果并不显著）。然而，单独使用聚类可能会降低模型的发音清晰度（导致发音模糊）。本模型采用融合方式，允许线性控制聚类方案与非聚类方案的比例。也就是说，您可以手动调整“更接近目标音色”与“清晰发音”之间的比例，找到一个合适的平衡点。\n\n使用聚类无需更改前面提到的现有步骤，只需额外训练一个聚类模型即可。虽然效果有限，但训练成本也相对较低。\n\n- 训练方法：\n\n```bash\n# 使用 CPU 训练：\npython cluster\u002Ftrain_cluster.py\n# 或使用 GPU 训练：\npython cluster\u002Ftrain_cluster.py --gpu\n```\n\n训练完成后，模型输出将保存在 `logs\u002F44k\u002Fkmeans_10000.pt`。\n\n- 命令行推理时：\n  - 在 `inference_main.py` 中指定 `cluster_model_path`。\n  - 在 `inference_main.py` 中指定 `cluster_infer_ratio`，其中 `0` 表示完全不使用聚类，`1` 表示仅使用聚类。通常设置为 `0.5` 即可。\n- WebUI 推理时：\n  - 上传并加载聚类模型。\n  - 设置聚类模型\u002F特征检索的混合比例，范围为 0 到 1，其中 0 表示完全不使用聚类\u002F特征检索。使用聚类\u002F特征检索可以提高音色相似度，但可能会降低发音清晰度（如果使用，建议设置为 0.5 左右）。\n\n## 4.3 特征检索\n\n与聚类方案类似，特征检索也可以减少音色泄漏，且在发音清晰度方面略优于聚类，不过可能会降低推理速度。同样采用融合方式，允许线性控制特征检索与非特征检索的比例。\n\n- 训练过程：在生成 hubert 和 f0 后，执行以下命令：\n\n```bash\npython train_index.py -c configs\u002Fconfig.json\n```\n\n训练完成后，模型输出将保存在 `logs\u002F44k\u002Ffeature_and_index.pkl`。\n\n- 命令行推理时：\n  - 首先指定 `--feature_retrieval`，此时聚类方案将自动切换为特征检索方案。\n  - 在 `inference_main.py` 中指定 `cluster_model_path` 为模型输出文件。\n  - 在 `inference_main.py` 中指定 `cluster_infer_ratio`，其中 `0` 表示完全不使用特征检索，`1` 表示仅使用特征检索。通常设置为 `0.5` 即可。\n- WebUI 推理时：\n  - 上传并加载聚类模型。\n  - 设置聚类模型\u002F特征检索的混合比例，范围为 0 到 1，其中 0 表示完全不使用聚类\u002F特征检索。使用聚类\u002F特征检索可以提高音色相似度，但可能会降低发音清晰度（如果使用，建议设置为 0.5 左右）。\n\n## 4.4 语音编码器微调\n\n在 So-VITS 中使用扩散模型时，经过扩散模型增强的梅尔频谱图会通过语音编码器输出为最终音频。语音编码器对输出音频的质量起着决定性作用。目前，So-VITS-SVC 使用的是 [NSF-HiFiGAN 社区语音编码器](https:\u002F\u002Fopenvpi.github.io\u002Fvocoders\u002F)。实际上，你也可以在 So-VITS 的 **扩散流程** 中，利用自己的数据集对该语音编码器模型进行微调，使其更好地适配你的模型任务。\n\n[SingingVocoders](https:\u002F\u002Fgithub.com\u002Fopenvpi\u002FSingingVocoders) 项目提供了语音编码器微调的方法。在 Diffusion-SVC 项目中，**使用微调后的语音编码器可以显著提升输出音质**。你也可以用自己收集的数据训练一个微调版的语音编码器，并将其应用到本集成包中。\n\n1. 使用 [SingingVocoders](https:\u002F\u002Fgithub.com\u002Fopenvpi\u002FSingingVocoders) 训练一个微调版的语音编码器，并获取其模型文件和配置文件。\n2. 将模型文件和配置文件放置于 `pretrain\u002F{微调后语音编码器名称}\u002F` 目录下。\n3. 修改对应模型的扩散模型配置文件 `diffusion.yaml` 如下：\n\n```yaml\nvocoder:\n  ckpt: pretrain\u002Fnsf_hifigan\u002Fmodel.ckpt # 此行填写你微调后语音编码器的路径\n  type: nsf-hifigan # 此行填写你微调后语音编码器的类型，不确定时请勿修改\n```\n\n1. 
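下面给出文件放置与配置修改对应关系的一个最小示意（目录名 `my_nsf_hifigan` 与源文件路径均为假设示例，请以实际微调得到的文件为准，并与 `diffusion.yaml` 中填写的路径保持一致）：

```bash
# 将微调后的声码器模型与配置放入 pretrain 下的自建目录（目录名为示例）
mkdir -p pretrain/my_nsf_hifigan
cp /path/to/finetuned/model.ckpt pretrain/my_nsf_hifigan/
cp /path/to/finetuned/config.yaml pretrain/my_nsf_hifigan/
# 之后将 diffusion.yaml 中的 vocoder.ckpt 指向 pretrain/my_nsf_hifigan/model.ckpt
```

完成文件放置与配置修改后，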
按照 [3.2 webUI 推理](#32-webui-inference) 的步骤，上传扩散模型以及 **修改后的扩散模型配置文件**，即可使用微调后的语音编码器。\n\n> [!WARNING]\n>\n> **目前仅 NSF-HiFiGAN 语音编码器支持微调。**\n\n## 4.5 模型保存目录\n\n截至上一节，我们已经介绍了总共四种可训练的模型类型。下表总结了这四种模型及其对应的配置文件。\n\n在 webUI 中，除了上传和加载模型之外，你还可以直接读取本地模型文件。只需先将这些模型放入一个文件夹中，再将该文件夹放入 `trained` 文件夹内。点击“刷新本地模型列表”，webUI 就会识别出该模型。然后手动选择需要加载的模型进行加载。**注意**：对于下表中的（可选）模型，自动加载本地模型功能可能无法正常工作。\n\n| 文件                                          | 文件名及扩展名  | 存放位置             |\n| --------------------------------------------- | ----------------------- | -------------------- |\n| So-VITS 模型                                 | `G_xxxx.pth`            | `logs\u002F44k`           |\n| So-VITS 模型配置文件              | `config.json`           | `configs`            |\n| 扩散模型（可选）                    | `model_xxxx.pt`         | `logs\u002F44k\u002Fdiffusion` |\n| 扩散模型配置文件（可选）            | `diffusion.yaml`        | `configs`            |\n| K-means 聚类模型（可选）            | `kmeans_10000.pt`       | `logs\u002F44k`           |\n| 特征检索模型（可选）                | `feature_and_index.pkl` | `logs\u002F44k`           |\n\n# 5. 其他可选功能\n\n✨ 与前面几节相比，本部分的重要性较低。除 [5.1 模型压缩](#51-model-compression) 这一较为便捷的功能外，其他可选功能的实际使用频率相对较低。因此，此处仅提供官方文档的参考链接及简要说明。\n\n## 5.1 模型压缩\n\n生成的模型包含了继续训练所需的信息。如果你 **确定不再继续训练**，可以移除模型中的这部分信息，从而得到一个体积约为原模型三分之一的最终模型。\n\n使用 `compress_model.py` 脚本：\n\n```bash\n# 例如，若要压缩 logs\u002F44k\u002F 目录下的 G_30400.pth 模型，且配置文件为 configs\u002Fconfig.json，则可运行以下命令：\npython compress_model.py -c=\"configs\u002Fconfig.json\" -i=\"logs\u002F44k\u002FG_30400.pth\" -o=\"logs\u002F44k\u002Frelease.pth\"\n# 压缩后的模型将保存为 logs\u002F44k\u002Frelease.pth\n```\n\n> [!WARNING]\n>\n> **注意：压缩后的模型不可再用于继续训练！**\n\n## 5.2 音色混合\n\n### 5.2.1 静态音色混合\n\n**请参考 Tools\u002FExperimental Features 目录下的 `webUI.py` 文件中的静态音色混合功能。**\n\n该功能可以将多个音色模型合并为一个音色模型（通过对多个模型参数进行凸组合或线性组合），从而创造出现实中不存在的音色特征。\n\n**注意事项：**\n\n1. 该功能仅支持单说话人模型。\n2. 若强行使用多说话人模型，需确保各模型的说话人数量相同，以便将相同 SpeakerID 下的音色进行混合。\n3. 确保所有待混合模型的 config.json 文件中 `model` 字段一致。\n4. 输出的混合模型可以使用任一待混合模型的 config.json 文件，但聚类模型将无法使用。\n5. 批量上传模型时，建议将模型放入一个文件夹中一起上传。\n6. 建议将混合比例调整在 0 到 100 之间。虽然也可以设置其他数值，但在线性组合模式下可能会出现未知效果。\n7. 混合完成后，文件将以 output.pth 为文件名保存在项目根目录下。\n8. 凸组合模式会对混合比例执行 Softmax 处理，以保证各比例之和为 1；而线性组合模式则不会进行此操作。\n\n### 5.2.2 动态音色混合\n\n**请参考 `spkmix.py` 文件中关于动态音色混合的介绍。**\n\n角色轨道混合规则如下：\n\n- Speaker ID：\\[\\[开始时间 1, 结束时间 1, 开始值 1, 结束值 1], [开始时间 2, 结束时间 2, 开始值 2, 结束值 2]]\n- 每个时间段的开始时间必须与前一个时间段的结束时间相同，第一个开始时间必须为 0，最后一个结束时间必须为 1（时间范围为 0–1）。\n- 必须完整填写所有角色信息，未使用的角色可用 \\[\\[0., 1., 0., 0.]] 填充。\n- 融合值可任意填写，在指定的时间范围内，它会从开始值线性变化到结束值。系统内部会自动确保线性组合结果为 1（满足凸组合条件），因此你可以放心使用。\n\n在命令行推理时，使用 `--use_spk_mix` 参数启用动态音色混合功能。在 webUI 推理时，请勾选“动态音色混合”选项。\n\n## 5.3 ONNX 导出\n\n使用 `onnx_export.py` 脚本。目前，只有 [MoeVoiceStudio](https:\u002F\u002Fgithub.com\u002FNaruseMioShirakana\u002FMoeVoiceStudio) 需要使用 ONNX 模型。更多详细操作和使用方法，请参考 [MoeVoiceStudio](https:\u002F\u002Fgithub.com\u002FNaruseMioShirakana\u002FMoeVoiceStudio) 仓库的说明文档。\n\n- 创建一个新文件夹：`checkpoints` 并打开它。\n- 在 `checkpoints` 文件夹中，创建一个以项目命名的文件夹，例如 `aziplayer`。\n- 将您的模型重命名为 `model.pth`，配置文件重命名为 `config.json`，并将它们放入刚刚创建的 `aziplayer` 文件夹中。\n- 修改 `onnx_export.py` 中的 `path = \"NyaruTaffy\"`，将 `\"NyaruTaffy\"` 替换为您的项目名称，即 `path = \"aziplayer\"`（用于支持角色混合的 ONNX 导出）。\n- 运行 `python onnx_export.py`。\n- 等待执行完成。在您的项目文件夹中会生成一个 `model.onnx` 文件，这就是导出的模型。\n\n注意：Hubert 的 ONNX 模型请使用 [MoeVoiceStudio](https:\u002F\u002Fgithub.com\u002FNaruseMioShirakana\u002FMoeVoiceStudio) 提供的 ONNX 模型。目前无法独立导出（fairseq 中的 Hubert 包含许多 ONNX 不支持的算子，并且涉及常量，在导出时会导致输入输出形状及结果出现错误或问题）。\n\n# 6. 
简单混音与成品导出\n\n### 使用音频宿主软件处理推理后的音频\n\n关于如何使用音频宿主软件对推理后的音频进行处理和优化，请参考[相关视频教程](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Hr4y197Cy\u002F)或其他专业的混音教程。\n\n# 附录：常见错误及解决方案\n\n✨ **部分错误解决方案感谢 [羽毛布団](https:\u002F\u002Fspace.bilibili.com\u002F3493141443250876) 的[相关专栏](https:\u002F\u002Fwww.bilibili.com\u002Fread\u002Fcv22206231) | [相关文档](https:\u002F\u002Fwww.yuque.com\u002Fumoubuton\u002Fueupp5\u002Fieinf8qmpzswpsvr)**\n\n## 关于内存不足 (OOM)\n\n如果在终端或 WebUI 中遇到类似以下的错误：\n\n```bash\nOutOfMemoryError: CUDA 内存不足。尝试分配 XX GiB（GPU 0：总容量 XX GiB；已分配 XX GiB；空闲 XX MiB；PyTorch 总共保留了 XX GiB）\n```\n\n请不要怀疑，这确实是您的 GPU 显存或虚拟内存不足所致。以下步骤可以 100% 解决该问题，请按照这些步骤操作即可。请避免在各处重复提问，因为解决方案已经非常明确。\n\n1. 在错误信息中，查看 `XX GiB already allocated` 后是否显示 `0 bytes free`。如果显示 `0 bytes free`，请按照步骤 2、3、4 操作；如果显示 `XX MiB free` 或 `XX GiB free`，则按照步骤 5 操作。\n2. 如果内存不足发生在预处理阶段：\n   - 使用对 GPU 友好的 F0 预测器（友好程度从高到低依次为：pm >= harvest >= rmvpe ≈ fcpe >> crepe）。建议优先使用 rmvpe 或 fcpe。\n   - 将多进程预处理设置为 1。\n3. 如果内存不足发生在训练阶段：\n   - 检查数据集中是否存在过长的片段（超过 20 秒）。\n   - 减少批次大小。\n   - 使用资源需求较低的项目。\n   - 在 Google Colab 等平台上租用显存更大的 GPU 进行训练。\n4. 如果内存不足发生在推理阶段：\n   - 确保源音频（干声）干净（无残留混响、伴奏或和声），因为脏音频会影响自动切片。可参考 [UVR5 人声分离教程](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1F4421c7qU\u002F) 获取最佳实践。\n   - 提高切片阈值（例如从 -40 调整为 -30；但不要调得过高，否则可能导致音频被突然切断）。\n   - 设置强制切片，从 60 秒开始，每次减少 10 秒，直到推理成功为止。\n   - 使用 CPU 进行推理，虽然速度较慢，但不会出现内存不足的问题。\n5. 如果仍有可用内存，但仍然出现内存不足错误，则将虚拟内存增加至至少 50G。\n\n以上步骤应能有效管理和解决内存不足问题，确保在预处理、训练和推理过程中顺利运行。\n\n## 安装依赖时的常见错误及解决方案\n\n**1. 安装带有 CUDA=11.7 的 PyTorch 时出错**\n\n```\nERROR: 包 'networkx' 需要不同的 Python 版本：3.8.9 不在 '> =3.9' 范围内\n```\n\n**解决方案：**\n\n- **升级 Python 至 3.10：**\n- **（推荐）保持 Python 版本不变：** 先安装版本为 3.0 的 `networkx`，然后再安装 PyTorch。\n\n```bash\npip install networkx==3.0\n# 然后继续安装 PyTorch。\n```\n\n**2. 依赖包未找到**\n\n如果您遇到类似以下的错误：\n\n```bash\nERROR: 找不到满足 librosa==0.9.1 要求的版本（从现有版本中找不到）\nERROR: 没有找到匹配的 librosa==0.9.1 发行版\n# 错误特征：\n没有找到 xxxxx 的匹配发行版\n找不到满足 xxxx 要求的版本\n```\n\n**解决方案：** 更改安装源。手动安装依赖时添加下载源。\n\n使用命令 `pip install [包名] -i [源地址]`。例如，从阿里云源下载 `librosa` 0.9.1 版本，可以使用以下命令：\n\n```bash\npip install librosa==0.9.1 -i http:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple\n```\n\n**3. 某些依赖因 pip 版本过高而无法安装**\n\n2024 年 6 月 21 日，pip 更新至 24.1 版本。直接使用 `pip install --upgrade pip` 会将 pip 升级到 24.1 版本。然而，某些依赖需要安装 23.0 版本的 pip，因此需要手动降级 pip 版本。目前已知受此影响的有 hydra-core、omegaconf 和 fastapi。安装时遇到的具体错误如下：\n\n```bash\n如果您需要使用此版本，请使用 pip\u003C24.1。\nINFO: pip 正在检查 hydra-core 的多个版本，以确定哪个版本与其他要求兼容。这可能需要一些时间。\nERROR: 无法安装 -r requirements.txt（第 20 行）和 fairseq，因为这些包的版本存在依赖冲突。\n\n冲突原因如下：\n    fairseq 0.12.2 依赖 omegaconf\u003C2.1\n    hydra-core 1.0.7 依赖 omegaconf\u003C2.1 且 >=2.0.5\n\n要解决此问题，您可以尝试：\n1. 放宽您指定的包版本范围\n2. 删除包版本，让 pip 尝试解决依赖冲突\n```\n\n解决方案是在安装其他依赖之前限制 pip 版本，具体方法参见 [1.5 其他依赖的安装](#15-installation-of-other-dependencies)。使用以下命令限制 pip 版本：\n\n```bash\npip install --upgrade pip==23.3.2 wheel setuptools\n```\n\n执行完此命令后，再继续安装其他依赖。\n\n## 数据集预处理与模型训练中的常见错误\n\n**1. 错误：`UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position xx`**\n\n- 请确保数据集文件名中不包含非西方字符，例如中文或日文，尤其是中文标点符号，如括号、逗号、冒号、分号、引号等。重命名后，请**重新处理数据集**，然后再继续训练。\n\n**2. 错误：`The expand size of the tensor (768) must match the existing size (256) at non-singleton dimension 0.`**\n\n- 请删除 `dataset\u002F44k` 下的所有内容，并重新执行预处理步骤。\n\n**3. 
错误：`RuntimeError: DataLoader worker (pid(s) 13920) exited unexpectedly`**\n\n```bash\nraise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e\nRuntimeError: DataLoader worker (pid(s) 13920) exited unexpectedly\n```\n\n- 请减小 `batchsize` 值，增加虚拟内存，并重启电脑以释放 GPU 内存，直到 `batchsize` 和虚拟内存设置合理且不再报错为止。\n\n**4. 错误：`torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 3221225477`**\n\n- 请增加虚拟内存并减小 `batchsize` 值，直到 `batchsize` 和虚拟内存设置合理且不再报错为止。\n\n**5. 错误：`AssertionError: CPU training is not allowed.`**\n\n- **无解决方案：** 不支持在没有 NVIDIA GPU 的情况下进行训练。对于初学者来说，直接的答案就是无法在没有 NVIDIA GPU 的情况下进行训练。\n\n**6. 错误：`FileNotFoundError: No such file or directory: 'pretrain\u002Frmvpe.pt'`**\n\n- 如果您运行 `python preprocess_hubert_f0.py --f0_predictor rmvpe --use_diff` 时遇到 `FileNotFoundError: No such file or directory: 'pretrain\u002Frmvpe.pt'`，这是因为官方文档更新了用于 F0 处理的 rmvpe 预处理器。请参考教程文档 [#2.2.3](#223-optional-items-choose-as-needed)，下载预处理模型 `rmvpe.pt` 并将其放置到相应目录中。\n\n**7. 错误：“页面文件太小，无法完成操作。”**\n\n- **解决方案：** 增加虚拟内存。您可以根据自己的操作系统，在网上查找详细的设置方法。\n\n## 使用 WebUI 时的错误\\*\\*\n\n**1. 在 WebUI 中启动或加载模型时的错误**\n\n- **启动 WebUI 时的错误：** `ImportError: cannot import name 'Schema' from 'pydantic'`\n- **在 WebUI 中加载模型时的错误：** `AttributeError(\"'Dropdown' object has no attribute 'update'\")`\n- **与依赖项相关的错误：** 如果错误涉及 `fastapi`、`gradio` 或 `pydantic`。\n\n**解决方案：**\n\n- 某些依赖项需要特定版本。安装完 `requirements_win.txt` 后，请在命令提示符中输入以下命令来更新相关包：\n\n```bash\npip install --upgrade fastapi==0.84.0\npip install --upgrade gradio==3.41.2\npip install --upgrade pydantic==1.10.12\n```\n\n**2. 错误：`Given groups=1, weight of size [xxx, 256, xxx], expected input[xxx, 768, xxx] to have 256 channels, but got 768 channels instead`**\n\n或者 **错误：配置文件中编码器和模型维度不匹配**\n\n- **原因：** v1 分支的模型使用了 `vec768` 配置文件，反之亦然。\n- **解决方案：** 请检查您的配置文件中的 `ssl_dim` 设置。如果 `ssl_dim` 是 256，则 `speech_encoder` 应为 `vec256|9`；如果是 768，则应为 `vec768|12`。\n- 详细说明请参阅 [#2.1](#21-issues-regarding-compatibility-with-the-40-model)。\n\n**3. 错误：`'HParams' object has no attribute 'xxx'`**\n\n- **原因：** 通常表示找不到音色，且配置文件与模型不匹配。\n- **解决方案：** 打开配置文件，滚动到底部，检查是否包含了您所训练的音色。\n\n# 致谢\n\n我们衷心感谢以下个人和组织，正是他们的贡献和资源使本项目得以实现：\n\n- **so-vits-svc** | [so-vits-svc GitHub 仓库](https:\u002F\u002Fgithub.com\u002Fsvc-develop-team\u002Fso-vits-svc)\n- **GPT-SoVITS** | [GPT-SoVITS GitHub 仓库](https:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS)\n- **SingingVocoders** | [SingingVocoders GitHub 仓库](https:\u002F\u002Fgithub.com\u002Fopenvpi\u002FSingingVocoders)\n- **MoeVoiceStudio** | [MoeVoiceStudio GitHub 仓库](https:\u002F\u002Fgithub.com\u002FNaruseMioShirakana\u002FMoeVoiceStudio)\n- **OpenVPI** | [OpenVPI GitHub 组织](https:\u002F\u002Fgithub.com\u002Fopenvpi) | [Vocoders GitHub 仓库](https:\u002F\u002Fgithub.com\u002Fopenvpi\u002Fvocoders)\n- **UP 主 [infinite_loop]** | [Bilibili 个人主页](https:\u002F\u002Fspace.bilibili.com\u002F286311429) | [相关视频](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Bd4y1W7BN) | [相关专栏](https:\u002F\u002Fwww.bilibili.com\u002Fread\u002Fcv21425662)\n- **UP 主 [羽毛布団]** | [Bilibili 个人主页](https:\u002F\u002Fspace.bilibili.com\u002F3493141443250876) | [错误解决指南](https:\u002F\u002Fwww.bilibili.com\u002Fread\u002Fcv22206231) | [常见错误解决方案](https:\u002F\u002Fwww.yuque.com\u002Fumoubuton\u002Fueupp5\u002Fieinf8qmpzswpsvr)\n- **所有训练音频样本的贡献者**\n- **您** - 感谢您的关注、支持和贡献。","# SoftVC VITS 歌声转换 (so-vits-svc) 快速上手指南\n\n本指南基于 `so-vits-svc-Deployment-Documents` 整理，旨在帮助开发者快速在本地部署并进行歌声转换训练与推理。\n\n## 1. 
环境准备\n\n### 系统要求\n- **操作系统**: 推荐 Windows (本教程主要测试环境)，Linux\u002FMac 需具备一定排错能力。\n- **GPU**: 训练**必须**使用 NVIDIA 显卡，显存建议 **≥6GB**。推理可使用 CPU 或 GPU。\n- **内存与虚拟内存**: 物理内存充足，虚拟内存建议设置为 **30GB** 以上（最好设置在 SSD 上）。\n- **数据准备**:\n    - **训练集**: 至少 30 分钟干净人声（无背景噪音、无混响），音色统一，音域宽广。\n    - **推理源**: 干声（背景噪音\u003C30dB，无混响\u002F和声）。\n\n### 前置依赖\n请确保已安装以下基础软件：\n1. **Python**: 推荐版本需与 PyTorch 兼容（通常为 3.8 - 3.10）。\n2. **Git**: 用于克隆项目代码。\n3. **FFmpeg**: 用于音频处理，需添加到系统环境变量。\n4. **CUDA Toolkit**: 根据显卡驱动安装对应版本以支持 GPU 加速。\n\n> **提示**: 若不想手动配置环境，可寻找社区提供的“整合包”（如羽毛布団发布的版本）。\n\n## 2. 安装步骤\n\n### 2.1 获取源代码\n克隆官方仓库或本教程对应的仓库：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fsvc-develop-team\u002Fso-vits-svc.git\ncd so-vits-svc\n```\n\n### 2.2 创建虚拟环境并安装依赖\n建议使用 `conda` 创建独立环境：\n\n```bash\n# 创建名为 sovits 的 python 3.9 环境\nconda create -n sovits python=3.9 -y\nconda activate sovits\n\n# 安装 PyTorch (根据是否使用 CUDA 选择，以下为 CUDA 11.8 示例)\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n\n# 安装项目其他依赖\npip install -r requirements.txt\n```\n> **国内加速**: 若下载速度慢，可添加清华或阿里镜像源：\n> `pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n\n### 2.3 下载预训练模型\n训练前必须下载底模（Base Model）和编码器文件，放入项目根目录或指定文件夹（通常为 `pretrained` 或 `logs`，具体参考配置文件）：\n- **必需项**: 各类 Encoder 模型 (如 Hubert, ContentVec 等)。\n- **强推荐**: 预训练底模 (Pre-trained Base Model)，可显著加快收敛速度。\n\n*(具体模型下载地址请参考原项目 README 或 HuggingFace 仓库)*\n\n## 3. 基本使用流程\n\n### 3.1 数据预处理\n将准备好的干净人声音频放入 `data\u002Fraw` 目录（或其他自定义目录），然后执行以下步骤：\n\n1. **音频切片与重采样** (自动将音频切分为短片段并重采样至 44100Hz 单声道):\n   ```bash\n   python preprocess_slicing.py\n   python preprocess_resample.py\n   ```\n\n2. **生成配置文件与数据集划分**:\n   ```bash\n   python preprocess_config.py\n   ```\n   *可选*: 若需使用响度嵌入，运行相关脚本开启该功能。\n\n3. 
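（承接上一步的可选项）若需使用响度嵌入，主文档 2.4.2 节对应的做法是在数据集划分命令后追加 `--vol_aug` 参数（脚本名以你所用仓库中的实际文件为准）：

   ```bash
   python preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug
   ```

   之后再进行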
**提取特征 (Hubert & F0)**:\n   ```bash\n   python preprocess_hubert_f0.py\n   ```\n   *注意*: 此步骤耗时较长，取决于数据量和硬件性能。\n\n### 3.2 模型训练\n确认 `config.json` 中的参数（如批量大小 `batch_size`，需根据显存调整）无误后，开始训练主模型：\n\n```bash\npython train.py\n```\n- 训练过程中可使用 TensorBoard 监控损失曲线：\n  ```bash\n  tensorboard --logdir=logs\n  ```\n- *(可选)* 若需训练扩散模型 (Diffusion)，运行对应的 diffusion 训练脚本。\n\n### 3.3 推理 (Inference)\n训练完成后（通常在 `logs\u002F44k\u002F` 下生成 `.pth` 模型文件），可通过以下两种方式转换歌声：\n\n#### 方式 A: 命令行推理\n```bash\npython infer.py -m \u003C模型路径> -c \u003C配置文件路径> -i \u003C输入音频路径> -o \u003C输出路径> --spk \u003C说话人 ID>\n```\n*示例*:\n```bash\npython infer.py -m logs\u002F44k\u002FG_10000.pth -c configs\u002Fconfig.json -i input.wav -o output.wav --spk 0\n```\n\n#### 方式 B: WebUI 界面推理\n启动图形化界面，更适合调整参数和实时试听：\n```bash\npython webui.py\n```\n启动后在浏览器访问显示的地址（通常为 `http:\u002F\u002F127.0.0.1:7860`），上传模型、配置参数并上传源音频即可转换。\n\n---\n**法律声明**: 本工具仅供学习与研究使用。使用者须严格遵守《中华人民共和国民法典》等相关法律法规，不得侵犯他人肖像权、名誉权及声音权益。严禁用于非法用途，数据集来源需合法合规，侵权后果由使用者自行承担。","一位独立音乐人希望将自己录制的干声歌曲转换为特定歌手的音色，以制作高质量的 AI 翻唱作品，但缺乏深度学习环境配置经验。\n\n### 没有 so-vits-svc-Deployment-Documents 时\n- 面对复杂的 Python 依赖库和 PyTorch 版本冲突，手动配置本地环境屡屡失败，耗费数天仍无法运行代码。\n- 不清楚数据预处理的具体标准（如切片长度、重采样至 44100Hz 单声道），导致训练数据格式错误，模型无法收敛。\n- 缺少预训练模型下载指引和编码器详细说明，不得不盲目搜索资源，甚至下载到不兼容的旧版本文件。\n- 遇到报错时无处查证，官方文档过于简略，只能在外网论坛大海捞针寻找解决方案，学习曲线极其陡峭。\n- 因害怕操作失误损坏系统，不敢尝试本地部署，最终被迫放弃个性化音色定制的想法。\n\n### 使用 so-vits-svc-Deployment-Documents 后\n- 直接利用提供的 Colab 笔记本或按部就班的安装教程，快速解决环境依赖问题，半小时内即可启动项目。\n- 依据清晰的音频处理指南（含 FFmpeg 使用及自动切片说明），轻松完成标准化数据准备，确保训练顺利进行。\n- 通过文档中列出的必选与可选模型文件清单，精准下载预训练底模和编码器，避免了版本不匹配的陷阱。\n- 参考“常见错误解决方案”章节和详细的调试步骤，迅速定位并修复运行中的报错，大幅降低试错成本。\n- 在详尽的流程指引下成功完成本地部署与推理，顺利生成目标音色的翻唱作品，实现了创作自由。\n\nso-vits-svc-Deployment-Documents 将高门槛的算法部署转化为可执行的标准化流程，让非技术背景创作者也能轻松驾驭顶尖的歌声转换技术。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FSUC-DriverOld_so-vits-svc-Deployment-Documents_5d8e28b2.png","SUC-DriverOld","Wentao Wu","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FSUC-DriverOld_36b22c5f.jpg","My name is Wentao Wu, A student of Nanjing University of Posts and Telecommunications, also a member of @NJUPT-SAST @AI-Hobbyist .","Nanjing University of Posts and Telecommunications","China","1584846096@qq.com",null,"https:\u002F\u002Fwww.njupt.edu.cn\u002F","https:\u002F\u002Fgithub.com\u002FSUC-DriverOld",[83],{"name":84,"color":85,"percentage":86},"Jupyter Notebook","#DA5B0B",100,747,106,"2026-03-27T12:46:47","AGPL-3.0",4,"Windows, Linux (需自行解决), macOS (需自行解决)","训练必需 NVIDIA GPU，显存至少 6GB；推理可选 CPU 或 GPU。未明确说明具体 CUDA 版本。","虚拟内存至少 30GB（建议设置在 SSD 上），物理内存未明确说明最低\u002F推荐值。",{"notes":96,"python":97,"dependencies":98},"1. 训练必须使用 GPU，推理可使用 CPU 但速度较慢。2. 务必将系统虚拟内存设置为至少 30GB 且最好位于 SSD，否则速度极慢。3. 需提前下载预训练底模文件。4. 数据集要求：至少 30 分钟干净人声（无背景噪音、无混响），音域越宽越好。5. 文档主要在 Windows 上测试，Mac 和 Linux 用户需具备自行解决问题的能力。6. 建议使用 Google Colab 进行云端训练。","未说明 (文档目录提及 Python 但未列出具体版本号)",[99,100],"PyTorch","FFmpeg",[14],[103,104,105,106],"pytorch","vits","so-vits-svc","sovits","2026-03-27T02:49:30.150509","2026-04-20T21:04:35.384916",[110,115,120,125,130,135],{"id":111,"question_zh":112,"answer_zh":113,"source_url":114},45817,"训练模型时 GPU 占用率低但显存几乎全满，且训练速度极慢（如每个 epoch 耗时 700 秒）怎么办？","这通常是因为使用了共享显存导致训练速度显著降低。解决方法有两种：\n1. 更新显卡驱动到 546.01 及以上版本，并按照 NVIDIA 官方教程禁用共享内存设置（需选中正确的 python.exe 路径，通常在 Anaconda 环境目录下）；或者将驱动回退到 532 及以下版本。\n2. 
如果不更改驱动设置，可以尝试降低 batchsize 值（例如 8G 显存建议设置 batchsize=4）以防止占用共享显存。\n修改后建议重启计算机释放显存再重新训练。","https:\u002F\u002Fgithub.com\u002FSUC-DriverOld\u002Fso-vits-svc-Deployment-Documents\u002Fissues\u002F14",{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},45818,"模型训练到指定步数（如 60000 步或 10000 epoch）后自动停止，无法继续训练是什么原因？","这是因为达到了配置文件中设定的最大 epoch 限制（默认为 10000），程序会自动停止。一般达到此数值即可结束训练。如果确实需要继续训练，可以修改 config.json 文件中的 \"epochs\" 字段数值来扩大限制。","https:\u002F\u002Fgithub.com\u002FSUC-DriverOld\u002Fso-vits-svc-Deployment-Documents\u002Fissues\u002F18",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},45819,"So-VITS-SVC 是否提供文字转音频（TTS）的 API 接口供前端调用？","So-VITS-SVC 本身未提供文字转音频的 API 接口。其原理通常是先调用 edge-tts 生成语音，再进行音频转换，用户可根据此原理手动构建。\n如果需要直接使用带有情感且提供 API 的 TTS 项目，建议更换使用 GPT-SoVITS 项目，该项目对训练集要求更低且训练时长更短。","https:\u002F\u002Fgithub.com\u002FSUC-DriverOld\u002Fso-vits-svc-Deployment-Documents\u002Fissues\u002F20",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},45820,"使用高配显卡（如 RTX 4090）训练时 CPU 占用 100% 而显卡似乎未工作，是否在使用 CPU 训练？","不会。如果是强制 CPU 训练，程序会直接报错提示 \"cpu training is not allowed\"。正常训练过程中确实会占用一定的 CPU 性能。\n如果遇到类似情况且伴随报错，通常是因为系统内存不足，需要调大虚拟内存后重试。若需进一步排查，请提供具体的运行日志。","https:\u002F\u002Fgithub.com\u002FSUC-DriverOld\u002Fso-vits-svc-Deployment-Documents\u002Fissues\u002F13",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},45821,"老款显卡（如 GTX 660M）仅支持 CUDA 10.1 和 Torch 1.8.1，能否运行要求更高版本 PyTorch 的程序？","如果项目 requirements.txt 明确指定了较高的 PyTorch 版本（如 1.13.1），则通常无法直接在低版本环境下运行，因为新特性或算子可能不兼容。\n虽然推理（Inference）对显卡要求较低，但核心依赖库的版本不匹配会导致无法加载模型或运行报错。建议尝试在支持更高 CUDA 版本的机器上进行推理，或寻找专门适配旧版 PyTorch 的分支版本。","https:\u002F\u002Fgithub.com\u002FSUC-DriverOld\u002Fso-vits-svc-Deployment-Documents\u002Fissues\u002F5",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},45822,"使用 So-VITS-SVC 转换声音后用于训练 DiffSinger 声库，成品中是否需要声明音频来源？","这取决于底模的许可证协议。维护者发布的底模通常遵循 CC-BY-NC-SA 4.0 协议（署名 - 非商业性使用 - 相同方式共享）。如果是基于此类协议训练的模型或生成的音频用于衍生作品，通常需要遵守相应的署名和分享要求。建议查阅具体使用的底模仓库（如 HuggingFace 上的数据集）以确认其具体的 License 协议。","https:\u002F\u002Fgithub.com\u002FSUC-DriverOld\u002Fso-vits-svc-Deployment-Documents\u002Fissues\u002F19",[141,146,151,156],{"id":142,"version":143,"summary_zh":144,"released_at":145},360797,"v24.6.21","## 特性\n\n1. 增加不同编码器对应的预训练底模的下载链接，简化[#2.2.1](https:\u002F\u002Fgithub.com\u002FSUC-DriverOld\u002Fso-vits-svc-Deployment-Documents\u002Fblob\u002F4.1\u002FREADME_zh_CN.md#221-%E5%BF%85%E9%A1%BB%E9%A1%B9)部分的内容\n2. 增加`dataset_raw`文件夹内`config.json`文件编写说明\n3. 
更新多线程处理参数 #22 \n\n| 编码器类型                | 主模型底模                                                                                                                                                                                                                       | 扩散模型底模                                                                                                         | 说明                                                                                                                                                                                                                                                                                                                                                                   |\n| ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| vec768l12                 | [G_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fvec768l12\u002FG_0.pth), [D_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fvec768l12\u002FD_0.pth)                           | [model_0.pt](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fdiffusion\u002F768l12\u002Fmodel_0.pt)      | 若仅训练 100 步扩散，即`k_step_max = 100`，扩散模型请使用[model_0.pt](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fdiffusion\u002F768l12\u002Fmax100\u002Fmodel_0.pt)                                                                                                                                                                                        |\n| vec768l12（开启响度嵌入） | [G_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fvec768l12\u002Fvol_emb\u002FG_0.pth), [D_0.pth](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fvec768l12\u002Fvol_emb\u002FD_0.pth)           | [model_0.pt](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fdiffusion\u002F768l12\u002Fmodel_0.pt)      | 若仅训练 100 步扩散，即`k_step_max = 100`，扩散模型请使用[model_0.pt](https:\u002F\u002Fhuggingface.co\u002FSucial\u002Fso-vits-svc4.1-pretrain_model\u002Fblob\u002Fmain\u002Fdiffusion\u002F768l12\u002Fmax100\u002Fmodel_0.pt)                                                   ","2024-06-21T06:44:37",{"id":147,"version":148,"summary_zh":149,"released_at":150},360798,"v24.5.18","1. 完成英文文档的撰写  \n2. 增加常见报错及解决方法","2024-05-17T16:43:55",{"id":152,"version":153,"summary_zh":154,"released_at":155},360799,"v24.4.28","1. 修改了一些预训练底模下载链接  \n2. 修改了一些处理命令  \n3. 
完善了教程内容","2024-04-28T14:40:13",{"id":157,"version":158,"summary_zh":159,"released_at":160},360800,"v24.3.9","重构文档，加入细节和更多新功能说明","2024-03-09T06:03:41"]