[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-DigitalPhonetics--IMS-Toucan":3,"tool-DigitalPhonetics--IMS-Toucan":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",142651,2,"2026-04-06T23:34:12",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 
助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":76,"owner_website":78,"owner_url":79,"languages":80,"stars":85,"forks":86,"last_commit_at":87,"license":88,"difficulty_score":10,"env_os":89,"env_gpu":90,"env_ram":91,"env_deps":92,"category_tags":104,"github_topics":106,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":115,"updated_at":116,"faqs":117,"releases":146},4909,"DigitalPhonetics\u002FIMS-Toucan","IMS-Toucan","Controllable and fast Text-to-Speech for over 7000 languages!","IMS-Toucan 是一款由德国斯图加特大学自然语言处理研究所开发的开源文本转语音（TTS）工具包，旨在让高质量语音合成变得快速、可控且易于获取。它最引人注目的能力是支持全球超过 7000 种语言的语音生成，极大地降低了多语言语音合成的门槛，解决了传统模型往往只聚焦于少数主流语言、训练成本高昂且难以定制的问题。\n\n无论是希望研究前沿语音技术的学者、需要为应用集成多语言配音功能的开发者，还是对小众语言数字化感兴趣的语言保护工作者，IMS-Toucan 
都能提供强大的支持。其技术亮点在于采用了高效的架构设计，无需庞大的计算资源即可进行模型训练与推理，同时具备高度的可控性，允许用户精细调整语音特征。此外，项目不仅提供了完整的训练和使用流程，还开源了大规模多语言数据集和预训练模型，并托管了免费的在线演示实例，让用户无需本地配置即可立即体验。凭借友好的安装指引和活跃的社区支持，IMS-Toucan 正成为推动全球语音技术普惠化的重要力量。","\u003Cp align=\"right\">\n\u003Cimg alt=\"GitHub Repo stars\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDigitalPhonetics\u002FIMS-Toucan\">\n\u003Cimg alt=\"GitHub Repo Downloads\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fdownloads\u002FDigitalPhonetics\u002FIMS-Toucan\u002Ftotal\">\n\u003Cimg alt=\"GitHub Release\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002FDigitalPhonetics\u002FIMS-Toucan\">\n\u003Ca href=https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FFlux9665\u002FMassivelyMultilingualTTS>\u003Cimg alt=\"Demo Link\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDEMO-\u003CCOLOR>.svg\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n# Text-to-Speech for over 7000 Languages\n\nIMS Toucan is a toolkit for training, using, and teaching state-of-the-art Text-to-Speech Synthesis, developed at the\n**Institute for Natural Language Processing (IMS), University of Stuttgart, Germany**, official home of the massively\nmultilingual ToucanTTS system. Our system is fast, controllable, and doesn't require a ton of compute.\n\n\u003Cbr>\n\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDigitalPhonetics_IMS-Toucan_readme_26313b6bad1a.png)\n\n\u003Cbr>\n\nIf you find this repo useful, consider giving it a star. ⭐ Large numbers make me happy, and they are very motivating. If\nyou want to motivate me even more, you can even\nconsider [sponsoring this toolkit](https:\u002F\u002Fgithub.com\u002Fsponsors\u002FFlux9665). We only use GitHub Sponsors for this, there\nare scammers on other platforms that pretend to be the creator. Don't let them fool you. 
The code and the models are\nabsolutely free, and thanks to the generous support of Hugging Face🤗, we even have\nan [instance of the model running on GPU](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FFlux9665\u002FMassivelyMultilingualTTS) free for\nanyone to use.\n\n--- \n\u003Cbr>\n\n## Links 🦚\n\n### Interactive Demo\n\n[Check out our interactive massively-multi-lingual demo on Hugging Face🤗](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FFlux9665\u002FMassivelyMultilingualTTS)\n\n### Dataset\n\n[We have also published a massively multilingual TTS dataset on Hugging Face🤗](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FFlux9665\u002FBibleMMS)\n\n### Languages\n\n[A list of supported languages can be found here](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Fblob\u002FMassiveScaleToucan\u002FUtility\u002Flanguage_list.md)\n\n--- \n\u003Cbr>\n\n## Installation 🦉\n\n#### Basic Requirements\n\nPython 3.10 is the recommended version.\n\nTo install this toolkit, clone it onto the machine you want to use it on\n(should have at least one cuda enabled GPU if you intend to train models on that machine. For inference, you don't need\na GPU).\n\nIf you're using Linux, you should have the following packages installed, or install them with apt-get if you haven't (on\nmost distributions they come pre-installed):\n\n```\nlibsndfile1\nespeak-ng\nffmpeg\nlibasound-dev\nlibportaudio2\nlibsqlite3-dev\n```\n\nNavigate to the directory you have cloned. We recommend creating and activating a\n[virtual environment](https:\u002F\u002Fdocs.python.org\u002F3\u002Flibrary\u002Fvenv.html)\nto install the basic requirements into. The commands below summarize everything you need to do under Linux. 
If you are\nrunning Windows, the second line needs to be changed; please have a look at\nthe [venv documentation](https:\u002F\u002Fdocs.python.org\u002F3\u002Flibrary\u002Fvenv.html).\n\n```\npython -m venv \u003Cpath_to_where_you_want_your_env_to_be>\n\nsource \u003Cpath_to_where_you_want_your_env_to_be>\u002Fbin\u002Factivate\n\npip install --no-cache-dir -r requirements.txt\n```\n\nRun the second line every time you start using the tool again to reactivate the virtual environment, e.g. if you\nlogged out in the meantime. To make use of a GPU, you don't need to do anything else on a Linux machine. On a Windows\nmachine, have a look at [the official PyTorch website](https:\u002F\u002Fpytorch.org\u002F) for the install command that enables GPU\nsupport.\n\n#### Storage configuration\n\nIf you don't want the pretrained and trained models as well as the cache files resulting from preprocessing your\ndatasets to be stored in the default subfolders, you can set corresponding directories globally by\nediting `Utility\u002Fstorage_config.py` to suit your needs (the path can be relative to the repository root directory or\nabsolute).\n\n#### Pretrained Models\n\nYou don't need to use pretrained models, but they can speed things up tremendously. They will be downloaded on the fly\nautomatically when they are needed, thanks to Hugging Face🤗 and [VB](https:\u002F\u002Fgithub.com\u002FVaibhavs10) in particular.\n\n#### \\[optional] eSpeak-NG\n\neSpeak-NG is an optional requirement that handles lots of special cases in many languages, so it's good to have.\n\nOn most **Linux** environments it will be installed already, and if it is not, and you have sufficient rights, you\ncan install it by simply running\n\n```\napt-get install espeak-ng\n```\n\nFor **Windows**, they provide a convenient .msi installer file\n[on their GitHub release page](https:\u002F\u002Fgithub.com\u002Fespeak-ng\u002Fespeak-ng\u002Freleases). 
After installation on non-linux\nsystems, you'll also need to tell the phonemizer library where to find your espeak installation by setting the\n`PHONEMIZER_ESPEAK_LIBRARY` environment variable, which is discussed in\n[this issue](https:\u002F\u002Fgithub.com\u002Fbootphon\u002Fphonemizer\u002Fissues\u002F44#issuecomment-1008449718).\n\nFor **Mac** it's unfortunately a lot more complicated. Thanks to Sang Hyun Park, here is a guide for installing it on\nMac:\nFor M1 Macs, the most convenient method to install espeak-ng onto your system is via a\n[MacPorts port of espeak-ng](https:\u002F\u002Fports.macports.org\u002Fport\u002Fespeak-ng\u002F). MacPorts itself can be installed from the\n[MacPorts website](https:\u002F\u002Fwww.macports.org\u002Finstall.php), which also requires Apple's\n[XCode](https:\u002F\u002Fdeveloper.apple.com\u002Fxcode\u002F). Once XCode and MacPorts have been installed, you can install the port of\nespeak-ng via\n\n```\nsudo port install espeak-ng\n```\n\nAs stated in the Windows install instructions, the espeak-ng installation will need to be set as a variable for the\nphonemizer library. The environment variable is `PHONEMIZER_ESPEAK_LIBRARY` as given in the\n[GitHub thread](https:\u002F\u002Fgithub.com\u002Fbootphon\u002Fphonemizer\u002Fissues\u002F44#issuecomment-1008449718) linked above.\nHowever, the espeak-ng installation file you need to set this variable to is a .dylib file rather than a .dll file on\nMac. In order to locate the espeak-ng library file, you can run `port contents espeak-ng`. The specific file you are\nlooking for is named `libespeak-ng.dylib`.\n\n--- \n\u003Cbr>\n\n## Inference 🦢\n\nYou can load your trained models, or the pretrained provided one, using the `InferenceInterfaces\u002FToucanTTSInterface.py`.\nSimply create an object from it with the proper directory handle\nidentifying the model you want to use. The rest should work out in the background. 
You might want to set a language\nembedding or a speaker embedding using the *set_language* and *set_speaker_embedding* functions. Most things should be\nself-explanatory.\n\nAn *InferenceInterface* contains two methods to create audio from text. They are\n*read_to_file* and\n*read_aloud*.\n\n- *read_to_file* takes as input a list of strings and a filename. It will synthesize the sentences in the list,\n  concatenate them with a short pause in between, and write them to the filepath you supply as the other argument.\n\n- *read_aloud* takes just a string, which it will then convert to speech and immediately play using the system's\n  speakers. If you set the optional argument\n  *view* to\n  *True*, a visualization will pop up that you need to close for the program to continue.\n\nTheir use is demonstrated in\n*run_interactive_demo.py* and\n*run_text_to_file_reader.py*.\n\nThere are simple scaling parameters to control the duration, the variance of the pitch curve and the variance of the\nenergy curve. You can either change them in the code when using the interactive demo or the reader, or you can simply\npass them to the interface when you use it in your own code.\n\nTo change the language of the model and see which languages are available in our pretrained model,\n[have a look at the list linked here](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Fblob\u002Ffeb573ca630823974e6ced22591ab41cdfb93674\u002FUtility\u002Flanguage_list.md)\n\n--- \n\u003Cbr>\n\n## Creating a new Recipe (Training Pipeline) 🐣\n\nIn the directory called\n*Utility* there is a file called\n`path_to_transcript_dicts.py`. In this file you should write a function that returns a dictionary whose keys are the\nabsolute paths of the audio files in your dataset (as strings) and whose values are the textual transcriptions of the\ncorresponding audio files.\n\nThen go to the directory\n*TrainingInterfaces\u002FRecipes*. 
In there, make a copy of the `finetuning_example_simple.py` file if you just want to\nfinetune on a single dataset or `finetuning_example_multilingual.py` if you want to finetune on multiple datasets,\npotentially even multiple languages. We will use this copy\nas reference and only make the necessary changes to use the new dataset. Find the call(s) to the *prepare_tts_corpus*\nfunction. Replace the path_to_transcript_dict used there with the one(s) you just created. Then change the name of the\ncorresponding cache directory to something that makes sense for the dataset.\nAlso look out for the variable *save_dir*, which is where the checkpoints will be saved to. This is a default value, you\ncan overwrite it when calling\nthe pipeline later using a command line argument, in case you want to fine-tune from a checkpoint and thus save into a\ndifferent directory. Finally, change the\n*lang* argument in the creation of the dataset and in the call to the train loop function to the ISO 639-3 language ID\nthat\nmatches your data.\n\nThe arguments that are given to the train loop in the finetuning examples are meant for the case of finetuning from a\npretrained model. If you want\nto train from scratch, have a look at a different pipeline that has ToucanTTS in its name and look at the arguments\nused there.\n\nOnce this is complete, we are almost done, now we just need to make it available to the\n`run_training_pipeline.py` file in the top level. In said file, import the\n*run* function from the pipeline you just created and give it a meaningful name. 
Now in the\n*pipeline_dict*, add your imported function as value and use as key a shorthand that makes sense.\n\n--- \n\u003Cbr>\n\n## Training a Model 🦜\n\nOnce you have a recipe built, training is super easy:\n\n```\npython run_training_pipeline.py \u003Cshorthand of the pipeline>\n```\n\nYou can supply any of the following arguments, but don't have to (although for training you should definitely specify at\nleast a GPU ID).\n\n```\n--gpu_id \u003CID of the GPU you wish to use, as displayed with nvidia-smi, default is cpu. If multiple GPUs are provided (comma separated), then distributed training will be used, but the script has to be started with torchrun.> \n\n--resume_checkpoint \u003Cpath to a checkpoint to load>\n\n--resume (if this is present, the furthest checkpoint available will be loaded automatically)\n\n--finetune (if this is present, the provided checkpoint will be fine-tuned on the data from this pipeline)\n\n--model_save_dir \u003Cpath to a directory where the checkpoints should be saved>\n\n--wandb (if this is present, the logs will be synchronized to your weights&biases account, if you are logged in on the command line)\n\n--wandb_resume_id \u003Cthe id of the run you want to resume, if you are using weights&biases (you can find the id in the URL of the run)>\n```\n\nFor multi-GPU training, you have to supply multiple GPU ids (comma separated) and start the script with torchrun. You\nalso have to specify the number of GPUs. This has to match the number of IDs that you supply. Careful: torchrun is\nincompatible with nohup! Use tmux instead to keep the script running after you log out of the shell.\n\n```\ntorchrun --standalone --nproc_per_node=4 --nnodes=1 run_training_pipeline.py \u003Cshorthand of the pipeline> --gpu_id \"0,1,2,3\"\n```\n\nAfter every epoch (or alternatively after certain step counts), some logs will be written to the console and to the\nWeights and Biases website, if you are logged in and set the flag. 
If you get CUDA out-of-memory errors, you need to\ndecrease\nthe batch size in the arguments of the call to the training_loop in the pipeline you are running. Try decreasing the\nbatch size in small steps until the CUDA out-of-memory errors no longer occur.\n\nIn the directory you specified for saving, checkpoint files and spectrogram visualization\ndata will appear. Since the checkpoints are quite big, only the five most recent ones will be kept. The number of\ntraining steps highly depends on the data you are using and whether you're finetuning from a pretrained checkpoint or\ntraining from scratch. The less data you have, the fewer steps you should take to prevent a possible collapse. If\nyou want to stop earlier, just kill the process; since everything is daemonic, all the child processes should die with\nit. In case there are some ghost processes left behind, you can use the following command to find them and kill them\nmanually.\n\n```\nfuser -v \u002Fdev\u002Fnvidia*\n```\n\nWhenever a checkpoint is saved, a compressed version that can be used for inference is also created, which is named\n_best.py_\n\n--- \n\u003Cbr>\n\n## FAQ 🐓\n\nHere are a few points that were brought up by users:\n\n- How can I figure out if my data has outliers or similar problems? -- There is a scorer that can find and even remove\n  samples from your dataset cache that have extraordinarily high loss values; have a look at `run_scorer.py`.\n- My error message shows GPU0, even though I specified a different GPU -- The way GPU selection works is that the\n  specified GPU is set as the only visible device, in order to avoid backend stuff running accidentally on different\n  GPUs. So internally the program will name the device GPU0, because it is the only GPU it can see. It is actually\n  running on the GPU you specified.\n- read_to_file produces strange outputs -- Check if you're passing a list to the method or a string. 
Since strings can\n  be\n  iterated over, it might not throw an error, but a list of strings is expected.\n- `UserWarning: Detected call of lr_scheduler.step() before optimizer.step().` -- We use a custom scheduler, and torch\n  incorrectly thinks that we call the scheduler and the optimizer in the wrong order. Just ignore this warning, it is\n  completely meaningless.\n- `WARNING[XFORMERS]: xFormers can't load C++\u002FCUDA extensions. [...]` -- Another meaningless warning. We actually don't\n  use xFormers ourselves, it is just part of the dependencies of one of our dependencies, but it is not used at any\n  place.\n- `The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows. [...]` -- Just\n  happens under Windows and doesn't affect anything.\n- `WARNING:phonemizer:words count mismatch on 200.0% of the lines (2\u002F1) [...]` -- We have no idea why espeak started\n  giving out this warning, however it doesn't seem to affect anything, so it seems safe to ignore.\n- Loss turns to `NaN` -- The default learning rates work on clean data. If your data is less clean, try using the scorer\n  to find problematic samples, or reduce the learning rate. The most common problem is there being pauses in the speech,\n  but nothing that hints at them in the text. 
That's why ASR corpora, which leave out punctuation, are usually difficult\n  to use for TTS.\n\n--- \n\u003Cbr>\n\n## Acknowledgements 🦆\n\nThe basic PyTorch modules of FastSpeech 2 and GST are taken from\n[ESPnet](https:\u002F\u002Fgithub.com\u002Fespnet\u002Fespnet), the PyTorch modules of\nHiFi-GAN are taken from the [ParallelWaveGAN repository](https:\u002F\u002Fgithub.com\u002Fkan-bayashi\u002FParallelWaveGAN).\nSome modules related to the ConditionalFlowMatching based PostNet as outlined in MatchaTTS are taken\nfrom the [official MatchaTTS codebase](https:\u002F\u002Fgithub.com\u002Fshivammehta25\u002FMatcha-TTS) and some are taken\nfrom [the StableTTS codebase](https:\u002F\u002Fgithub.com\u002FKdaiP\u002FStableTTS).\nFor grapheme-to-phoneme conversion, we rely on the aforementioned eSpeak-NG as\nwell as [transphone](https:\u002F\u002Fgithub.com\u002Fxinjli\u002Ftransphone). We\nuse [encodec, a neural audio codec](https:\u002F\u002Fgithub.com\u002Fyangdongchao\u002FAcademiCodec) as intermediate representation\nfor caching the train data to save space.\n\n## Citation 🐧\n\n\u003Ca href=\"https:\u002F\u002Fstar-history.com\u002F#DigitalPhonetics\u002FIMS-Toucan&Date\">\n \u003Cpicture>\n   \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDigitalPhonetics_IMS-Toucan_readme_70f8732d0159.png&theme=dark\" \u002F>\n   \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDigitalPhonetics_IMS-Toucan_readme_70f8732d0159.png\" \u002F>\n   \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDigitalPhonetics_IMS-Toucan_readme_70f8732d0159.png\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>\n\n### Introduction of the Toolkit [[associated code and models]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv1.0)\n\n```\n@inproceedings{lux2021toucan,\n  year  
       = 2021,\n  title        = {{The IMS Toucan system for the Blizzard Challenge 2021}},\n  author       = {Florian Lux and Julia Koch and Antje Schweitzer and Ngoc Thang Vu},\n  booktitle    = {Blizzard Challenge Workshop},\n  publisher    = {ISCA Speech Synthesis SIG}\n}\n```\n\n### Adding Articulatory Features and Meta-Learning Pretraining [[associated code and models]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv1.1)\n\n```\n@inproceedings{lux2022laml,\n  year         = 2022,\n  title        = {{Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features}},\n  author       = {Florian Lux and Ngoc Thang Vu},\n  booktitle    = {ACL}\n}\n```\n\n### Adding Exact Prosody-Cloning Capabilities [[associated code and models]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv2.2)\n\n```\n@inproceedings{lux2022cloning,\n  year         = 2022,\n  title        = {{Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech}},\n  author       = {Lux, Florian and Koch, Julia and Vu, Ngoc Thang},\n  booktitle    = {SLT},\n  publisher    = {IEEE}\n}\n```\n\n### Adding Language Embeddings and Word Boundaries [[associated code and models]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv2.2)\n\n```\n@inproceedings{lux2022lrms,\n  year         = 2022,\n  title        = {{Low-Resource Multilingual and Zero-Shot Multispeaker TTS}},\n  author       = {Florian Lux and Julia Koch and Ngoc Thang Vu},\n  booktitle    = {AACL}\n}\n```\n\n### Adding Controllable Speaker Embedding Generation [[associated code and models]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv2.3)\n\n```\n@inproceedings{lux2023controllable,\n  year         = 2023,\n  title        = {{Low-Resource Multilingual and Zero-Shot Multispeaker TTS}},\n  author       = {Florian Lux and Pascal 
Tilli and Sarina Meyer and Ngoc Thang Vu},\n  booktitle    = {Interspeech},\n  publisher    = {ISCA}\n}\n```\n\n### Our Contribution to the Blizzard Challenge 2023 [[associated code and models]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv2.b)\n\n```\n@inproceedings{lux2023blizzard,\n  year         = 2023,\n  title        = {{The IMS Toucan System for the Blizzard Challenge 2023}},\n  author       = {Florian Lux and Julia Koch and Sarina Meyer and Thomas Bott and Nadja Schauffler and Pavel Denisov and Antje Schweitzer and Ngoc Thang Vu},\n  booktitle    = {Blizzard Challenge Workshop},\n  publisher    = {ISCA Speech Synthesis SIG}\n}\n```\n\n### Introducing the first TTS System in over 7000 languages [[associated code and models]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv3.0)\n\n```\n@inproceedings{lux2024massive,\n  year         = 2024,\n  title        = {{Meta Learning Text-to-Speech Synthesis in over 7000 Languages}},\n  author       = {Florian Lux and Sarina Meyer and Lyonel Behringer and Frank Zalkow and Phat Do and Matt Coler and Emanuël A. P. 
Habets and Ngoc Thang Vu},\n  booktitle    = {Interspeech},\n  publisher    = {ISCA}\n}\n```\n\n### Introducing Text based in-context-Prompting to NAR TTS [[associated code and models]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002F2.p)\n\n```\n@inproceedings{bott2024prompting,\n  year         = 2024,\n  title        = {{Controlling Emotion in Text-to-Speech with Natural Language Prompts}},\n  author       = {Thomas Bott and Florian Lux and Ngoc Thang Vu},\n  booktitle    = {Interspeech},\n  publisher    = {ISCA}\n}\n```\n\n### Investigating Stochastic Prosody Modeling [[associated code and models]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Ftree\u002FStochasticProsodyModeling)\n\n```\n@inproceedings{mayer2025stochastic,\n  year         = 2025,\n  title        = {{Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis}},\n  author       = {Paul Mayer and Florian Lux and Alejandro P\\'erez-Gonz\\'alez-de-Martos and Angelina Elizarova and Lindsey Vanderlyn and Dirk V\\\"ath and Ngoc Thang Vu},\n  booktitle    = {Interspeech},\n  publisher    = {ISCA}\n}\n```\n","\u003Cp align=\"right\">\n\u003Cimg alt=\"GitHub 仓库星标数\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDigitalPhonetics\u002FIMS-Toucan\">\n\u003Cimg alt=\"GitHub 仓库下载量\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fdownloads\u002FDigitalPhonetics\u002FIMS-Toucan\u002Ftotal\">\n\u003Cimg alt=\"GitHub 发布版本\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002FDigitalPhonetics\u002FIMS-Toucan\">\n\u003Ca href=https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FFlux9665\u002FMassivelyMultilingualTTS>\u003Cimg alt=\"演示链接\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDEMO-\u003CCOLOR>.svg\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n# 面向7000多种语言的文本转语音\n\nIMS Toucan 是一套用于训练、使用和教学最先进文本转语音合成技术的工具包，由德国斯图加特大学自然语言处理研究所（IMS）开发，也是大规模多语言 ToucanTTS 
系统的官方项目所在地。我们的系统速度快、可控性强，且无需大量计算资源。\n\n\u003Cbr>\n\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDigitalPhonetics_IMS-Toucan_readme_26313b6bad1a.png)\n\n\u003Cbr>\n\n如果您觉得这个仓库很有用，请考虑给它点个赞。⭐ 大数字让我很开心，也能给我很大的动力。如果您想进一步支持我，还可以考虑[赞助这个工具包](https:\u002F\u002Fgithub.com\u002Fsponsors\u002FFlux9665)。我们只通过 GitHub Sponsors 接受赞助，其他平台上有很多冒充开发者的人在行骗，请不要上当。代码和模型都是完全免费的，并且得益于 Hugging Face🤗 的慷慨支持，我们甚至提供了一个[基于 GPU 运行的模型实例](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FFlux9665\u002FMassivelyMultilingualTTS)，任何人都可以免费使用。\n\n--- \n\u003Cbr>\n\n## 链接 🦚\n\n### 交互式演示\n\n[请查看我们在 Hugging Face🤗 上提供的大规模多语言交互式演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FFlux9665\u002FMassivelyMultilingualTTS)\n\n### 数据集\n\n[我们还在 Hugging Face🤗 上发布了一个大规模多语言 TTS 数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FFlux9665\u002FBibleMMS)\n\n### 支持的语言\n\n[支持的语言列表请见这里](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Fblob\u002FMassiveScaleToucan\u002FUtility\u002Flanguage_list.md)\n\n--- \n\u003Cbr>\n\n## 安装 🦉\n\n#### 基本要求\n\n推荐使用 Python 3.10 版本。\n\n要安装此工具包，请将其克隆到您打算使用的机器上（如果计划在该机器上训练模型，至少需要一块支持 CUDA 的 GPU；如果是进行推理，则不需要 GPU）。\n\n如果您使用的是 Linux 系统，应确保已安装以下软件包，或者通过 apt-get 进行安装（大多数发行版默认已预装）：\n\n```\nlibsndfile1\nespeak-ng\nffmpeg\nlibasound-dev\nlibportaudio2\nlibsqlite3-dev\n```\n\n进入您克隆的目录。建议创建并激活一个[虚拟环境](https:\u002F\u002Fdocs.python.org\u002F3\u002Flibrary\u002Fvenv.html)，以便将基本依赖项安装到其中。以下命令总结了在 Linux 系统下需要执行的所有步骤。如果您使用的是 Windows 系统，则第二行需要修改，请参阅[venv 文档](https:\u002F\u002Fdocs.python.org\u002F3\u002Flibrary\u002Fvenv.html)。\n\n```\npython -m venv \u003C虚拟环境路径>\n\nsource \u003C虚拟环境路径>\u002Fbin\u002Factivate\n\npip install --no-cache-dir -r requirements.txt\n```\n\n每次重新使用该工具时，都需要运行第二条命令以再次激活虚拟环境，例如在您退出后重新登录时。要在 Linux 机器上利用 GPU，无需额外操作。而在 Windows 机器上，请参考[PyTorch 官方网站](https:\u002F\u002Fpytorch.org\u002F)获取启用 GPU 支持的安装命令。\n\n#### 存储配置\n\n如果您不希望预训练模型、已训练模型以及数据预处理过程中生成的缓存文件存储在默认子文件夹中，可以通过编辑 `Utility\u002Fstorage_config.py` 
文件来全局设置相应的目录，以满足您的需求（路径可以是相对于仓库根目录的相对路径，也可以是绝对路径）。\n\n#### 预训练模型\n\n您可以不使用预训练模型，但它们能极大地加快流程。借助 Hugging Face🤗 和尤其是 [VB](https:\u002F\u002Fgithub.com\u002FVaibhavs10)，这些模型会在需要时自动在线下载。\n\n#### [可选] eSpeak-NG\n\neSpeak-NG 是一项可选依赖项，能够处理许多语言中的特殊情况，因此最好安装它。\n\n在大多数 **Linux** 环境中，它通常已经预装；如果没有，且您拥有足够的权限，只需运行以下命令即可安装：\n\n```\napt-get install espeak-ng\n```\n\n对于 **Windows**，他们在其 [GitHub 发布页面](https:\u002F\u002Fgithub.com\u002Fespeak-ng\u002Fespeak-ng\u002Freleases)上提供了便捷的 .msi 安装程序。在非 Linux 系统上安装完成后，还需要通过设置 `PHONEMIZER_ESPEAK_LIBRARY` 环境变量来告知 phonemizer 库 espeak 的安装位置，相关内容可在[此问题讨论](https:\u002F\u002Fgithub.com\u002Fbootphon\u002Fphonemizer\u002Fissues\u002F44#issuecomment-1008449718)中找到。\n\n对于 **Mac** 系统，情况则复杂得多。感谢 Sang Hyun Park 提供的 Mac 安装指南：对于 M1 Mac，最方便的安装方法是通过 [MacPorts 中的 espeak-ng 软件包](https:\u002F\u002Fports.macports.org\u002Fport\u002Fespeak-ng\u002F)。MacPorts 可以从 [MacPorts 官网](https:\u002F\u002Fwww.macports.org\u002Finstall.php)安装，而安装 MacPorts 本身又需要 Apple 的 [XCode](https:\u002F\u002Fdeveloper.apple.com\u002Fxcode\u002F)。在 XCode 和 MacPorts 安装完成后，您可以通过以下命令安装 espeak-ng 软件包：\n\n```\nsudo port install espeak-ng\n```\n\n正如 Windows 安装说明中所述，espeak-ng 的安装路径需要作为变量设置到 phonemizer 库中。该环境变量为 `PHONEMIZER_ESPEAK_LIBRARY`，具体信息可在上述[GitHub 讨论帖](https:\u002F\u002Fgithub.com\u002Fbootphon\u002Fphonemizer\u002Fissues\u002F44#issuecomment-1008449718)中找到。不过，在 Mac 系统上，您需要设置的 espeak-ng 库文件是一个 .dylib 文件，而不是 .dll 文件。要找到 espeak-ng 库文件，可以运行 `port contents espeak-ng` 命令，您需要查找的特定文件名为 `libespeak-ng.dylib`。\n\n--- \n\u003Cbr>\n\n## 推理 🦢\n\n你可以使用 `InferenceInterfaces\u002FToucanTTSInterface.py` 加载你训练好的模型，或者我们提供的预训练模型。只需通过正确的模型目录句柄创建一个对象，即可指定你要使用的模型。其余部分会在后台自动完成。你可能需要使用 `set_language` 和 `set_speaker_embedding` 函数来设置语言嵌入或说话人嵌入。大多数功能都相当直观易懂。\n\n`InferenceInterface` 包含两个从文本生成音频的方法：`read_to_file` 和 `read_aloud`。\n\n- `read_to_file` 接受一个字符串列表和一个文件名作为输入。它会将列表中的句子合成语音，在每句之间插入短暂的停顿，然后将结果写入你提供的文件路径。\n  \n- `read_aloud` 只接受一个字符串，将其转换为语音并立即通过系统的扬声器播放。如果你将可选参数 `view` 设置为 `True`, 
会弹出一个可视化窗口，你需要关闭该窗口程序才会继续运行。\n\n这两个方法的用法在 `run_interactive_demo.py` 和 `run_text_to_file_reader.py` 中有演示。\n\n我们提供了一些简单的缩放参数，用于控制持续时间、音高曲线的方差以及能量曲线的方差。你可以在使用交互式演示或阅读器时直接修改代码中的这些参数，也可以在自己的代码中调用接口时直接传递这些参数。\n\n要更改模型的语言并查看我们预训练模型支持的语言，请参阅[此处链接的语言列表](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Fblob\u002Ffeb573ca630823974e6ced22591ab41cdfb93674\u002FUtility\u002Flanguage_list.md)。\n\n---\n\n## 创建新的配方（训练流程）🐣\n\n在名为 `Utility` 的目录中，有一个文件叫做 `path_to_transcript_dicts.py`。在这个文件中，你需要编写一个函数，返回一个字典，其中键是数据集中每个音频文件的绝对路径（以字符串形式表示），值则是对应音频的文本转录。\n\n接下来，进入 `TrainingInterfaces\u002FRecipes` 目录。如果你只想在一个数据集上进行微调，可以复制 `finetuning_example_simple.py` 文件；如果你想在多个数据集上进行微调，甚至可能是多语言的数据集，则复制 `finetuning_example_multilingual.py`。我们将以此副本作为参考，只做必要的修改以适应新的数据集。找到对 `prepare_tts_corpus` 函数的调用，将其中使用的 `path_to_transcript_dict` 替换为你刚刚创建的那个（或那些）。然后将对应的缓存目录名称改为更符合该数据集的名字。\n\n此外，注意变量 `save_dir`，这是保存检查点的目录。这是一个默认值，你可以在稍后通过命令行参数覆盖它，以便从某个检查点继续微调，并将结果保存到不同的目录中。最后，将数据集创建和训练循环函数调用中的 `lang` 参数改为与你的数据相匹配的 ISO 639-3 语言代码。\n\n微调示例中传给训练循环的参数适用于从预训练模型开始微调的情况。如果你想从头开始训练，可以查看其他包含 ToucanTTS 名称的流程，并参考其中使用的参数。\n\n完成这些步骤后，我们就差不多准备好了。现在只需要让顶层的 `run_training_pipeline.py` 文件能够访问这个新配方。在该文件中，从你刚刚创建的流程中导入 `run` 函数，并为其取一个有意义的名字。然后在 `pipeline_dict` 中，将你导入的函数作为值添加进去，键则使用一个简明易懂的缩写。\n\n---\n\n## 训练模型 🦜\n\n一旦你构建好了一个配方，训练就非常简单了：\n\n```\npython run_training_pipeline.py \u003C流水线的缩写>\n```\n\n你可以提供以下任意参数，但并非必须（不过对于训练来说，至少应该指定一个 GPU ID）。\n\n```\n--gpu_id \u003C你希望使用的 GPU 编号，可通过 nvidia-smi 查看，默认为 CPU。如果提供了多个 GPU 编号（用逗号分隔），则会启用分布式训练，但必须使用 torchrun 启动脚本。>\n\n--resume_checkpoint \u003C要加载的检查点路径>\n\n--resume （如果存在此选项，将自动加载最新的可用检查点）\n\n--finetune （如果存在此选项，将基于该流水线的数据对提供的检查点进行微调）\n\n--model_save_dir \u003C保存检查点的目录路径>\n\n--wandb （如果存在此选项，日志将同步到你的 Weights & Biases 账户，前提是已在命令行登录）\n\n--wandb_resume_id \u003C要恢复的运行的 ID，如果你正在使用 Weights & Biases（可在运行的 URL 中找到该 ID）>\n```\n\n对于多 GPU 训练，你需要提供多个 GPU 编号（用逗号分隔），并使用 torchrun 启动脚本。同时还需要指定 GPU 的数量，这必须与你提供的 GPU 编号数量一致。请注意：torchrun 与 nohup 不兼容！请改用 tmux，以确保你在退出终端后脚本仍能继续运行。\n\n```\ntorchrun 
--standalone --nproc_per_node=4 --nnodes=1 run_training_pipeline.py \u003C流水线的缩写> --gpu_id \"0,1,2,3\"\n```\n\n每完成一个 epoch（或按照特定的步数），一些日志会被写入控制台和 Weights & Biases 网站（如果你已登录并设置了相应标志）。如果出现 CUDA 内存不足的错误，你需要在当前流水线的 training_loop 调用参数中降低批次大小。尝试逐步减小批次大小，直到不再出现 CUDA 内存不足的错误为止。\n\n在你指定的保存目录中，将会出现检查点文件和频谱图可视化数据。由于检查点文件较大，系统只会保留最近的五个。训练步数高度依赖于你所使用的数据，以及你是从预训练检查点开始微调还是从零开始训练。数据越少，你应该采取的步数就越少，以避免模型过拟合或崩溃。如果你想提前停止训练，可以直接终止进程，因为所有子进程都是守护进程，主进程结束时它们也会自动退出。如果仍有残留的僵尸进程，可以使用以下命令查找并手动杀死它们：\n\n```\nfuser -v \u002Fdev\u002Fnvidia*\n```\n\n每当保存一个检查点时，还会生成一个可用于推理的压缩版本，文件名为 `best.pt`。\n\n---\n\n## 常见问题解答 🐓\n\n以下是用户提出的一些问题：\n\n- 我如何判断自己的数据是否存在异常值或其他类似问题？——有一个评分器可以检测并移除数据集中损失值异常高的样本，可以查看 `run_scorer.py`。\n- 我的错误信息显示 GPU0，尽管我指定了不同的 GPU —— GPU 选择的工作方式是将指定的 GPU 设置为唯一可见的设备，以避免后端程序意外地在不同 GPU 上运行。因此，在程序内部，该设备会被命名为 GPU0，因为它就是程序唯一能看到的 GPU。实际上，程序是在你指定的 GPU 上运行的。\n- `read_to_file` 会产生奇怪的输出 —— 请检查你传给该方法的是列表还是字符串。由于字符串是可以被迭代的，可能不会抛出错误，但该方法期望的是字符串列表。\n- `UserWarning: Detected call of lr_scheduler.step() before optimizer.step().` —— 我们使用了一个自定义的学习率调度器，而 PyTorch 错误地认为我们调用了调度器和优化器的顺序不对。请忽略这个警告，它完全没有意义。\n- `WARNING[XFORMERS]: xFormers can't load C++\u002FCUDA extensions. [...]` —— 这又是一个无意义的警告。我们实际上并没有使用 xFormers，它只是我们某个依赖项的依赖之一，但在任何地方都没有被使用。\n- `The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows. [...]` —— 这只会在 Windows 系统下出现，并不会对系统产生任何影响。\n- `WARNING:phonemizer:words count mismatch on 200.0% of the lines (2\u002F1) [...]` —— 我们不清楚 espeak 为何会发出这个警告，不过它似乎并不会影响任何功能，因此可以安全地忽略。\n- 损失变为 `NaN` —— 默认的学习率适用于干净的数据。如果你的数据不够干净，可以尝试使用评分器来查找有问题的样本，或者降低学习率。最常见的问题是语音中存在停顿，但文本中却没有相应的提示。这就是为什么通常会省略标点符号的 ASR 语料库很难用于 TTS 的原因。\n\n---\n\u003Cbr>\n\n## 致谢 🦆\n\nFastSpeech 2 和 GST 的基础 PyTorch 模块来自 [ESPnet](https:\u002F\u002Fgithub.com\u002Fespnet\u002Fespnet)，HiFi-GAN 的 PyTorch 模块则来自 [ParallelWaveGAN 仓库](https:\u002F\u002Fgithub.com\u002Fkan-bayashi\u002FParallelWaveGAN)。与 MatchaTTS 中描述的基于 Conditional Flow Matching 的 PostNet 相关的一些模块取自 [MatchaTTS 官方代码库](https:\u002F\u002Fgithub.com\u002Fshivammehta25\u002FMatcha-TTS)，另一些则来自 [StableTTS 
代码库](https:\u002F\u002Fgithub.com\u002FKdaiP\u002FStableTTS)。对于字素到音素的转换，我们依赖于前面提到的 eSpeak-NG 以及 [transphone](https:\u002F\u002Fgithub.com\u002Fxinjli\u002Ftransphone)。我们使用 [encodec，一种神经音频编解码器](https:\u002F\u002Fgithub.com\u002Fyangdongchao\u002FAcademiCodec)，作为训练数据缓存的中间表示形式，以节省存储空间。\n\n## 引用 🐧\n\n\u003Ca href=\"https:\u002F\u002Fstar-history.com\u002F#DigitalPhonetics\u002FIMS-Toucan&Date\">\n \u003Cpicture>\n   \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDigitalPhonetics_IMS-Toucan_readme_70f8732d0159.png&theme=dark\" \u002F>\n   \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDigitalPhonetics_IMS-Toucan_readme_70f8732d0159.png\" \u002F>\n   \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDigitalPhonetics_IMS-Toucan_readme_70f8732d0159.png\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>\n\n### 工具包介绍 [[相关代码和模型]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv1.0)\n\n```\n@inproceedings{lux2021toucan,\n  year         = 2021,\n  title        = {{The IMS Toucan system for the Blizzard Challenge 2021}},\n  author       = {Florian Lux and Julia Koch and Antje Schweitzer and Ngoc Thang Vu},\n  booktitle    = {Blizzard Challenge Workshop},\n  publisher    = {ISCA Speech Synthesis SIG}\n}\n```\n\n### 添加发音特征和元学习预训练 [[相关代码和模型]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv1.1)\n\n```\n@inproceedings{lux2022laml,\n  year         = 2022,\n  title        = {{Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features}},\n  author       = {Florian Lux and Ngoc Thang Vu},\n  booktitle    = {ACL}\n}\n```\n\n### 添加精确的韵律克隆能力 
[[相关代码和模型]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv2.2)\n\n```\n@inproceedings{lux2022cloning,\n  year         = 2022,\n  title        = {{Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech}},\n  author       = {Lux, Florian and Koch, Julia and Vu, Ngoc Thang},\n  booktitle    = {SLT},\n  publisher    = {IEEE}\n}\n```\n\n### 添加语言嵌入和词边界 [[相关代码和模型]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv2.2)\n\n```\n@inproceedings{lux2022lrms,\n  year         = 2022,\n  title        = {{Low-Resource Multilingual and Zero-Shot Multispeaker TTS}},\n  author       = {Florian Lux and Julia Koch and Ngoc Thang Vu},\n  booktitle    = {AACL}\n}\n```\n\n### 添加可控制的说话人嵌入生成 [[相关代码和模型]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv2.3)\n\n```\n@inproceedings{lux2023controllable,\n  year         = 2023,\n  title        = {{Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions}},\n  author       = {Florian Lux and Pascal Tilli and Sarina Meyer and Ngoc Thang Vu},\n  booktitle    = {Interspeech},\n  publisher    = {ISCA}\n}\n```\n\n### 我们对 2023 年 Blizzard Challenge 的贡献 [[相关代码和模型]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv2.b)\n\n```\n@inproceedings{lux2023blizzard,\n  year         = 2023,\n  title        = {{The IMS Toucan System for the Blizzard Challenge 2023}},\n  author       = {Florian Lux and Julia Koch and Sarina Meyer and Thomas Bott and Nadja Schauffler and Pavel Denisov and Antje Schweitzer and Ngoc Thang Vu},\n  booktitle    = {Blizzard Challenge Workshop},\n  publisher    = {ISCA Speech Synthesis SIG}\n}\n```\n\n### 推出首个支持 7000 多种语言的 TTS 系统 [[相关代码和模型]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002Fv3.0)\n\n```\n@inproceedings{lux2024massive,\n  year         = 2024,\n  title        = {{Meta Learning 
Text-to-Speech Synthesis in over 7000 Languages}},\n  author       = {Florian Lux and Sarina Meyer and Lyonel Behringer and Frank Zalkow and Phat Do and Matt Coler and Emanuël A. P. Habets and Ngoc Thang Vu},\n  booktitle    = {Interspeech},\n  publisher    = {ISCA}\n}\n```\n\n### 将基于文本的上下文提示引入 NAR TTS [[相关代码和模型]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Freleases\u002Ftag\u002F2.p)\n\n```\n@inproceedings{bott2024prompting,\n  year         = 2024,\n  title        = {{Controlling Emotion in Text-to-Speech with Natural Language Prompts}},\n  author       = {Thomas Bott and Florian Lux and Ngoc Thang Vu},\n  booktitle    = {Interspeech},\n  publisher    = {ISCA}\n}\n```\n\n### 探究随机韵律建模 [[相关代码和模型]](https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Ftree\u002FStochasticProsodyModeling)\n\n```\n@inproceedings{mayer2025stochastic,\n  year         = 2025,\n  title        = {{Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis}},\n  author       = {Paul Mayer and Florian Lux and Alejandro Pérez-González-de-Martos and Angelina Elizarova and Lindsey Vanderlyn and Dirk Väth and Ngoc Thang Vu},\n  booktitle    = {Interspeech},\n  publisher    = {ISCA}\n}\n```","# IMS-Toucan 快速上手指南\n\nIMS-Toucan 是由德国斯图加特大学自然语言处理研究所（IMS）开发的先进文本转语音（TTS）工具包。它支持超过 7000 种语言，具有速度快、可控性强且对计算资源要求相对较低的特点。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**：推荐 Linux（Windows 和 macOS 也可用，但配置稍复杂）。\n- **Python 版本**：推荐 Python 3.10。\n- **硬件要求**：\n  - **训练**：必须至少拥有一块支持 CUDA 的 GPU。\n  - **推理**：无需 GPU，CPU 即可运行。\n\n### 前置依赖 (Linux)\n在大多数 Linux 发行版中，以下包已预装。若未安装，请使用 `apt-get` 安装：\n\n```bash\nsudo apt-get update\nsudo apt-get install libsndfile1 espeak-ng ffmpeg libasound-dev libportaudio2 libsqlite3-dev\n```\n\n> **注意**：`espeak-ng` 是可选但强烈推荐的依赖，用于处理多种语言的特殊发音情况。Windows 和 macOS 用户需参考官方文档单独安装并配置环境变量 `PHONEMIZER_ESPEAK_LIBRARY`。\n\n## 安装步骤\n\n1. **克隆仓库**\n   将项目克隆到本地机器：\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan.git\n   cd IMS-Toucan\n   ```\n\n2. 
**创建虚拟环境**\n   推荐使用 Python 虚拟环境以避免依赖冲突：\n   ```bash\n   python -m venv toucan_env\n   source toucan_env\u002Fbin\u002Factivate\n   ```\n   *(Windows 用户激活命令为：`toucan_env\\Scripts\\activate`)*\n\n3. **安装依赖**\n   安装项目所需的核心库：\n   ```bash\n   pip install --no-cache-dir -r requirements.txt\n   ```\n   > **提示**：国内用户若下载缓慢，可添加清华或阿里镜像源加速：\n   > `pip install --no-cache-dir -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n\n4. **GPU 支持 (仅限 Windows)**\n   Linux 用户通常无需额外操作即可使用 GPU。Windows 用户若需启用 GPU 加速，请前往 [PyTorch 官网](https:\u002F\u002Fpytorch.org\u002F) 获取对应的安装命令重新安装 torch。\n\n5. **存储配置 (可选)**\n   若需更改模型和缓存文件的默认存储路径，可编辑 `Utility\u002Fstorage_config.py` 文件进行自定义设置。\n\n## 基本使用\n\nIMS-Toucan 会自动从 Hugging Face 下载预训练模型（首次运行时可能需要一点时间）。\n\n### 1. 交互式推理示例\n创建一个 Python 脚本（例如 `demo.py`），使用内置接口进行语音合成：\n\n```python\nfrom InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface\n\n# 初始化接口，自动加载默认预训练模型\ntts = ToucanTTSInterface()\n\n# 设置语言 (ISO 639-3 代码，例如 'eng' 代表英语，'cmn' 代表普通话)\n# 可用语言列表见：Utility\u002Flanguage_list.md\ntts.set_language(\"eng\") \n\n# 方法一：生成音频文件\n# 输入文本列表和输出文件名\ntts.read_to_file([\"Hello, this is a test of IMS Toucan.\"], \"output.wav\")\n\n# 方法二：直接朗读并播放 (可选参数 view=True 可显示波形图)\n# tts.read_aloud(\"Hello, welcome to the future of speech synthesis.\", view=False)\n```\n\n运行脚本：\n```bash\npython demo.py\n```\n\n### 2. 运行官方交互演示\n项目提供了现成的演示脚本，可直接体验功能：\n\n```bash\npython run_interactive_demo.py\n```\n在该脚本中，你可以动态调整语速、音高变化率和能量变化率等参数。\n\n### 3. 
将文本保存为文件\n若只需批量将文本转换为音频文件：\n\n```bash\npython run_text_to_file_reader.py\n```\n\n> **提示**：预训练模型支持的语言非常广泛，具体语言代码请查阅项目仓库中的 `Utility\u002Flanguage_list.md` 文件。","一家专注于非洲本土文化保护的公益组织，正紧急需要将数千页濒危语言的口述历史文本转化为有声档案，以便在缺乏识字率的社区中通过广播播放。\n\n### 没有 IMS-Toucan 时\n- **语言覆盖严重不足**：主流 TTS 工具仅支持几十种通用语，面对项目所需的数百种非洲方言（如班图语系分支），完全无法找到预训练模型，导致大量珍贵文本无法发声。\n- **定制成本高昂**：若为每种小众语言单独采集数据并训练模型，需要昂贵的 GPU 集群和数月的时间，远超公益组织的预算和人力极限。\n- **语音表现僵硬**：传统拼接式合成或低质量模型生成的语音机械感强，缺乏情感起伏，难以让当地听众产生共鸣，甚至影响信息传达的准确性。\n- **部署门槛高**：现有方案往往依赖复杂的云端 API 或庞大的本地环境配置，在无稳定网络的偏远地区工作站上难以运行。\n\n### 使用 IMS-Toucan 后\n- **万语即刻可用**：利用 IMS-Toucan 支持的 7000 多种语言能力，团队直接调用预训练模型，瞬间覆盖了所有目标方言，无需从零开始收集数据。\n- **轻量高效训练**：借助其高效的架构，即便只有少量参考音频，也能在单张消费级显卡上快速微调出高质量语音，将原本数月的周期缩短至几天。\n- **可控自然合成**：通过调整韵律和情感参数，生成的语音语调自然、富有感染力，完美还原了讲述者的语气，极大提升了广播内容的可听性。\n- **离线灵活部署**：IMS-Toucan 支持本地离线推理，团队轻松将其部署在配置普通的笔记本电脑上，在无网环境下也能持续批量生成音频文件。\n\nIMS-Toucan 以极低的算力成本打破了语言壁垒，让技术真正服务于文化多样性的保护与传承。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDigitalPhonetics_IMS-Toucan_26313b6b.png","DigitalPhonetics","Speech and Language Technology (SaLT) at the University of Stuttgart","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FDigitalPhonetics_5a7d95ab.png","Research institute in the field of speech, natural language processing and machine learning",null,"thangvu@ims.uni-stuttgart.de","http:\u002F\u002Fwww.ims.uni-stuttgart.de\u002Finstitut\u002Farbeitsgruppen\u002Fdp\u002Findex.en.html","https:\u002F\u002Fgithub.com\u002FDigitalPhonetics",[81],{"name":82,"color":83,"percentage":84},"Python","#3572A5",100,2194,319,"2026-04-06T07:13:40","Apache-2.0","Linux, Windows, macOS","训练需要至少一块支持 CUDA 的 NVIDIA GPU（具体型号和显存未说明，需根据 batchsize 调整以避免 OOM）；推理不需要 GPU。多卡训练需使用 torchrun。","未说明",{"notes":93,"python":94,"dependencies":95},"1. Linux 下需预装 libsndfile1, espeak-ng, ffmpeg 等系统库。2. Windows 和 macOS 使用 espeak-ng 时需手动安装并设置 PHONEMIZER_ESPEAK_LIBRARY 环境变量指向对应的 .dll 或 .dylib 文件。3. macOS (M1) 建议通过 MacPorts 安装 espeak-ng。4. 预训练模型会自动下载，也可自定义存储路径。5. 
多 GPU 训练时不支持 nohup，建议使用 tmux。","3.10",[96,97,98,99,100,101,102,103],"torch","phonemizer","espeak-ng (系统级依赖)","libsndfile1","ffmpeg","libasound-dev","libportaudio2","libsqlite3-dev",[105,14],"音频",[107,108,109,110,111,112,113,114],"text-to-speech","toolkit","speech-synthesis","deep-learning","speech-processing","tts","pytorch","speech","2026-03-27T02:49:30.150509","2026-04-07T14:38:20.620033",[118,123,127,132,137,142],{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},22280,"如何判断模型训练是否已经收敛或完成？","观察损失曲线图（plot），特别是回归损失（regression_loss）和随机损失（stochastic_loss）的后半部分。如果曲线在末尾看起来相当平坦，说明模型已完成训练；如果曲线仍在明显下降，则尚未完成，需要增加训练步数。一个实用的小技巧是：直接设置一个非常大的步数（如 500k），定期检查图表，一旦曲线变平即可手动终止进程，无需等待达到设定步数。","https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Fissues\u002F195",{"id":124,"question_zh":125,"answer_zh":126,"source_url":122},22281,"进行声音克隆微调时，需要多少数据和训练步数才能获得好结果？","对于约 2.5 小时的高质量数据，训练 30,000 到 60,000 步通常已经足够。如果有 5 小时的优质数据，可以训练 80,000 步以获得更好的效果。需要注意的是，如果继续增加数据量或延长训练时间，收益会严重递减。此外，由于声音克隆并非该工具包的主要焦点，其效果可能无法完全满足极高的专业标准。",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},22282,"在使用 reference_speaker 进行推理时报错 'TDNNBlock.forward() got an unexpected keyword argument lengths' 怎么办？","这通常是因为输入的音频文件是立体声（Stereo）导致的。解决方案是将音频从立体声转换为单声道（Mono）。工具内部已包含检测机制，如果检测到时间轴和通道轴切换（即立体声格式），会自动尝试修正形状以适配单声道输入。请确保传入的 .wav 或 .mp3 文件已转换为单声道格式。","https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Fissues\u002F172",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},22283,"微调后的模型无法合成词语，只产生噪音，可能的原因是什么？","这通常不是数据量的问题，也不是单纯的音频质量（MOS）问题。最可能的原因是文本转录与语音内容不匹配（Text-Speech Mismatch）。例如，数据集中存在转录包含了未说出的内容，或者语音中说出了未记录在转录中的内容。建议运行自动语音识别（ASR）工具重新生成转录，以确保文本和音频严格对应，然后再使用新数据集进行微调。","https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Fissues\u002F189",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},22284,"训练 HiFiGAN 时在预处理阶段因内存不足崩溃怎么办？","这是由于加载全部数据导致 RAM 不足。在新版本的工具包中，引入了一种新的训练流程来解决此问题：该流程不再一次性加载所有数据，而是在每个 epoch 
中仅随机加载一部分数据块（chunk）进行训练，然后在下一个阶段加载另一个随机块。请使用更新后的版本进行训练以降低内存占用。","https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Fissues\u002F7",{"id":143,"question_zh":144,"answer_zh":145,"source_url":131},22285,"Toucan 支持哪些语言？是否支持小众语言（如克丘亚语、艾马拉语）？","Toucan 支持多种语言，包括一些资源较少的小众语言。开发者已与相关研究者合作，成功将克丘亚语（Quechua）和艾马拉语（Aymara）纳入支持范围。不过，对于非主流语言，如果使用的训练数据质量不高、标签不匹配或音素化器（phonemizer）表现不佳，合成质量可能会受到影响。建议使用高质量的数据集（如 LibrittsR 作为参考标准）进行训练或微调。",[147,152,157,162,167,172,176,181,186,191,196,201,206,211],{"id":148,"version":149,"summary_zh":150,"released_at":151},136024,"v3.1.2","本次发布包含一个全新的图形用户界面，使您可以精确控制语音的发音效果。\n\n您可以生成多种不同的语音输出，直到找到满意的一种。然后，您还可以通过拖动音高值和各个音素的持续时间来进一步调整。此外，在保留您对语调和时长所做的修改的同时，您也可以更换为其他语音合成器的声音。当然，这一切都支持超过7000种语言。\n\n只需更新新依赖项，并运行 `run_advanced_GUI_demo.py` 脚本即可。默认情况下，它会从 Hugging Face 加载预训练模型，但您也可以指定使用我们自己的模型。\n\n![image](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F25872d43-7297-4a78-8815-5b9f267f9552)","2024-10-07T13:22:58",{"id":153,"version":154,"summary_zh":155,"released_at":156},136025,"v3.1.1","这是一个小版本更新，改动不大，但对音频质量的影响却非常显著。此前，我们是在从真实参考音频文件中提取的声谱图上训练声码器的。尽管我们尽最大努力使合成声谱图尽可能自然，合成声谱图与真实声谱图之间始终存在一定的差距。过去我们认为这种差距微不足道，影响很小，但经过实验我们发现，它实际上会带来很大的差异。声码器是整个流程中的最后一环，它真正将生成的声谱图转化为可听的音频信号。因此，使用效果不佳的声码器，就好比给一辆昂贵的跑车装上廉价轮胎。\n\n现在，我们终于通过先生成合成声谱图，再用这些合成声谱图来训练声码器的方式解决了这一问题。此外，我们还增加了声码器的参数量，提升了其容量。推理速度保持不变，但音质得到了显著提升。本次更新属于为完成我的博士论文而进行的清理工作的一部分，除内部试点研究外，暂无专门的论文或正式评估报告。\n\n请更新您的代码，并运行模型下载脚本，即可立即获得显著的音质提升。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Fcompare\u002Fv3.1...v3.1.1","2024-09-22T14:41:06",{"id":158,"version":159,"summary_zh":160,"released_at":161},136026,"v3.1","## 变更内容\n\n本次发布提供了新的检查点，并改进了上一版本中因时间限制而未包含的若干方面。有关适用于7000种语言的通用TTS模型的更多信息，请参阅之前的v3.0版本。\n\n- 音韵学预测（包括音高、能量和时长）现采用随机采样方式，从概率分布中抽取值，而非假设一对一映射关系。\n- 增加了对更多国际音标修饰符的支持，以覆盖更多语言。\n- 在预训练阶段加入了更多语言。\n- 
重新设计了语言相似度预测模块及可视化功能。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FDigitalPhonetics\u002FIMS-Toucan\u002Fcompare\u002Fv3.0...v3.1","2024-07-25T07:16:57",{"id":163,"version":164,"summary_zh":165,"released_at":166},136027,"v3.0","本次发布扩展了工具包的功能，并提供了新的检查点。\n\n- 我们提升了整体的TTS质量，后续还将推出更多优化。\n- 新增了水印功能，以防止滥用。\n- 我们将支持范围扩展至ISO-639-3标准中的几乎所有语言（超过7000种语言！）。\n- 通过一些巧妙的设计，我们得以从一个基于462种语言预训练的检查点，推演出一个能够“说”所有我们现已支持文本前端的语言的检查点！\n- 此外还进行了大量简化和用户体验方面的改进。\n\n这一成果是与格罗宁根大学和埃尔兰根弗劳恩霍夫IIS研究所的同事合作的结晶。我们与斯图加特大学的研究团队共同构建了这款模型，它是同类中的首个。\n\n我们将在Interspeech 2024会议上展示这一成果。论文作者名单如下：Florian Lux、Sarina Meyer、Lyonel Behringer、Frank Zalkow、Phat Do、Matt Coler、Emanuël Habets和Thang Vu。\n\n论文链接：https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.06403  \n数据集链接：https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FFlux9665\u002FBibleMMS  \n交互式演示：https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FFlux9665\u002FMassivelyMultilingualTTS  \n静态演示：https:\u002F\u002Fanondemos.github.io\u002FMMDemo\u002F","2024-06-10T15:21:46",{"id":168,"version":169,"summary_zh":170,"released_at":171},136028,"2.p","在本次发布中，您可以在训练过程中根据情感提示对 TTS 模型进行条件化，并在推理阶段将任意提示中的情感迁移到合成语音中。\n\n演示样本可在 https:\u002F\u002Fanondemos.github.io\u002FPrompting\u002F 获取。此外，还有一个演示空间：https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FThommy96\u002Fpromptingtoucan。\n\n使用预训练模型：\n只需提供句子嵌入提取器的实例、说话人 ID 和提示，即可使用预训练模型进行推理（参见 run_sent_emb_test_suite.py）。\n\n训练您自己的模型：\n您需要为所有希望在训练中包含的情感类别提取若干提示及其对应的句子嵌入（例如，参见 extract_yelp_sent_embs.py）。然后，在您的训练流程中，加载这些句子嵌入并将其传递给训练循环。同时，在实例化 TTS 模型时，还需指定嵌入的维度，并将 static_speaker_embedding 设置为 True（参见 TrainingInterfaces\\TrainingPipelines\\ToucanTTS_Sent_Finetuning.py）。根据您用于训练的数据集中说话人的数量，您可能需要调整 TTS 模型中说话人嵌入表的维度。最后，请确保您使用的数据集已包含在用于从文件路径中提取情感和说话人 ID 的函数中（Utility\\utils.py）。","2024-06-10T14:03:18",{"id":173,"version":174,"summary_zh":76,"released_at":175},136029,"v2.asvspoof","2023-12-01T15:40:39",{"id":177,"version":178,"summary_zh":179,"released_at":180},136030,"v2.5","我们在一种全新架构中融入了大量创新设计，该架构将作为我们后续多语言及低资源语音合成研究的基础。我们将其命名为 
ToucanTTS，并像往常一样提供了预训练模型。合成质量非常出色，训练过程极为稳定；从头开始训练所需的数据点很少，而微调则需要更少的数据。这些指标很难用具体数值来量化，因此最好还是亲自试一试。\n\n此外，我们还提供使用 BigVGAN 声码器的选项。BigVGAN 的音质十分优秀，但在 CPU 上运行稍显缓慢。如果使用 GPU，强烈建议采用新的声码器。","2023-04-10T18:22:10",{"id":182,"version":183,"summary_zh":184,"released_at":185},136031,"v2.b","我们提交至[暴雪挑战赛2023](https:\u002F\u002Fwww.synsig.org\u002Findex.php\u002FBlizzard_Challenge_2023)的参赛作品","2023-04-04T14:15:02",{"id":187,"version":188,"summary_zh":189,"released_at":190},136032,"v2.4","本次发布扩展了工具包的功能，并提供了新的检查点。\n\n- 语音合成器采用新的采样率：使用24kHz而非48kHz会降低理论上的音质上限，但在实际应用中产生的伪影更少。\n- 新的TTS模型中集成了PortaSpeech基于流的后网络，几乎不增加额外开销就能带来更清晰的输出效果。\n- 通过在低维空间中利用更优的嵌入函数进行人工说话人生成，新增了更多的可控性选项。\n- 其他提升用户体验的改进，例如集成的微调示例、用于选择训练循环和语音合成器微调方法的仲裁器（尽管实际上通常无需手动指定）。\n- 大量的错误修复与性能提升。\n\n本次发布破坏了向后兼容性。如果您依赖旧模型，请下载新模型或继续使用之前的版本。\n\n未来的版本还将对所使用的语音合成器再做一次调整（采用BigVGAN生成器），并大幅增强单模型的多语言能力。","2023-02-22T17:08:36",{"id":192,"version":193,"summary_zh":194,"released_at":195},136033,"v2.3","本次发布扩展了工具包的功能，并提供了新的检查点。\n\n- 自包含嵌入：我们不再使用外部嵌入模型来进行TTS条件控制。取而代之的是，我们训练了一个专门为此用途定制的嵌入模型。\n- 新声码器：Avocodo 取代了 HiFi-GAN。\n- 通过人工合成说话人实现了新的可控性选项。\n- 多项提升用户体验的改进，包括与Weights & Biases的集成、示例演示脚本以及模型的自动下载功能。\n- 大量的错误修复和性能提升。\n\n请注意，本次发布不兼容旧版本。如果您依赖于旧模型，请下载新模型或继续使用之前的版本。","2022-10-25T15:16:35",{"id":197,"version":198,"summary_zh":199,"released_at":200},136034,"v2.2","本次发布扩展了工具包的功能，并提供了新的检查点。\n\n**新特性：**\n- 通过扩展的发音部位特征查找，支持国际音标标准中的所有音素。\n- 通过解析（声调、延长、主要重音），支持国际音标标准中的一些超音段标记。\n- 使用 Praat-Parselmouth，显著提升基频提取性能。\n- 提高了音素化速度。\n- 添加了词边界，这些边界对对齐器和解码器不可见，但在多语言场景下可帮助编码器更好地工作。\n- 增加了声调语言的支持，并经过测试后纳入预训练模型中（中文、越南语）。\n- 新增 Scorer 类，可在给定已训练模型和数据集缓存的情况下检查数据（可使用提供的预训练模型实现此功能）。\n- 提供直观的控件，用于调整持续时间以及基频和能量的方差。\n- 修复了大量错误并提升了运行速度。\n\n**注意：**\n- 本版本不向后兼容。请确保使用配套的预训练模型。旧的检查点和数据集缓存将不再兼容，只有 HiFiGAN 仍保持兼容。\n- 下一版本的工作已在进行中。我们接下来的目标是改进语音适配功能。\n- 若要使用预训练检查点，请先下载它们，创建相应的目录，并按以下方式将其放置到您的代码库中（其中 HiFiGAN 和 FastSpeech2 的检查点在放置后需要重命名）：\n\n``` \n...\nModels\n└─ Aligner\n      └─ aligner.pt\n└─ FastSpeech2_Meta\n      └─ best.pt\n└─ HiFiGAN_combined\n      └─ 
best.pt\n...\n```","2022-05-20T10:04:57",{"id":202,"version":203,"summary_zh":204,"released_at":205},136035,"v2.1","- self contained aligner to get high quality durations quickly and easily without reliance on external tools or knowledge distillation\r\n- modelling speakers and languages jointly but disentangled, so you can use speakers across languages\r\n- look at the demo section for an interactive online demo\r\n\r\nPretrained FastSpeech2 model that can speak in many languages in any voices, HiFiGAN model and Aligner model are attached to this commit.","2022-03-01T20:37:47",{"id":207,"version":208,"summary_zh":209,"released_at":210},136036,"v1.1","This release includes our new text frontend that uses articulatory features of phonemes instead of phoneme identities as well as checkpoints trained with a variant of model agnostic meta learning that are very well suited as basis for fine-tuning a single speaker model on very little data in lots of different languages.","2022-02-28T20:36:18",{"id":212,"version":213,"summary_zh":214,"released_at":215},136037,"v1.0","The basic version of Tacotron 2, FastSpeech 2 and HiFiGAN are complete. A pretrained model for HiFiGAN is attached to this release.\r\n\r\nFuture updates will include different models and new features and changes to existing models which will break backwards compatibility. This version is the most basic, but complete.","2022-01-14T16:49:30"]