[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-OpenMOSS--MOSS-TTSD":3,"tool-OpenMOSS--MOSS-TTSD":64},[4,17,27,35,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",144730,2,"2026-04-07T23:26:32",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85013,"2026-04-06T11:09:19",[26,43,44,45,14,46,15,13,47],"数据工具","视频","插件","其他","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":23,"last_commit_at":54,"category_tags":55,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 
通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[14,26,13,15,46],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":23,"last_commit_at":62,"category_tags":63,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",75097,"2026-04-07T22:51:14",[15,26,13,46],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":79,"owner_website":81,"owner_url":82,"languages":83,"stars":88,"forks":89,"last_commit_at":90,"license":91,"difficulty_score":92,"env_os":93,"env_gpu":94,"env_ram":95,"env_deps":96,"category_tags":106,"github_topics":107,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":114,"updated_at":115,"faqs":116,"releases":117},5416,"OpenMOSS\u002FMOSS-TTSD","MOSS-TTSD","MOSS-TTSD is a spoken dialogue generation model designed for expressive multi-speaker synthesis. 
It features long-context modeling, flexible speaker control, and multilingual support, while enabling zero-shot voice cloning from short audio references.","MOSS-TTSD 是一款专为生成长篇多人对话而设计的开源语音生成模型。它突破了传统语音合成仅擅长单人朗读的局限，能够将静态的对话脚本转化为充满情感起伏、自然流畅的多人口语互动，完美模拟真实交谈中的轮替节奏与重叠说话场景。\n\n该工具主要解决了长篇幅内容中角色声音一致性难维持、多角色切换生硬以及缺乏情感表现力的问题。无论是制作长达一小时的播客节目、有声书演播，还是体育解说、相声表演及影视配音，MOSS-TTSD 都能确保在单次生成中保持长达 60 分钟的音频连贯性与角色身份稳定。\n\nMOSS-TTSD 非常适合需要高质量音频内容的创作者、广播剧制作人、游戏开发者以及从事语音技术研究的工程师。其核心技术亮点包括强大的长上下文建模能力，支持灵活控制 1 至 5 位不同说话人；具备先进的零样本语音克隆功能，仅需极短的参考音频即可复刻音色；同时支持中文、英文、日文等多种语言的跨语言合成。作为 OpenMOSS 家族的一员，它以开源形式为专业级长音频内容创作提供了坚实的技术底座。","\u003Cdiv align=\"center\">\n    \u003Ch1>\n    MOSS-TTSD: Text to Spoken Dialogue Generation\n    \u003C\u002Fh1>\n    \u003Cp>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMOSS_MOSS-TTSD_readme_21622e81d8aa.png\" alt=\"OpenMOSS Logo\" width=\"300\">\n    \u003Cp>\n    \u003C\u002Fp>\n    \u003Ca href=\"https:\u002F\u002Fmosi.cn\u002Fmodels\u002Fmoss-ttsd\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-Read%20More-green\" alt=\"blog\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.19739\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-2603.19739%20-red\" alt=\"paper\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTSD-v1.0\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20MOSS%20TTSD%20-v1.0-yellow\" alt=\"MOSS-TTSD-v1.0\">\u003C\u002Fa>\n     \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenMOSS-Team\u002FMOSS-TTSD\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Huggingface%20%20-space-orange\" alt=\"MOSS-TTSD-space\">\u003C\u002Fa>\n    \u003Ca href=\"\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAI%20Studio-Coming%20Soon-blue\" alt=\"AI Studio\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10+-orange\" alt=\"python\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTSD\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.0+-brightgreen\" alt=\"pytorch\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTSD\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg\" alt=\"apache-2.0\">\u003C\u002Fa>\n    \u003Cbr>\n\n\u003C\u002Fdiv>\n\n\n# MOSS-TTSD🪐\n\n[English](README.md) | [简体中文](README_zh.md)\n\n\u003C!-- **MOSS-TTSD** is a long-form spoken dialogue generation model that enables highly expressive multi-party conversational speech synthesis across multiple languages. It supports continuous long-duration generation, flexible multi-speaker dialogue control, and state-of-the-art zero-shot voice cloning with only short reference audio. MOSS-TTSD is designed for real-world long-form content creation, including podcasts, audiobooks, sports and esports commentary, dubbing, crosstalk, and entertainment scenarios. （about）-->\n\n\n## Overview\n \u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMOSS_MOSS-TTSD_readme_ee002edd316a.png\" alt=\"alt text\" width=\"330\">\n  \u003C\u002Fp>\n\nMOSS-TTSD is the long-form dialogue specialist within our open-source [MOSS‑TTS Family](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS). 
While foundational models typically prioritize high-fidelity single-speaker synthesis, MOSS-TTSD is architected to bridge the gap between isolated audio samples and cohesive, continuous human interaction.\nThe model represents a paradigm shift from \"text-to-speech\" to \"script-to-conversation.\" By prioritizing the flow and emotional nuances of multi-party engagement, MOSS-TTSD transforms static dialogue scripts into dynamic, expressive oral performances. It is designed to serve as a robust backbone for creators and developers who require seamless transitions between distinct speaker personas without sacrificing narrative continuity.\nWhether it is capturing the spontaneous energy of a live talk show or the structured complexity of a multilingual drama, MOSS-TTSD provides the stability and expressive depth necessary for professional-grade, long-form content creation in an open-source framework.\n\n\n## Highlights\n- **From Monologue to Dialogue**: Unlike traditional TTS, which optimizes for reading, MOSS-TTSD masters the rhythm of conversation. It supports 1 to 5 speakers with flexible control, handling natural turn-taking, overlapping speech patterns, and distinct persona maintenance.\n- **Extreme Long-Context Modeling**: Moving beyond short-sentence generation, the model is architected for stability over long durations, supporting up to 60 minutes of coherent audio in a single session with consistent identity.\n- **Diverse Scenario Adaptation**: Fine-tuned for high-variability scenarios including conversational media (AI Podcasts), dynamic commentary (Sports\u002FEsports), and entertainment (Audiobooks, Dubbing, and Crosstalk).\n- **Multilingual & Zero-Shot Capabilities**: Features state-of-the-art zero-shot voice cloning requiring only short reference audio, with robust cross-lingual performance across major languages including Chinese, English, Japanese, and European languages.\n\n\n## News 🚀\n- **[2026-03-18]** We added efficient end-to-end SGLang inference support for MOSS-TTSD v1.0.\n- **[2026-03-06]** We added end-to-end SGLang inference support for MOSS-TTSD v0.7. For detailed instructions, please see the [legacy v0.7 docs](.\u002Flegacy\u002Fv0.7\u002FREADME.md).\n- **[2026-02-10]** MOSS-TTSD v1.0 is officially released! This milestone version redefines long-form synthesis with 60-minute single-session context and support for multi-party interactions. It significantly expands multilingual capabilities and diverse usage scenarios.\n- **[2025-11-01]** MOSS-TTSD v0.7 is released! v0.7 significantly improves audio quality, voice cloning capability, and stability, adds support for 32 kHz high‑quality output, and greatly extends single‑pass generation length (960 s → 1700 s).\n- **[2025-09-09]** We added support for the SGLang inference engine, accelerating model inference by up to **16x**.\n- **[2025-08-25]** We released the 32 kHz version of XY-Tokenizer.\n- **[2025-08-12]** We added support for streaming inference in MOSS-TTSD v0.5.\n- **[2025-07-29]** We provided the SiliconFlow API interface and usage examples for MOSS-TTSD v0.5.\n- **[2025-07-16]** We open-sourced the fine-tuning code for MOSS-TTSD v0.5, supporting full-parameter fine-tuning, LoRA fine-tuning, and multi-node training.\n- **[2025-07-04]** MOSS-TTSD v0.5 is released! v0.5 improves the accuracy of timbre switching, voice cloning capability, and model stability.\n- **[2025-06-20]** MOSS-TTSD v0 is released! 
Moreover, we provide a podcast generation pipeline named Podever, which can automatically convert PDF, URL, or long text files into high-quality podcasts.\n\n**Note:** For MOSS-TTSD v0.7 (including end-to-end SGLang inference), please refer to the [legacy v0.7 docs](.\u002Flegacy\u002Fv0.7\u002FREADME.md) for detailed instructions.\n\n## Supported Languages\n\nMOSS-TTSD currently supports **20 languages**:\n\n| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |\n|---|---|---|---|---|---|---|---|---|\n| Chinese | zh | 🇨🇳 | English | en | 🇺🇸 | German | de | 🇩🇪 |\n| Spanish | es | 🇪🇸 | French | fr | 🇫🇷 | Japanese | ja | 🇯🇵 |\n| Italian | it | 🇮🇹 | Hebrew | he | 🇮🇱 | Korean | ko | 🇰🇷 |\n| Russian | ru | 🇷🇺 | Persian (Farsi) | fa | 🇮🇷 | Arabic | ar | 🇸🇦 |\n| Polish | pl | 🇵🇱 | Portuguese | pt | 🇵🇹 | Czech | cs | 🇨🇿 |\n| Danish | da | 🇩🇰 | Swedish | sv | 🇸🇪 | Hungarian | hu | 🇭🇺 |\n| Greek | el | 🇬🇷 | Turkish | tr | 🇹🇷 |  |  |  |\n\n## Installation\n\nTo run MOSS-TTSD, you need to install the required dependencies. You can use pip and conda to set up your environment.\n\n### Using conda\n\n```bash\nconda create -n moss_ttsd python=3.12 -y && conda activate moss_ttsd\npip install -r requirements.txt\npip install flash-attn\n```\n\n## Usage\n\n### Quick Start\n\nMOSS-TTSD uses a **continuation** workflow: provide reference audio for each speaker, their transcripts as a prefix, and the dialogue text to generate. The model continues in each speaker's identity.\n\n```python\nimport os\nfrom pathlib import Path\nimport torch\nimport soundfile as sf\nimport torchaudio\nfrom transformers import AutoModel, AutoProcessor\n\npretrained_model_name_or_path = \"OpenMOSS-Team\u002FMOSS-TTSD-v1.0\"\naudio_tokenizer_name_or_path = \"OpenMOSS-Team\u002FMOSS-Audio-Tokenizer\"\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\ndtype = torch.bfloat16 if device == \"cuda\" else torch.float32\n\nprocessor = AutoProcessor.from_pretrained(\n    pretrained_model_name_or_path,\n    trust_remote_code=True,\n    codec_path=audio_tokenizer_name_or_path,\n)\nprocessor.audio_tokenizer = processor.audio_tokenizer.to(device)\nprocessor.audio_tokenizer.eval()\n\nattn_implementation = \"flash_attention_2\" if device == \"cuda\" else \"sdpa\"\n# If flash_attention_2 is unavailable on your environment, set this to \"sdpa\".\nmodel = AutoModel.from_pretrained(\n    pretrained_model_name_or_path,\n    trust_remote_code=True,\n    attn_implementation=attn_implementation,\n    torch_dtype=dtype,\n).to(device)\nmodel.eval()\n\n# --- Inputs ---\n\nprompt_audio_speaker1 = \"asset\u002Freference_02_s1.wav\"\nprompt_audio_speaker2 = \"asset\u002Freference_02_s2.wav\"\nprompt_text_speaker1 = \"[S1] In short, we embarked on a mission to make America great again for all Americans.\"\nprompt_text_speaker2 = \"[S2] NVIDIA reinvented computing for the first time after 60 years. In fact, Erwin at IBM knows quite well that the computer has largely been the same since the 60s.\"\n\ntext_to_generate = \"\"\"\n[S1] Listen, let's talk business. China. I'm hearing things.\nPeople are saying they're catching up. Fast. What's the real scoop?\nTheir AI—is it a threat?\n[S2] Well, the pace of innovation there is extraordinary, honestly.\nThey have the researchers, and they have the drive.\n[S1] Extraordinary? I don't like that. I want us to be extraordinary.\nAre they winning?\n[S2] I wouldn't say winning, but their progress is very promising.\nThey are building massive clusters. They're very determined.\n[S1] Promising. 
There it is. I hate that word.\nWhen China is promising, it means we're losing.\nIt's a disaster, Jensen. A total disaster.\n\"\"\".strip()\n\n# --- Load & resample audio ---\n\ntarget_sr = int(processor.model_config.sampling_rate)\naudio1, sr1 = sf.read(prompt_audio_speaker1, dtype=\"float32\", always_2d=True)\naudio2, sr2 = sf.read(prompt_audio_speaker2, dtype=\"float32\", always_2d=True)\nwav1 = torch.from_numpy(audio1).transpose(0, 1).contiguous()\nwav2 = torch.from_numpy(audio2).transpose(0, 1).contiguous()\n\nif wav1.shape[0] > 1:\n    wav1 = wav1.mean(dim=0, keepdim=True)\nif wav2.shape[0] > 1:\n    wav2 = wav2.mean(dim=0, keepdim=True)\nif sr1 != target_sr:\n    wav1 = torchaudio.functional.resample(wav1, sr1, target_sr)\nif sr2 != target_sr:\n    wav2 = torchaudio.functional.resample(wav2, sr2, target_sr)\n\n# --- Build conversation ---\n\nreference_audio_codes = processor.encode_audios_from_wav([wav1, wav2], sampling_rate=target_sr)\nconcat_prompt_wav = torch.cat([wav1, wav2], dim=-1)\nprompt_audio = processor.encode_audios_from_wav([concat_prompt_wav], sampling_rate=target_sr)[0]\n\nfull_text = f\"{prompt_text_speaker1} {prompt_text_speaker2} {text_to_generate}\"\n\nconversations = [\n    [\n        processor.build_user_message(\n            text=full_text,\n            reference=reference_audio_codes,\n        ),\n        processor.build_assistant_message(\n            audio_codes_list=[prompt_audio]\n        ),\n    ],\n]\n\n# --- Inference ---\n\nbatch_size = 1\n\nsave_dir = Path(\"output\")\nsave_dir.mkdir(exist_ok=True, parents=True)\nsample_idx = 0\nwith torch.no_grad():\n    for start in range(0, len(conversations), batch_size):\n        batch_conversations = conversations[start : start + batch_size]\n        batch = processor(batch_conversations, mode=\"continuation\")\n        input_ids = batch[\"input_ids\"].to(device)\n        attention_mask = batch[\"attention_mask\"].to(device)\n\n        outputs = model.generate(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            max_new_tokens=2000,\n        )\n\n        for message in processor.decode(outputs):\n            for seg_idx, audio in enumerate(message.audio_codes_list):\n                sf.write(\n                    save_dir \u002F f\"{sample_idx}_{seg_idx}.wav\",\n                    audio.detach().cpu().to(torch.float32).numpy(),\n                    int(processor.model_config.sampling_rate),\n                )\n            sample_idx += 1\n\n```\n### Batch Inference\n\nYou can use the provided inference script for batch inference. The script automatically uses all visible GPUs. You can control GPU visibility via `export CUDA_VISIBLE_DEVICES=\u003Cdevice_ids>`.\n\n```bash\npython inference.py \\\n  --model_path OpenMOSS-Team\u002FMOSS-TTSD-v1.0 \\\n  --codec_model_path OpenMOSS-Team\u002FMOSS-Audio-Tokenizer \\\n  --input_jsonl \u002Fpath\u002Fto\u002Finput.jsonl \\\n  --save_dir outputs \\\n  --mode voice_clone_and_continuation \\\n  --batch_size 1 \\\n  --text_normalize\n```\n\nParameters:\n\n- `--model_path`: Path or HuggingFace model ID for MOSS-TTSD.\n- `--codec_model_path`: Path or HuggingFace model ID for MOSS-Audio-Tokenizer.\n- `--input_jsonl`: Path to the input JSONL file containing dialogue scripts and speaker prompts.\n- `--save_dir`: Directory where the generated audio files will be saved.\n- `--mode`: Inference mode. Choices: `generation`, `continuation`, `voice_clone`, `voice_clone_and_continuation`. 
We recommend using `voice_clone_and_continuation` for the best voice cloning experience.\n- `--batch_size`: Number of samples per batch (default: `1`).\n- `--max_new_tokens`: Maximum number of new tokens to generate. Controls total generated audio length (1s ≈ 12.5 tokens).\n- `--temperature`: Sampling temperature (default: `1.1`).\n- `--top_p`: Top-p sampling threshold (default: `0.9`).\n- `--top_k`: Top-k sampling threshold (default: `50`).\n- `--repetition_penalty`: Repetition penalty (default: `1.1`).\n- `--text_normalize`: Normalize input text (**recommended to always enable**).\n- `--sample_rate_normalize`: Resample prompt audios to the lowest sample rate before encoding (**recommended when using 2 or more speakers**).\n\n#### JSONL Input Format\n\nThe input JSONL file should contain one JSON object per line. MOSS-TTSD supports 1 to 5 speakers per dialogue. Use `[S1]`–`[S5]` tags in the `text` field and provide corresponding `prompt_audio_speakerN` \u002F `prompt_text_speakerN` pairs for each speaker:\n```json\n{\n  \"base_path\": \"\u002Fpath\u002Fto\u002Faudio\u002Ffiles\",\n  \"text\": \"[S1]Speaker 1 dialogue[S2]Speaker 2 dialogue[S3]...[S4]...[S5]...\",\n  \"prompt_audio_speaker1\": \"path\u002Fto\u002Fspeaker1_audio.wav\",\n  \"prompt_text_speaker1\": \"Reference text for speaker 1 voice cloning\",\n  \"prompt_audio_speaker2\": \"path\u002Fto\u002Fspeaker2_audio.wav\",\n  \"prompt_text_speaker2\": \"Reference text for speaker 2 voice cloning\",\n  \"...\": \"...\",\n  \"prompt_audio_speaker5\": \"path\u002Fto\u002Fspeaker5_audio.wav\",\n  \"prompt_text_speaker5\": \"Reference text for speaker 5 voice cloning\"\n}\n```\n\n### Accelerate Inference with SGLang\n\nMOSS-TTSD v1.0 supports running the fused MOSS-TTSD and MOSS-Audio-Tokenizer model with the deeply extended [SGLang](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002Fsglang) from OpenMOSS, enabling efficient inference for audio generation.\n\n#### Environment Setup\n\nFirst, clone the SGLang branch compatible with MOSS-TTSD v1.0.\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002Fsglang -b moss-ttsd-v1.0-with-cat\n```\n\n##### Using venv\n\n```bash\npython -m venv moss_ttsd_sglang\nsource moss_ttsd_sglang\u002Fbin\u002Factivate\npip install .\u002Fsglang\u002Fpython[all]\n```\n\n##### Using conda\n\n```bash\nconda create -n moss_ttsd_sglang python=3.12\nconda activate moss_ttsd_sglang\npip install .\u002Fsglang\u002Fpython[all]\n```\n\n#### End-to-End Inference Service\n\n##### Start the inference server\n\nBefore starting the service, first download [MOSS-TTSD-v1.0](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTSD-v1.0) and [MOSS-Audio-Tokenizer](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Audio-Tokenizer).\n\n```bash\ngit clone https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTSD-v1.0\ngit clone https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Audio-Tokenizer\n```\n\nOr:\n\n```bash\nhf download OpenMOSS-Team\u002FMOSS-TTSD-v1.0 --local-dir .\u002FMOSS-TTSD-v1.0\nhf download OpenMOSS-Team\u002FMOSS-Audio-Tokenizer --local-dir .\u002FMOSS-Audio-Tokenizer\n```\n\nAfter the download is complete, run the following command to fuse MOSS-TTSD v1.0 and MOSS-Audio-Tokenizer into a single-directory model that can be loaded by SGLang. 
After fusion, the model uses `voice_clone_and_continuation` inference mode by default:\n\n```bash\npython scripts\u002Ffuse_moss_tts_delay_with_codec.py \\\n  --model-path \u003Cpath-to-moss-ttsd-v1.0> \\\n  --codec-model-path \u003Cpath-to-moss-audio-tokenizer> \\\n  --save-path \u003Cpath-to-fused-model>\n```\n\nThen start the inference server with:\n\n```bash\nsglang serve \\\n  --model-path \u003Cpath-to-fused-model> \\\n  --delay-pattern \\\n  --trust-remote-code \\\n  --port 30000 --host 0.0.0.0\n```\n\n> The first service startup may take longer due to compilation. Once you see `The server is fired up and ready to roll!`, the service is ready. The first request after startup may still trigger a lengthy compilation, which is expected behavior, so please be patient.\n\n> **Tip:** The end-to-end inference service may cause some VRAM fragmentation during runtime. If GPU memory is tight, we recommend using `--mem-fraction-static` when starting SGLang to reserve enough space for intermediate tensors.\n\n##### Send a generation request\n\nThe service API is compatible with the standard multimodal text-generation interface. The `text` field in the returned JSON contains the base64-encoded WAV audio.\n\nThe repository currently provides a minimal request example script:\n\n```bash\npython scripts\u002Frequest_sglang_generation.py\n```\n\nThis script will:\n\n- send requests to `http:\u002F\u002Flocalhost:30000\u002Fgenerate` by default\n- use `asset\u002Freference_02_s1.wav` and `asset\u002Freference_02_s2.wav` in the repository as reference audio\n- save the returned audio to `outputs\u002Foutput.wav`\n\nIf you need to change the reference audio, input text, sampling parameters, or server URL, you can directly edit the corresponding constants in `scripts\u002Frequest_sglang_generation.py`.\n\n## Evaluation\n\n### Objective Evaluation (TTSD-eval)\n\nWe introduce a robust evaluation framework leveraging MMS-FA for word-level alignment and utterance segmentation, and wespeaker for speaker-embedding extraction, to derive Speaker Attribution Accuracy (ACC) and Speaker Similarity (SIM). 
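\n\nAt a high level, SIM averages the cosine similarity between each generated segment's speaker embedding and the reference embedding of its scripted speaker, while ACC measures how often a segment's nearest reference speaker matches the script label. A minimal illustrative sketch with dummy embeddings (not the TTSD-eval implementation, which derives embeddings with wespeaker after MMS-FA alignment):\n\n```python\nimport numpy as np\n\ndef cosine(a: np.ndarray, b: np.ndarray) -> float:\n    # Cosine similarity between two speaker embeddings.\n    return float(np.dot(a, b) \u002F (np.linalg.norm(a) * np.linalg.norm(b)))\n\n# Dummy stand-ins: one embedding per reference speaker, plus one embedding\n# per aligned utterance segment together with its scripted speaker tag.\nrng = np.random.default_rng(0)\nref = {\"S1\": rng.standard_normal(256), \"S2\": rng.standard_normal(256)}\nsegments = [(\"S1\", rng.standard_normal(256)), (\"S2\", rng.standard_normal(256))]\n\n# SIM: mean similarity between each segment and its labeled speaker's reference.\nsim = float(np.mean([cosine(ref[spk], emb) for spk, emb in segments]))\n\n# ACC: fraction of segments whose closest reference speaker matches the label.\nacc = float(np.mean([\n    max(ref, key=lambda s: cosine(ref[s], emb)) == spk\n    for spk, emb in segments\n]))\nprint(f\"SIM={sim:.4f}  ACC={acc:.4f}\")\n```\n\n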
Please refer to [TTSD-eval](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FTTSD-eval) for the code and data.\n\n\u003Cbr>\n\n| Model | ZH - SIM | ZH - ACC | ZH - WER | EN - SIM | EN - ACC | EN - WER |\n| :--- | :---: | :---: | :---: | :---: | :---: | :---: |\n| **Comparison with Open-Source Models** | | | | | | |\n| MOSS-TTSD | **0.7949** | **0.9587** | **0.0485** | **0.7326** | **0.9626** | 0.0988 |\n| MOSS-TTSD v0.7 | 0.7423 | 0.9391 | 0.0517 | 0.6743 | 0.9266 | 0.1612 |\n| Vibevoice 7B | 0.7590 | 0.9222 | 0.0570 | 0.7140 | 0.9554 | **0.0946** |\n| Vibevoice 1.5B | 0.7415 | 0.8798 | 0.0818 | 0.6961 | 0.9353 | 0.1133 |\n| FireRedTTS2 | 0.7383 | 0.9022 | 0.0768 | - | - | - |\n| Higgs Audio V2 | - | - | - | 0.6860 | 0.9025 | 0.2131 |\n| **Comparison with Proprietary Models** | | | | | | |\n| Eleven V3 | 0.6970 | 0.9653 | **0.0363** | 0.6730 | 0.9498 | **0.0824** |\n| MOSS-TTSD (elevenlabs_voice) | **0.8165** | **0.9736** | 0.0391 | **0.7304** | **0.9565** | 0.1005 |\n| | | | | | | |\n| gemini-2.5-pro-preview-tts | - | - | - | 0.6786 | 0.9537 | **0.0859** |\n| gemini-2.5-flash-preview-tts | - | - | - | 0.7194 | 0.9511 | 0.0871 |\n| MOSS-TTSD (gemini_voice) | - | - | - | **0.7893** | **0.9655** | 0.0984 |\n| | | | | | | |\n| Doubao_Podcast | 0.8034 | 0.9606 | **0.0472** | - | - | - |\n| MOSS-TTSD (doubao_voice) | **0.8226** | **0.9630** | 0.0571 | - | - | - |\n\n### Subjective Evaluation\n\nFor open-source models, annotators are asked to score each sample pair in terms of speaker attribution accuracy, voice similarity, prosody, and overall quality. Following the methodology of the LMSYS Chatbot Arena, we compute Elo ratings and confidence intervals for each dimension.\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMOSS_MOSS-TTSD_readme_0415ca8fcaed.jpg)\n\nFor closed-source models, annotators are only asked to choose the overall preferred one in each pair, and we compute the win rate accordingly.\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMOSS_MOSS-TTSD_readme_14a13905875b.png)\n\n## 📚 More Information\n\n### 🌟 Community Projects\n\nThe MOSS-TTS community has been growing rapidly, and we’re delighted to showcase some outstanding projects and features built by community members:\n- **[ComfyUI-MOSS-TTS](https:\u002F\u002Fgithub.com\u002Frichservo\u002Fcomfyui-moss-tts)**: A MOSS-TTS extension for ComfyUI.\n- **[MOSS-TTS-OpenAI](https:\u002F\u002Fgithub.com\u002Fdasilva333\u002Fmoss-tts-openai)**: An OpenAI-compatible TTS API for MOSS-TTS.\n- **[AnyPod](https:\u002F\u002Fgithub.com\u002Frulerman\u002FAnyPod)**: A podcast generation tool using MOSS-TTS\u002FMOSS-TTSD as the backend.\n\n## License\n\nMOSS-TTSD is released under the Apache 2.0 license.\n\n## Citation\n\n```\n@misc{zhang2026mossttsdtextspokendialogue,\n      title={MOSS-TTSD: Text to Spoken Dialogue Generation}, \n      author={Yuqian Zhang and Donghua Yu and Zhengyuan Lin and Botian Jiang and Mingshu Chen and Yaozhou Jiang and Yiwei Zhao and Yiyang Zhang and Yucheng Yuan and Hanfu Chen and Kexin Huang and Jun Zhan and Cheng Chang and Zhaoye Fei and Shimin Li and Xiaogui Yang and Qinyuan Cheng and Xipeng Qiu},\n      year={2026},\n      eprint={2603.19739},\n      archivePrefix={arXiv},\n      primaryClass={cs.SD},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.19739}, \n}\n```\n\n## ⚠️ Usage Disclaimer\n\nThis project provides an open-source spoken dialogue synthesis model intended for academic research, educational purposes, and legitimate applications such as AI podcast 
production, assistive technologies, and linguistic research. Users must not use this model for unauthorized voice cloning, impersonation, fraud, scams, deepfakes, or any illegal activities, and should ensure compliance with local laws and regulations while upholding ethical standards. The developers assume no liability for any misuse of this model and advocate for responsible AI development and use, encouraging the community to uphold safety and ethical principles in AI research and applications. If you have any concerns regarding ethics or misuse, please contact us.\n\n\u003Cbr>\n\n# MOSS-TTS Family\n\n## Introduction\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMOSS_MOSS-TTSD_readme_47fbed5cd655.jpeg\" width=\"85%\" \u002F>\n\u003C\u002Fp>\n\nWhen a single piece of audio needs to **sound like a real person**, **pronounce every word accurately**, **switch speaking styles across content**, **remain stable over tens of minutes**, and **support dialogue, role‑play, and real‑time interaction**, a single TTS model is often not enough. The **MOSS‑TTS Family** breaks the workflow into five production‑ready models that can be used independently or composed into a complete pipeline.\n\n- **MOSS‑TTS**: MOSS-TTS is the flagship production TTS foundation model, centered on high-fidelity zero-shot voice cloning with controllable long-form synthesis, pronunciation, and multilingual\u002Fcode-switched speech. It serves as the core engine for scalable narration, dubbing, and voice-driven products.\n- **MOSS‑TTSD**: MOSS-TTSD is a production long-form dialogue model for expressive multi-speaker conversational audio at scale. It supports long-duration continuity, turn-taking control, and zero-shot voice cloning from short references for podcasts, audiobooks, commentary, dubbing, and entertainment dialogue.\n- **MOSS‑VoiceGenerator**: MOSS-VoiceGenerator is an open-source voice design model that creates speaker timbres directly from free-form text, without reference audio. It unifies timbre design, style control, and content synthesis, and can be used standalone or as a voice-design layer for downstream TTS.\n- **MOSS‑SoundEffect**: MOSS-SoundEffect is a high-fidelity text-to-sound model with broad category coverage and controllable duration for real content production. It generates stable audio from prompts across ambience, urban scenes, creatures, human actions, and music-like clips for film, games, interactive media, and data synthesis.\n- **MOSS‑TTS‑Realtime**: MOSS-TTS-Realtime is a context-aware, multi-turn streaming TTS model for real-time voice agents. 
By conditioning on dialogue history across both text and prior user acoustics, it delivers low-latency synthesis with coherent, consistent voice responses across turns.\n\n## Released Models\n\n| Model | Architecture | Size | Model Card | Hugging Face | ModelScope |\n|---|---|---:|---|---|---|\n| **MOSS-TTS** | `MossTTSDelay` | 8B | [![Model Card](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Card-View-blue?logo=markdown)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS\u002Fblob\u002Fmain\u002Fdocs\u002Fmoss_tts_model_card.md) | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTS) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-lightgrey?logo=modelscope)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTS) |\n|  | `MossTTSLocal` | 1.7B | [![Model Card](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Card-View-blue?logo=markdown)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS\u002Fblob\u002Fmain\u002Fdocs\u002Fmoss_tts_model_card.md) | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTS-Local-Transformer) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-lightgrey?logo=modelscope)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTS-Local-Transformer) |\n| **MOSS‑TTSD‑V1.0** | `MossTTSDelay` | 8B | [![Model Card](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Card-View-blue?logo=markdown)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS\u002Fblob\u002Fmain\u002Fdocs\u002Fmoss_ttsd_model_card.md) | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTSD-v1.0) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-lightgrey?logo=modelscope)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTSD-v1.0) |\n| **MOSS‑VoiceGenerator** | `MossTTSDelay` | 1.7B | [![Model Card](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Card-View-blue?logo=markdown)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS\u002Fblob\u002Fmain\u002Fdocs\u002Fmoss_voice_generator_model_card.md) | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-VoiceGenerator) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-lightgrey?logo=modelscope)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-VoiceGenerator) |\n| **MOSS‑SoundEffect** | `MossTTSDelay` | 8B | [![Model Card](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Card-View-blue?logo=markdown)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS\u002Fblob\u002Fmain\u002Fdocs\u002Fmoss_sound_effect_model_card.md) | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-SoundEffect) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-lightgrey?logo=modelscope)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-SoundEffect) |\n| 
**MOSS‑TTS‑Realtime** | `MossTTSRealtime` | 1.7B | [![Model Card](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Card-View-blue?logo=markdown)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS\u002Fblob\u002Fmain\u002Fdocs\u002Fmoss_tts_realtime_model_card.md) | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTS-Realtime) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-lightgrey?logo=modelscope)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTS-Realtime) |\n","\u003Cdiv align=\"center\">\n    \u003Ch1>\n    MOSS-TTSD：文本到口语对话生成\n    \u003C\u002Fh1>\n    \u003Cp>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMOSS_MOSS-TTSD_readme_21622e81d8aa.png\" alt=\"OpenMOSS Logo\" width=\"300\">\n    \u003Cp>\n    \u003C\u002Fp>\n    \u003Ca href=\"https:\u002F\u002Fmosi.cn\u002Fmodels\u002Fmoss-ttsd\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-Read%20More-green\" alt=\"blog\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.19739\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-2603.19739%20-red\" alt=\"paper\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTSD-v1.0\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20MOSS%20TTSD%20-v1.0-yellow\" alt=\"MOSS-TTSD-v1.0\">\u003C\u002Fa>\n     \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenMOSS-Team\u002FMOSS-TTSD\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Huggingface%20%20-space-orange\" alt=\"MOSS-TTSD-space\">\u003C\u002Fa>\n    \u003Ca href=\"\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAI%20Studio-Coming%20Soon-blue\" alt=\"AI Studio\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10+-orange\" alt=\"python\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTSD\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.0+-brightgreen\" alt=\"pytorch\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTSD\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg\" alt=\"apache-2.0\">\u003C\u002Fa>\n    \u003Cbr>\n\n\u003C\u002Fdiv>\n\n\n# MOSS-TTSD🪐\n\n[English](README.md) | [简体中文](README_zh.md)\n\n\u003C!-- **MOSS-TTSD** is a long-form spoken dialogue generation model that enables highly expressive multi-party conversational speech synthesis across multiple languages. It supports continuous long-duration generation, flexible multi-speaker dialogue control, and state-of-the-art zero-shot voice cloning with only short reference audio. MOSS-TTSD is designed for real-world long-form content creation, including podcasts, audiobooks, sports and esports commentary, dubbing, crosstalk, and entertainment scenarios. 
（about）-->\n\n\n## Overview\n \u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMOSS_MOSS-TTSD_readme_ee002edd316a.png\" alt=\"alt text\" width=\"330\">\n  \u003C\u002Fp>\n\nMOSS-TTSD是我们在开源[MOSS‑TTS家族](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS)中的长篇对话专家。虽然基础模型通常优先考虑高保真度的单人合成，但MOSS-TTSD的设计旨在弥合孤立音频样本与连贯、持续的人类互动之间的差距。\n该模型代表了从“文本转语音”到“剧本转对话”的范式转变。通过优先考虑多方互动的流畅性和情感细微之处，MOSS-TTSD将静态的对话脚本转化为动态、富有表现力的口头表演。它旨在为创作者和开发者提供一个强大的核心框架，使他们能够在不牺牲叙事连贯性的情况下，实现不同角色之间的无缝切换。\n无论是捕捉现场脱口秀的即兴活力，还是多语言戏剧的结构化复杂性，MOSS-TTSD都能在开源框架中提供专业级长篇内容创作所需的稳定性和表现力深度。\n\n\n## Highlights\n- **从独白到对话**: 与传统TTS优化朗读不同，MOSS-TTSD精通对话的节奏。它支持1至5位说话者，并具备灵活的控制能力，能够处理自然的轮流发言、重叠的语音模式以及不同角色个性的保持。\n- **超长上下文建模**: 模型不再局限于短句生成，而是专为长时间运行的稳定性而设计，单次会话可支持长达60分钟的连贯音频，且身份一致性始终如一。\n- **多样场景适应**: 针对高变化性的场景进行了微调，包括对话类媒体（AI播客）、动态解说（体育\u002F电竞）以及娱乐领域（有声书、配音和相声）。\n- **多语言与零样本能力**: 具备最先进的零样本语音克隆功能，仅需少量参考音频即可实现，同时在主要语言之间具有强大的跨语言性能，包括中文、英文、日语以及欧洲语言。\n\n\n## News 🚀\n- **[2026-03-18]** 我们为MOSS-TTSD v1.0提供了高效的端到端SGLang推理支持。\n- **[2026-03-06]** 我们为MOSS-TTSD v0.7增加了端到端SGLang推理支持。有关详细说明，请参阅[旧版v0.7文档](.\u002Flegacy\u002Fv0.7\u002FREADME.md)。\n- **[2026-02-10]** MOSS-TTSD v1.0正式发布！这一里程碑版本重新定义了长篇合成技术，支持单次会话长达60分钟的上下文长度，并能处理多方交互。它显著扩展了多语言能力和多样化应用场景。\n- **[2025-11-01]** MOSS-TTSD v0.7正式发布！v0.7大幅提升了音频质量、语音克隆能力和模型稳定性，新增了32 kHz高品质输出支持，极大延长了单次生成时长（960秒→1700秒）。\n- **[2025-09-09]** 我们引入了SGLang推理引擎，可将模型推理速度提升至原来的**16倍**。\n- **[2025-08-25]** 我们发布了XY分词器的32kHz版本。\n- **[2025-08-12]** 我们在MOSS-TTSD v0.5中增加了流式推理支持。\n- **[2025-07-29]** 我们为MOSS-TTSD v0.5提供了SiliconFlow API接口及使用示例。\n- **[2025-07-16]** 我们开源了MOSS-TTSD v0.5的微调代码，支持全参数微调、LoRA微调以及多节点训练。\n- **[2025-07-04]** MOSS-TTSD v0.5正式发布！v0.5增强了音色切换的准确性、语音克隆能力以及模型稳定性。\n- **[2025-06-20]** MOSS-TTSD v0正式发布！此外，我们还推出了一条名为Podever的播客生成流水线，可自动将PDF、URL或长文本文件转换为高质量播客。\n\n**注:** 对于MOSS-TTSD v0.7（包括端到端SGLang推理），请参阅[旧版v0.7文档](.\u002Flegacy\u002Fv0.7\u002FREADME.md)以获取详细说明。\n\n## 支持的语言\n\nMOSS-TTSD目前支持**20种语言**：\n\n| 语言 | 缩写 | 国旗 | 语言 | 缩写 | 国旗 | 语言 | 缩写 | 国旗 |\n|---|---|---|---|---|---|---|---|---|\n| 中文 | zh | 🇨🇳 | 英语 | en | 🇺🇸 | 德语 | de | 🇩🇪 |\n| 西班牙语 | es | 🇪🇸 | 法语 | fr | 🇫🇷 | 日语 | ja | 🇯🇵 |\n| 意大利语 | it | 🇮🇹 | 希伯来语 | he | 🇮🇱 | 韩语 | ko | 🇰🇷 |\n| 俄语 | ru | 🇷🇺 | 波斯语（法尔西语） | fa | 🇮🇷 | 阿拉伯语 | ar | 🇸🇦 |\n| 波兰语 | pl | 🇵🇱 | 葡萄牙语 | pt | 🇵🇹 | 捷克语 | cs | 🇨🇿 |\n| 丹麦语 | da | 🇩🇰 | 瑞典语 | sv | 🇸🇪 | 匈牙利语 | hu | 🇭🇺 |\n| 希腊语 | el | 🇬🇷 | 土耳其语 | tr | 🇹🇷 |  |  |  |\n\n## 安装\n\n要运行MOSS-TTSD，您需要安装必要的依赖项。您可以使用pip和conda来设置环境。\n\n### 使用conda\n\n```bash\nconda create -n moss_ttsd python=3.12 -y && conda activate moss_ttsd\npip install -r requirements.txt\npip install flash-attn\n```\n\n## 使用方法\n\n### 快速入门\n\nMOSS-TTSD 采用**续写**的工作流：为每位说话人提供参考音频及其转录文本作为前缀，再输入待生成的对话文本。模型会以每位说话人的身份继续生成内容。\n\n```python\nimport os\nfrom pathlib import Path\nimport torch\nimport soundfile as sf\nimport torchaudio\nfrom transformers import AutoModel, AutoProcessor\n\npretrained_model_name_or_path = \"OpenMOSS-Team\u002FMOSS-TTSD-v1.0\"\naudio_tokenizer_name_or_path = \"OpenMOSS-Team\u002FMOSS-Audio-Tokenizer\"\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\ndtype = torch.bfloat16 if device == \"cuda\" else torch.float32\n\nprocessor = AutoProcessor.from_pretrained(\n    pretrained_model_name_or_path,\n    trust_remote_code=True,\n    codec_path=audio_tokenizer_name_or_path,\n)\nprocessor.audio_tokenizer = processor.audio_tokenizer.to(device)\nprocessor.audio_tokenizer.eval()\n\nattn_implementation = \"flash_attention_2\" if device == \"cuda\" else \"sdpa\"\n# 如果您的环境中无法使用 flash_attention_2，请将其设置为 \"sdpa\"。\nmodel = 
AutoModel.from_pretrained(\n    pretrained_model_name_or_path,\n    trust_remote_code=True,\n    attn_implementation=attn_implementation,\n    torch_dtype=dtype,\n).to(device)\nmodel.eval()\n\n# --- 输入 ---\n\nprompt_audio_speaker1 = \"asset\u002Freference_02_s1.wav\"\nprompt_audio_speaker2 = \"asset\u002Freference_02_s2.wav\"\nprompt_text_speaker1 = \"[S1] 简而言之，我们踏上了一项使命，旨在让美国再次伟大，惠及所有美国人。\"\nprompt_text_speaker2 = \"[S2] 英伟达在六十年后首次重新定义了计算。事实上，IBM 的 Erwin 非常清楚，自 60 年代以来，计算机基本上没有太大变化。\"\n\ntext_to_generate = \"\"\"\n[S1] 听着，咱们谈谈正事吧。中国。我听说了一些情况。\n有人说他们正在迅速赶上来。到底怎么回事？\n他们的人工智能——是不是一种威胁？\n[S2] 坦白说，那里的创新速度确实非同寻常。\n他们既有顶尖的研究人员，也有强大的驱动力。\n[S1] 非同寻常？我不喜欢这个词。我希望我们自己才是非同寻常的。\n他们现在占上风了吗？\n[S2] 我不会说他们赢了，但他们的进展非常有前景。\n他们在构建大规模的计算集群，而且意志坚定。\n[S1] 有前景。又是这个词。我最讨厌它了。\n只要中国“有前景”，就意味着我们在落后。\n这太糟糕了，Jensen。彻底的灾难。\n\"\"\".strip()\n\n# --- 加载并重采样音频 ---\n\ntarget_sr = int(processor.model_config.sampling_rate)\naudio1, sr1 = sf.read(prompt_audio_speaker1, dtype=\"float32\", always_2d=True)\naudio2, sr2 = sf.read(prompt_audio_speaker2, dtype=\"float32\", always_2d=True)\nwav1 = torch.from_numpy(audio1).transpose(0, 1).contiguous()\nwav2 = torch.from_numpy(audio2).transpose(0, 1).contiguous()\n\nif wav1.shape[0] > 1:\n    wav1 = wav1.mean(dim=0, keepdim=True)\nif wav2.shape[0] > 1:\n    wav2 = wav2.mean(dim=0, keepdim=True)\nif sr1 != target_sr:\n    wav1 = torchaudio.functional.resample(wav1, sr1, target_sr)\nif sr2 != target_sr:\n    wav2 = torchaudio.functional.resample(wav2, sr2, target_sr)\n\n# --- 构建对话 ---\n\nreference_audio_codes = processor.encode_audios_from_wav([wav1, wav2], sampling_rate=target_sr)\nconcat_prompt_wav = torch.cat([wav1, wav2], dim=-1)\nprompt_audio = processor.encode_audios_from_wav([concat_prompt_wav], sampling_rate=target_sr)[0]\n\nfull_text = f\"{prompt_text_speaker1} {prompt_text_speaker2} {text_to_generate}\"\n\nconversations = [\n    [\n        processor.build_user_message(\n            text=full_text,\n            reference=reference_audio_codes,\n        ),\n        processor.build_assistant_message(\n            audio_codes_list=[prompt_audio]\n        ),\n    ],\n]\n\n# --- 推理 ---\n\nbatch_size = 1\n\nsave_dir = Path(\"output\")\nsave_dir.mkdir(exist_ok=True, parents=True)\nsample_idx = 0\nwith torch.no_grad():\n    for start in range(0, len(conversations), batch_size):\n        batch_conversations = conversations[start : start + batch_size]\n        batch = processor(batch_conversations, mode=\"continuation\")\n        input_ids = batch[\"input_ids\"].to(device)\n        attention_mask = batch[\"attention_mask\"].to(device)\n\n        outputs = model.generate(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            max_new_tokens=2000,\n        )\n\n        for message in processor.decode(outputs):\n            for seg_idx, audio in enumerate(message.audio_codes_list):\n                sf.write(\n                    save_dir \u002F f\"{sample_idx}_{seg_idx}.wav\",\n                    audio.detach().cpu().to(torch.float32).numpy(),\n                    int(processor.model_config.sampling_rate),\n                )\n            sample_idx += 1\n\n```\n\n### 批量推理\n\n您可以使用提供的推理脚本进行批量推理。该脚本会自动使用所有可见的 GPU。您可以通过 `export CUDA_VISIBLE_DEVICES=\u003Cdevice_ids>` 来控制 GPU 的可见性。\n\n```bash\npython inference.py \\\n  --model_path OpenMOSS-Team\u002FMOSS-TTSD-v1.0 \\\n  --codec_model_path OpenMOSS-Team\u002FMOSS-Audio-Tokenizer \\\n  --input_jsonl \u002Fpath\u002Fto\u002Finput.jsonl \\\n  --save_dir outputs \\\n  --mode voice_clone_and_continuation \\\n  
--batch_size 1 \\\n  --text_normalize\n```\n\n参数说明：\n\n- `--model_path`: MOSS-TTSD 的模型路径或 HuggingFace 模型 ID。\n- `--codec_model_path`: MOSS-Audio-Tokenizer 的模型路径或 HuggingFace 模型 ID。\n- `--input_jsonl`: 包含对话脚本和说话人提示的输入 JSONL 文件路径。\n- `--save_dir`: 生成的音频文件保存目录。\n- `--mode`: 推理模式。可选值：`generation`、`continuation`、`voice_clone`、`voice_clone_and_continuation`。我们推荐使用 `voice_clone_and_continuation` 以获得最佳的语音克隆效果。\n- `--batch_size`: 每批处理的样本数量（默认为 `1`）。\n- `--max_new_tokens`: 最大生成新标记数。用于控制总生成音频长度（1 秒 ≈ 12.5 个标记）。\n- `--temperature`: 采样温度（默认为 `1.1`）。\n- `--top_p`: Top-p 采样阈值（默认为 `0.9`）。\n- `--top_k`: Top-k 采样阈值（默认为 `50`）。\n- `--repetition_penalty`: 重复惩罚（默认为 `1.1`）。\n- `--text_normalize`: 对输入文本进行归一化处理（**建议始终启用**）。\n- `--sample_rate_normalize`: 在编码前将提示音频重采样到最低采样率（**当使用 2 名或更多说话人时建议启用**）。\n\n#### JSONL 输入格式\n\n输入 JSONL 文件应每行包含一个 JSON 对象。MOSS-TTSD 支持每段对话包含 1 至 5 名说话人。在 `text` 字段中使用 `[S1]`–`[S5]` 标签，并为每位说话人提供对应的 `prompt_audio_speakerN` 和 `prompt_text_speakerN`：\n```json\n{\n  \"base_path\": \"\u002Fpath\u002Fto\u002Faudio\u002Ffiles\",\n  \"text\": \"[S1]Speaker 1 dialogue[S2]Speaker 2 dialogue[S3]...[S4]...[S5]...\",\n  \"prompt_audio_speaker1\": \"path\u002Fto\u002Fspeaker1_audio.wav\",\n  \"prompt_text_speaker1\": \"Reference text for speaker 1 voice cloning\",\n  \"prompt_audio_speaker2\": \"path\u002Fto\u002Fspeaker2_audio.wav\",\n  \"prompt_text_speaker2\": \"Reference text for speaker 2 voice cloning\",\n  \"...\": \"...\",\n  \"prompt_audio_speaker5\": \"path\u002Fto\u002Fspeaker5_audio.wav\",\n  \"prompt_text_speaker5\": \"Reference text for speaker 5 voice cloning\"\n}\n```\n\n### 使用 SGLang 加速推理\n\nMOSS-TTSD v1.0 支持使用 OpenMOSS 深度扩展的 [SGLang](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002Fsglang) 运行 MOSS-TTSD 与 MOSS-Audio-Tokenizer 的融合模型，从而实现高效的音频生成推理。\n\n#### 环境设置\n\n首先，克隆与 MOSS-TTSD v1.0 兼容的 SGLang 分支。\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002Fsglang -b moss-ttsd-v1.0-with-cat\n```\n\n##### 使用 venv\n\n```bash\npython -m venv moss_ttsd_sglang\nsource moss_ttsd_sglang\u002Fbin\u002Factivate\npip install .\u002Fsglang\u002Fpython[all]\n```\n\n##### 使用 conda\n\n```bash\nconda create -n moss_ttsd_sglang python=3.12\nconda activate moss_ttsd_sglang\npip install .\u002Fsglang\u002Fpython[all]\n```\n\n#### 端到端推理服务\n\n##### 启动推理服务器\n\n在启动服务之前，首先下载 [MOSS-TTSD-v1.0](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTSD-v1.0) 和 [MOSS-Audio-Tokenizer](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Audio-Tokenizer)。\n\n```bash\ngit clone https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTSD-v1.0\ngit clone https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Audio-Tokenizer\n```\n\n或者：\n\n```bash\nhf download OpenMOSS-Team\u002FMOSS-TTSD-v1.0 --local-dir .\u002FMOSS-TTSD-v1.0\nhf download OpenMOSS-Team\u002FMOSS-Audio-Tokenizer --local-dir .\u002FMOSS-Audio-Tokenizer\n```\n\n下载完成后，运行以下命令将 MOSS-TTSD v1.0 和 MOSS-Audio-Tokenizer 融合为一个可由 SGLang 加载的单目录模型。融合后，模型默认使用 `voice_clone_and_continuation` 推理模式：\n\n```bash\npython scripts\u002Ffuse_moss_tts_delay_with_codec.py \\\n  --model-path \u003Cpath-to-moss-ttsd-v1.0> \\\n  --codec-model-path \u003Cpath-to-moss-audio-tokenizer> \\\n  --save-path \u003Cpath-to-fused-model>\n```\n\n然后使用以下命令启动推理服务器：\n\n```bash\nsglang serve \\\n  --model-path \u003Cpath-to-fused-model> \\\n  --delay-pattern \\\n  --trust-remote-code \\\n  --port 30000 --host 0.0.0.0\n```\n\n> 第一次启动服务可能需要较长时间进行编译。当看到 `The server is fired up and ready to roll!` 时，服务即已准备就绪。启动后的首次请求仍可能触发较长的编译过程，这是正常现象，请耐心等待。\n\n> 
**提示**：端到端推理服务在运行过程中可能会导致显存碎片化。如果 GPU 显存紧张，建议在启动 SGLang 时使用 `--mem-fraction-static` 参数，以预留足够的中间张量空间。\n\n##### 发送生成请求\n\n该服务 API 兼容标准的多模态文本生成接口。返回的 JSON 中的 `text` 字段包含 base64 编码的 WAV 音频。\n\n目前仓库提供了一个最小化的请求示例脚本：\n\n```bash\npython scripts\u002Frequest_sglang_generation.py\n```\n\n该脚本将：\n\n- 默认向 `http:\u002F\u002Flocalhost:30000\u002Fgenerate` 发送请求\n- 使用仓库中的 `asset\u002Freference_02_s1.wav` 和 `asset\u002Freference_02_s2.wav` 作为参考音频\n- 将返回的音频保存到 `outputs\u002Foutput.wav`\n\n如果您需要更改参考音频、输入文本、采样参数或服务器 URL，可以直接编辑 `scripts\u002Frequest_sglang_generation.py` 中的相应常量。\n\n## 评估\n\n### 客观评价（TTSD-eval）\n\n我们引入了一个强大的评估框架，利用MMS-FA进行词级对齐和话语分割，并使用wespeaker提取说话人嵌入，以计算说话人归属准确率（ACC）和说话人相似度（SIM）。代码和数据请参阅[TTSD-eval](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FTTSD-eval)。\n\n\u003Cbr>\n\n| 模型 | 中文 - SIM | 中文 - ACC | 中文 - WER | 英文 - SIM | 英文 - ACC | 英文 - WER |\n| :--- | :---: | :---: | :---: | :---: | :---: | :---: |\n| **与开源模型的对比** | | | | | | |\n| MOSS-TTSD | **0.7949** | **0.9587** | **0.0485** | **0.7326** | **0.9626** | 0.0988 |\n| MOSS-TTSD v0.7 | 0.7423 | 0.9391 | 0.0517 | 0.6743 | 0.9266 | 0.1612 |\n| Vibevoice 7B | 0.7590 | 0.9222 | 0.0570 | 0.7140 | 0.9554 | **0.0946** |\n| Vibevoice 1.5B | 0.7415 | 0.8798 | 0.0818 | 0.6961 | 0.9353 | 0.1133 |\n| FireRedTTS2 | 0.7383 | 0.9022 | 0.0768 | - | - | - |\n| Higgs Audio V2 | - | - | - | 0.6860 | 0.9025 | 0.2131 |\n| **与专有模型的对比** | | | | | | |\n| Eleven V3 | 0.6970 | 0.9653 | **0.0363** | 0.6730 | 0.9498 | **0.0824** |\n| MOSS-TTSD (elevenlabs_voice) | **0.8165** | **0.9736** | 0.0391 | **0.7304** | **0.9565** | 0.1005 |\n| | | | | | | |\n| gemini-2.5-pro-preview-tts | - | - | - | 0.6786 | 0.9537 | **0.0859** |\n| gemini-2.5-flash-preview-tts | - | - | - | 0.7194 | 0.9511 | 0.0871 |\n| MOSS-TTSD (gemini_voice) | - | - | - | **0.7893** | **0.9655** | 0.0984 |\n| | | | | | | |\n| Doubao_Podcast | 0.8034 | 0.9606 | **0.0472** | - | - | - |\n| MOSS-TTSD (doubao_voice) | **0.8226** | **0.9630** | 0.0571 | - | - | - |\n\n### 主观评价\n\n对于开源模型，标注者被要求从说话人归属准确率、声音相似度、韵律和整体质量等方面对每一对样本进行打分。参照LMSYS Chatbot Arena的方法论，我们为每个维度计算Elo评分及置信区间。\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMOSS_MOSS-TTSD_readme_0415ca8fcaed.jpg)\n\n对于闭源模型，标注者只需在每一对中选择总体更偏好的一方，并据此计算胜率。\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMOSS_MOSS-TTSD_readme_14a13905875b.png)\n\n## 📚 更多信息\n\n### 🌟 社区项目\n\nMOSS-TTS社区发展迅速，我们很高兴展示一些由社区成员构建的优秀项目和功能：\n- **[ComfyUI-MOSS-TTS](https:\u002F\u002Fgithub.com\u002Frichservo\u002Fcomfyui-moss-tts)**：ComfyUI的MOSS-TTS扩展。\n- **[MOSS-TTS-OpenAI](https:\u002F\u002Fgithub.com\u002Fdasilva333\u002Fmoss-tts-openai)**：兼容OpenAI接口的MOSS-TTS API。\n- **[AnyPod](https:\u002F\u002Fgithub.com\u002Frulerman\u002FAnyPod)**：使用MOSS-TTS\u002FMOSS-TTSD作为后端的播客生成工具。\n\n## 许可证\n\nMOSS-TTSD根据Apache 2.0许可证发布。\n\n## 引用\n\n```\n@misc{zhang2026mossttsdtextspokendialogue,\n      title={MOSS-TTSD: Text to Spoken Dialogue Generation}, \n      author={Yuqian Zhang and Donghua Yu and Zhengyuan Lin and Botian Jiang and Mingshu Chen and Yaozhou Jiang and Yiwei Zhao and Yiyang Zhang and Yucheng Yuan and Hanfu Chen and Kexin Huang and Jun Zhan and Cheng Chang and Zhaoye Fei and Shimin Li and Xiaogui Yang and Qinyuan Cheng and Xipeng Qiu},\n      year={2026},\n      eprint={2603.19739},\n      archivePrefix={arXiv},\n      primaryClass={cs.SD},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.19739}, \n}\n```\n\n## ⚠️ 
使用免责声明\n\n本项目提供了一个开源的口语对话合成模型，旨在用于学术研究、教育目的以及合法的应用场景，例如AI播客制作、辅助技术开发和语言学研究。用户不得将该模型用于未经授权的语音克隆、冒充他人、欺诈、诈骗、深度伪造或其他任何非法活动，并应在遵守当地法律法规的同时，坚持道德标准。开发者对该模型的任何滥用不承担任何责任，并倡导负责任的人工智能开发与使用，鼓励社区在人工智能研究和应用中坚守安全与伦理原则。如果您对伦理或滥用有任何疑虑，请联系我们。\n\n\u003Cbr>\n\n# MOSS-TTS家族\n\n## 简介\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMOSS_MOSS-TTSD_readme_47fbed5cd655.jpeg\" width=\"85%\" \u002F>\n\u003C\u002Fp>\n\n当一段音频需要**听起来像真人**、**准确地发音每一个词**、**在不同内容间切换说话风格**、**保持数十分钟的稳定性**，并且**支持对话、角色扮演和实时交互**时，单一的TTS模型往往难以胜任。**MOSS‑TTS家族**将工作流程拆分为五个可独立使用或组合成完整流水线的生产就绪模型。\n\n- **MOSS‑TTS**：MOSS-TTS是旗舰级的生产用TTS基础模型，专注于高保真零样本语音克隆，具备可控的长文本合成、发音能力和多语言\u002F语码转换能力。它是规模化叙述、配音和语音驱动产品的核心引擎。\n- **MOSS‑TTSD**：MOSS-TTSD是用于大规模表达性多说话人对话音频的生产级长文本对话模型。它支持长时间连续性、轮流发言控制，以及基于短参考的零样本语音克隆，适用于播客、有声书、解说、配音和娱乐对话。\n- **MOSS‑VoiceGenerator**：MOSS-VoiceGenerator是一个开源的语音设计模型，无需参考音频即可直接从自由文本创建说话人音色。它整合了音色设计、风格控制和内容合成功能，可单独使用，也可作为下游TTS的语音设计层。\n- **MOSS‑SoundEffect**：MOSS-SoundEffect是一个高保真文本转声音模型，覆盖广泛的类别，且持续时间可控，适用于真实内容制作。它可根据提示生成稳定的环境音效、城市场景、生物、人类动作以及类似音乐的片段，广泛应用于电影、游戏、互动媒体和数据合成等领域。\n- **MOSS‑TTS‑Realtime**：MOSS-TTS-Realtime是一个上下文感知、多轮流式传输的TTS模型，专为实时语音助手设计。通过结合文本对话历史和先前用户的声学特征进行条件化，它能够在低延迟下实现连贯、一致的语音响应。\n\n## 已发布模型\n\n| 模型 | 架构 | 参数量 | 模型卡片 | Hugging Face | ModelScope |\n|---|---|---:|---|---|---|\n| **MOSS-TTS** | `MossTTSDelay` | 80亿 | [![模型卡片](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Card-View-blue?logo=markdown)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS\u002Fblob\u002Fmain\u002Fdocs\u002Fmoss_tts_model_card.md) | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTS) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-lightgrey?logo=modelscope)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTS) |\n|  | `MossTTSLocal` | 17亿 | [![模型卡片](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Card-View-blue?logo=markdown)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS\u002Fblob\u002Fmain\u002Fdocs\u002Fmoss_tts_model_card.md) | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTS-Local-Transformer) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-lightgrey?logo=modelscope)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTS-Local-Transformer) |\n| **MOSS‑TTSD‑V1.0** | `MossTTSDelay` | 80亿 | [![模型卡片](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Card-View-blue?logo=markdown)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS\u002Fblob\u002Fmain\u002Fdocs\u002Fmoss_ttsd_model_card.md) | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTSD-v1.0) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-lightgrey?logo=modelscope)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTSD-v1.0) |\n| **MOSS‑VoiceGenerator** | `MossTTSDelay` | 17亿 | [![模型卡片](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Card-View-blue?logo=markdown)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS\u002Fblob\u002Fmain\u002Fdocs\u002Fmoss_voice_generator_model_card.md) | [![Hugging 
Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-VoiceGenerator) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-lightgrey?logo=modelscope)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-VoiceGenerator) |\n| **MOSS‑SoundEffect** | `MossTTSDelay` | 80亿 | [![模型卡片](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Card-View-blue?logo=markdown)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS\u002Fblob\u002Fmain\u002Fdocs\u002Fmoss_sound_effect_model_card.md) | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-SoundEffect) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-lightgrey?logo=modelscope)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-SoundEffect) |\n| **MOSS‑TTS‑Realtime** | `MossTTSRealtime` | 17亿 | [![模型卡片](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Card-View-blue?logo=markdown)](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS\u002Fblob\u002Fmain\u002Fdocs\u002Fmoss_tts_realtime_model_card.md) | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTS-Realtime) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-lightgrey?logo=modelscope)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTS-Realtime) |","# MOSS-TTSD 快速上手指南\n\nMOSS-TTSD 是一款专注于长篇幅多角色对话生成的开源模型。它支持从“文本转语音”到“剧本转对话”的范式转变，能够生成具有自然轮替、情感起伏和多说话人身份一致性的连续音频（单次会话最长支持 60 分钟）。适用于播客、有声书、体育解说及多角色广播剧等场景。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐) 或 macOS\n*   **Python 版本**: 3.10+ (官方示例使用 3.12)\n*   **GPU**: 推荐使用 NVIDIA GPU，显存建议 16GB 以上以支持长上下文生成\n*   **深度学习框架**: PyTorch 2.0+\n*   **其他依赖**: `flash-attn` (用于加速推理，需 CUDA 环境)\n\n## 安装步骤\n\n建议使用 `conda` 创建独立的虚拟环境以避免依赖冲突。\n\n### 1. 创建并激活环境\n\n```bash\nconda create -n moss_ttsd python=3.12 -y && conda activate moss_ttsd\n```\n\n### 2. 安装基础依赖\n\n为了获得更快的下载速度，国内用户可配置 pip 使用清华或阿里镜像源：\n\n```bash\npip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 3. 
\n### 3. Install Flash Attention (optional but recommended)\n\nIf your environment has CUDA and you want the best inference performance, install `flash-attn`:\n\n```bash\npip install flash-attn --no-build-isolation\n```\n*Note: if the build fails or CUDA is unavailable, set `attn_implementation` to \"sdpa\" in the code instead.*\n
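\nOne way to encode that fallback (a small sketch of the note above, not taken from the official scripts) is to probe for the `flash_attn` package and pick the backend accordingly:\n\n```python\n# Prefer FlashAttention 2 when the package is importable; otherwise fall\n# back to PyTorch's built-in scaled-dot-product attention (\"sdpa\").\ntry:\n    import flash_attn  # noqa: F401\n    attn_implementation = \"flash_attention_2\"\nexcept ImportError:\n    attn_implementation = \"sdpa\"\n```\n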
\n## Basic Usage\n\nMOSS-TTSD uses a **continuation** workflow: you supply a reference audio clip for each speaker together with its matching text prefix, and the model continues the dialogue from that context.\n\nBelow is a minimal single-script example that loads the model and generates a two-speaker conversation:\n\n```python\nfrom pathlib import Path\nimport torch\nimport soundfile as sf\nimport torchaudio\nfrom transformers import AutoModel, AutoProcessor\n\n
# Model paths\npretrained_model_name_or_path = \"OpenMOSS-Team\u002FMOSS-TTSD-v1.0\"\naudio_tokenizer_name_or_path = \"OpenMOSS-Team\u002FMOSS-Audio-Tokenizer\"\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\ndtype = torch.bfloat16 if device == \"cuda\" else torch.float32\n\n
# Load the processor and the audio tokenizer\nprocessor = AutoProcessor.from_pretrained(\n    pretrained_model_name_or_path,\n    trust_remote_code=True,\n    codec_path=audio_tokenizer_name_or_path,\n)\nprocessor.audio_tokenizer = processor.audio_tokenizer.to(device)\nprocessor.audio_tokenizer.eval()\n\n
# Load the main model\nattn_implementation = \"flash_attention_2\" if device == \"cuda\" else \"sdpa\"\nmodel = AutoModel.from_pretrained(\n    pretrained_model_name_or_path,\n    trust_remote_code=True,\n    attn_implementation=attn_implementation,\n    torch_dtype=dtype,\n).to(device)\nmodel.eval()\n\n
# --- Prepare the inputs ---\n\n# 1. Reference audio (replace with your local wav file paths)\nprompt_audio_speaker1 = \"asset\u002Freference_02_s1.wav\"\nprompt_audio_speaker2 = \"asset\u002Freference_02_s2.wav\"\n\n
# 2. Text prefixes matching the references (tag speakers as [S1], [S2], ...)\nprompt_text_speaker1 = \"[S1] In short, we embarked on a mission to make America great again for all Americans.\"\nprompt_text_speaker2 = \"[S2] NVIDIA reinvented computing for the first time after 60 years.\"\n\n
# 3. The dialogue text to generate\ntext_to_generate = \"\"\"\n[S1] Listen, let's talk business. China. I'm hearing things.\nPeople are saying they're catching up. Fast. What's the real scoop?\nTheir AI—is it a threat?\n[S2] Well, the pace of innovation there is extraordinary, honestly.\nThey have the researchers, and they have the drive.\n[S1] Extraordinary? I don't like that. I want us to be extraordinary.\nAre they winning?\n\"\"\".strip()\n\n
# --- Audio preprocessing ---\n\ntarget_sr = int(processor.model_config.sampling_rate)\n\ndef load_and_resample_audio(path, target_sr):\n    # Read as float32 with shape (frames, channels), then put channels first.\n    audio, sr = sf.read(path, dtype=\"float32\", always_2d=True)\n    wav = torch.from_numpy(audio).transpose(0, 1).contiguous()\n    # Downmix multi-channel audio to mono.\n    if wav.shape[0] > 1:\n        wav = wav.mean(dim=0, keepdim=True)\n    if sr != target_sr:\n        wav = torchaudio.functional.resample(wav, sr, target_sr)\n    return wav\n\nwav1 = load_and_resample_audio(prompt_audio_speaker1, target_sr)\nwav2 = load_and_resample_audio(prompt_audio_speaker2, target_sr)\n\n
# Encode the reference audio\nreference_audio_codes = processor.encode_audios_from_wav([wav1, wav2], sampling_rate=target_sr)\nconcat_prompt_wav = torch.cat([wav1, wav2], dim=-1)\nprompt_audio = processor.encode_audios_from_wav([concat_prompt_wav], sampling_rate=target_sr)[0]\n\n
# Build the full text context\nfull_text = f\"{prompt_text_speaker1} {prompt_text_speaker2} {text_to_generate}\"\n\nconversations = [\n    [\n        processor.build_user_message(\n            text=full_text,\n            reference=reference_audio_codes,\n        ),\n        processor.build_assistant_message(\n            audio_codes_list=[prompt_audio]\n        ),\n    ],\n]\n\n
# --- Generation ---\n\nsave_dir = Path(\"output\")\nsave_dir.mkdir(exist_ok=True, parents=True)\nsample_idx = 0\n\nwith torch.no_grad():\n    batch = processor(conversations, mode=\"continuation\")\n    input_ids = batch[\"input_ids\"].to(device)\n    attention_mask = batch[\"attention_mask\"].to(device)\n\n    outputs = model.generate(\n        input_ids=input_ids,\n        attention_mask=attention_mask,\n        max_new_tokens=2000,  # controls the generated length\n    )\n\n    for message in processor.decode(outputs):\n        for seg_idx, audio in enumerate(message.audio_codes_list):\n            output_path = save_dir \u002F f\"{sample_idx}_{seg_idx}.wav\"\n            sf.write(\n                output_path,\n                audio.detach().cpu().to(torch.float32).numpy(),\n                int(processor.model_config.sampling_rate),\n            )\n            print(f\"Audio saved to: {output_path}\")\n        sample_idx += 1\n```\n\n
### Batch Inference\n\nTo process larger datasets, use the official batch script, which automatically uses every visible GPU.\n\n```bash\npython inference.py \\\n  --model_path OpenMOSS-Team\u002FMOSS-TTSD-v1.0 \\\n  --codec_model_path OpenMOSS-Team\u002FMOSS-Audio-Tokenizer \\\n  --input_jsonl \u002Fpath\u002Fto\u002Finput.jsonl \\\n  --save_dir outputs \\\n  --mode voice_clone_and_continuation \\\n  --batch_size 1 \\\n  --text_normalize\n```\n\n**Arguments:**\n
*   `--input_jsonl`: path to the input file, in JSONL format; each line carries the reference audio paths, reference texts, and the text to generate (see the sketch below).\n
*   `--mode`: set to `voice_clone_and_continuation` to enable voice cloning plus dialogue continuation.
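\n\nThe exact JSONL schema is defined by `inference.py`; as an illustration only (the field names below are hypothetical, check the script for the real keys), one line pairing two reference clips with a dialogue to synthesize could be produced like this:\n\n```python\nimport json\n\n# Hypothetical field names, for illustration only; inference.py defines\n# the actual schema expected by --input_jsonl.\nexample = {\n    \"reference_audio\": [\"asset\u002Freference_02_s1.wav\", \"asset\u002Freference_02_s2.wav\"],\n    \"reference_text\": [\n        \"[S1] In short, we embarked on a mission to make America great again for all Americans.\",\n        \"[S2] NVIDIA reinvented computing for the first time after 60 years.\",\n    ],\n    \"text\": \"[S1] Listen, let's talk business. [S2] Well, the pace of innovation there is extraordinary.\",\n}\n\nwith open(\"input.jsonl\", \"w\", encoding=\"utf-8\") as f:\n    f.write(json.dumps(example, ensure_ascii=False) + \"\\n\")\n```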
","A small podcast production team is trying to quickly turn a 40-minute two-person interview script into an audio episode with natural emotional dynamics, in order to prototype a new show.\n\n### Without MOSS-TTSD\n
- **Stilted speaker switching**: conventional TTS generates one speaker's lines sentence by sentence, so host and guest voices must be swapped across different models or configurations, breaking the conversational rhythm and any sense of real exchange.\n
- **Poor long-form consistency**: past roughly 10 minutes of generated audio, a speaker's timbre, pace, and emotion start to drift, and post-production needs hours of manual editing and tone-matching.\n
- **High cloning barrier**: imitating a specific guest's voice normally requires minutes of clean, high-quality recordings for training, which is out of reach for remote interviews that yield only a few seconds of reference audio.\n
- **Limited multilingual support**: when the interview mixes in English terms or Japanese quotations, existing tools mispronounce them or force an engine switch, disrupting the listening flow.\n\n### With MOSS-TTSD\n
- **Native dialogue-stream generation**: MOSS-TTSD treats the script itself as a dialogue, handling natural turn-taking and overlapping interjections for two or more speakers, so the audio has the breathing room and rhythm of a real conversation.\n
- **Stability over very long contexts**: thanks to long-context modeling, MOSS-TTSD can emit 60 minutes of coherent audio in a single pass, keeping each speaker's timbre and persona consistent across the whole episode.\n
- **Minimal zero-shot cloning**: a 5-10 second clip of a guest is enough for MOSS-TTSD to reproduce their voice and weave it into the dialogue, turning remote material into a lively in-studio conversation.\n
- **Seamless multilingual mixing**: in mixed Chinese, English, and Japanese contexts, MOSS-TTSD keeps a unified tone and delivers natural cross-language speech with no extra configuration.\n\n
MOSS-TTSD compresses a multi-speaker podcast production process that used to take days of coordination down to minutes, letting creators focus on content rather than audio engineering.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMOSS_MOSS-TTSD_ee002edd.png","OpenMOSS","OpenMOSS (SII)","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FOpenMOSS_f805d84f.png","OpenMOSS Team is a research group under the Shanghai Innovation Institution (SII), working in close collaboration with Fudan University and MOSI Intelligence.",null,"openmoss@sii.edu.cn","http:\u002F\u002Fopenmoss.ai\u002F","https:\u002F\u002Fgithub.com\u002FOpenMOSS",[84],{"name":85,"color":86,"percentage":87},"Python","#3572A5",100,1236,119,"2026-04-08T00:54:50","Apache-2.0",4,"Linux, macOS, Windows","Required for high-performance inference: a CUDA-capable NVIDIA GPU; a card supporting Flash Attention 2 (e.g. Ampere architecture or newer) is recommended, with 16GB+ VRAM for long-context generation","Not specified (32GB+ recommended for long audio and batch inference)",{"notes":97,"python":98,"dependencies":99},"1. Installing into a conda environment with Python 3.12 is recommended.\n2. Install the flash-attn library for best performance on CUDA devices; fall back to sdpa mode if the environment does not support it.\n3. The model supports up to 60 minutes of continuous dialogue generation, which is demanding on VRAM and RAM.\n4. MOSS-Audio-Tokenizer must be downloaded separately as the audio codec dependency.\n5. Multi-GPU batch inference is supported; control the visible GPUs via CUDA_VISIBLE_DEVICES.","3.10+",[100,101,102,103,104,105],"torch>=2.0","transformers","flash-attn","soundfile","torchaudio","SGLang",[15,47],[108,109,110,111,112,113],"large-language-models","text-to-speech","speech-dialogue-generation","finetune","sglang","streaming","2026-03-27T02:49:30.150509","2026-04-08T14:25:10.604262",[],[]]