[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-huggingface--speech-to-speech":3,"tool-huggingface--speech-to-speech":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",156033,2,"2026-04-14T23:32:00",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":77,"owner_email":77,"owner_twitter":73,"owner_website":78,"owner_url":79,"languages":80,"stars":89,"forks":90,"last_commit_at":91,"license":92,"difficulty_score":10,"env_os":93,"env_gpu":94,"env_ram":95,"env_deps":96,"category_tags":110,"github_topics":112,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":122,"updated_at":123,"faqs":124,"releases":154},7660,"huggingface\u002Fspeech-to-speech","speech-to-speech","Build local voice agents with open-source models","speech-to-speech 是一个开源项目，旨在帮助开发者在本地构建完全自主的语音智能体。它通过串联语音活动检测（VAD）、语音转文字（STT）、大语言模型（LM）和文字转语音（TTS）四个核心模块，形成了一套完整的端到端语音交互流水线，让用户无需依赖云端 API 即可实现低延迟、高隐私的实时对话系统。\n\n该项目主要解决了传统语音助手依赖专有云服务、数据隐私难以保障以及定制灵活性不足的问题。其最大的技术亮点在于高度的模块化设计：用户可以自由组合 Hugging Face 生态中的各类开源模型，例如选用 Whisper 进行高精度识别，搭配指令遵循大模型处理逻辑，并利用 MeloTTS 或针对 Apple Silicon 优化的 Kokoro 生成自然语音。此外，它还原生支持 Docker 部署及 macOS 本地运行，特别适配苹果芯片以实现高效推理。\n\nspeech-to-speech 
非常适合希望深入探索本地化 AI 应用的开发者、研究人员以及对数据隐私有严格要求的技术爱好者。无论是想快速原型验证新的语音交互方案，还是需要在离线环境中部署智能助手，它都提供了灵活且强大的基础架构","speech-to-speech 是一个开源项目，旨在帮助开发者在本地构建完全自主的语音智能体。它通过串联语音活动检测（VAD）、语音转文字（STT）、大语言模型（LM）和文字转语音（TTS）四个核心模块，形成了一套完整的端到端语音交互流水线，让用户无需依赖云端 API 即可实现低延迟、高隐私的实时对话系统。\n\n该项目主要解决了传统语音助手依赖专有云服务、数据隐私难以保障以及定制灵活性不足的问题。其最大的技术亮点在于高度的模块化设计：用户可以自由组合 Hugging Face 生态中的各类开源模型，例如选用 Whisper 进行高精度识别，搭配指令遵循大模型处理逻辑，并利用 MeloTTS 或针对 Apple Silicon 优化的 Kokoro 生成自然语音。此外，它还原生支持 Docker 部署及 macOS 本地运行，特别适配苹果芯片以实现高效推理。\n\nspeech-to-speech 非常适合希望深入探索本地化 AI 应用的开发者、研究人员以及对数据隐私有严格要求的技术爱好者。无论是想快速原型验证新的语音交互方案，还是需要在离线环境中部署智能助手，它都提供了灵活且强大的基础架构，让构建专属的“贾维斯”变得触手可及。","\u003Cdiv align=\"center\">\n  \u003Cdiv>&nbsp;\u003C\u002Fdiv>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_speech-to-speech_readme_b4af2adb8a01.png\" width=\"600\"\u002F> \n\u003C\u002Fdiv>\n\n# Speech To Speech: Build local voice agents with open-source models\n\n## 📖 Quick Index\n* [Approach](#approach)\n  - [Structure](#structure)\n  - [Modularity](#modularity)\n* [Setup](#setup)\n* [Usage](#usage)\n  - [Docker Server approach](#docker-server)\n  - [Server\u002FClient approach](#serverclient-approach)\n  - [Local approach](#local-approach-running-on-mac)\n* [Command-line usage](#command-line-usage)\n  - [Model parameters](#model-parameters)\n  - [Generation parameters](#generation-parameters)\n  - [Notable parameters](#notable-parameters)\n\n## Approach\n\n### Structure\nThis repository implements a speech-to-speech cascaded pipeline consisting of the following parts:\n1. **Voice Activity Detection (VAD)**\n2. **Speech to Text (STT)**\n3. **Language Model (LM)**\n4. **Text to Speech (TTS)**\n\n### Modularity\nThe pipeline provides a fully open and modular approach, with a focus on leveraging models available through the Transformers library on the Hugging Face hub. 
The code is designed for easy modification, and we already support device-specific and external library implementations:\n\n**VAD** \n- [Silero VAD v5](https:\u002F\u002Fgithub.com\u002Fsnakers4\u002Fsilero-vad)\n\n**STT**\n- Any [Whisper](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fen\u002Fmodel_doc\u002Fwhisper) model checkpoint on the Hugging Face Hub through Transformers 🤗, including [whisper-large-v3](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fwhisper-large-v3) and [distil-large-v3](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v3)\n- [Lightning Whisper MLX](https:\u002F\u002Fgithub.com\u002Fmustafaaljadery\u002Flightning-whisper-mlx?tab=readme-ov-file#lightning-whisper-mlx)\n- [MLX Audio Whisper](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fmlx-audio) - Fast Whisper inference on Apple Silicon\n- [Parakeet TDT](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fparakeet-tdt-1.1b) - Real-time streaming STT with sub-100ms latency on Apple Silicon (CUDA\u002FCPU via nano-parakeet, no NeMo)\n- [Paraformer - FunASR](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FFunASR)\n\n**LLM**\n- Any instruction-following model on the [Hugging Face Hub](https:\u002F\u002Fhuggingface.co\u002Fmodels?pipeline_tag=text-generation&sort=trending) via Transformers 🤗\n- [mlx-lm](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx-examples\u002Fblob\u002Fmain\u002Fllms\u002FREADME.md)\n- [OpenAI API](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fquickstart)\n\n**TTS**\n- [MeloTTS](https:\u002F\u002Fgithub.com\u002Fmyshell-ai\u002FMeloTTS)\n- [ChatTTS](https:\u002F\u002Fgithub.com\u002F2noise\u002FChatTTS?tab=readme-ov-file)\n- [Pocket TTS](https:\u002F\u002Fgithub.com\u002Fkyutai-labs\u002Fpocket-tts) - Streaming TTS with voice cloning from Kyutai Labs\n- [Kokoro-82M](https:\u002F\u002Fhuggingface.co\u002Fhexgrad\u002FKokoro-82M) - Fast and high-quality TTS optimized for Apple Silicon\n\n## Setup\n\nClone the 
repository:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech.git\ncd speech-to-speech\n```\n\nInstall dependencies with [uv](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv):\n```bash\nuv sync\n```\n\nThe project now uses a single `pyproject.toml` with platform markers, so macOS and non-macOS dependencies are resolved automatically from one file.\n\nIf you use Melo TTS (default on macOS), run this once after install:\n```bash\nuv run python -m unidic download\n```\n\nApple Silicon MeloTTS note:\n- If MeloTTS fails on MPS with `Output channels > 65536 not supported at the MPS device`, update macOS first.\n- We reproduced this on an older macOS release and verified that the same environment worked after updating to macOS `26.3.1`.\n\n**Note on DeepFilterNet:** DeepFilterNet (used for optional audio enhancement in VAD) is currently incompatible with Pocket TTS due to numpy version constraints. DeepFilterNet requires `numpy\u003C2`, while Pocket TTS requires `numpy>=2`.\n\nIf you want a DeepFilterNet-focused setup with `pyproject.toml`:\n1. Edit [`pyproject.toml`](.\u002Fpyproject.toml): remove the `pocket-tts` dependency line.\n2. Add `deepfilternet>=0.5.6` and `numpy\u003C2` to `project.dependencies`.\n3. Re-sync the environment:\n   ```bash\n   uv sync --refresh\n   ```\n\nTo switch back to Pocket TTS, revert those `pyproject.toml` changes and run `uv sync --refresh` again.\n\n\n## Usage\n\nThe pipeline can be run in three ways:\n- **Server\u002FClient approach**: Models run on a server, and audio input\u002Foutput are streamed from a client using TCP sockets.\n- **WebSocket approach**: Models run on a server, and audio input\u002Foutput are streamed from a client using WebSockets.\n- **Local approach**: Runs locally.\n\n### Recommended setup \n\n### Server\u002FClient Approach\n\n1. Run the pipeline on the server:\n   ```bash\n   python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0\n   ```\n\n2. 
Run the client locally to handle microphone input and receive generated audio:\n   ```bash\n   python listen_and_play.py --host \u003CIP address of your server>\n   ```\n\n### WebSocket Approach\n\n1. Run the pipeline with WebSocket mode:\n   ```bash\n   python s2s_pipeline.py --mode websocket --ws_host 0.0.0.0 --ws_port 8765\n   ```\n\n2. Connect to the WebSocket server from your client application at `ws:\u002F\u002F\u003Cserver-ip>:8765`. The server handles bidirectional audio streaming:\n   - Send raw audio bytes to the server (16kHz, int16, mono)\n   - Receive generated audio bytes from the server\n\n### Local Approach (Mac)\n\n1. For optimal settings on Mac:\n   ```bash\n   python s2s_pipeline.py --local_mac_optimal_settings\n   ```\n\n   You can also specify a particular LLM:\n   ```bash\n   python s2s_pipeline.py \\\n       --local_mac_optimal_settings \\\n       --lm_model_name mlx-community\u002FQwen3-4B-Instruct-2507-bf16\n   ```\n\nThis setting:\n   - Adds `--device mps` to use MPS for all models.\n   - Sets [Parakeet TDT](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fparakeet-tdt-1.1b) for STT (fast streaming ASR on Apple Silicon)\n   - Sets MLX LM for the language model (uses `--lm_model_name` to specify the model)\n   - Sets MeloTTS for TTS\n   - Requires one-time UniDic setup for MeloTTS:\n     ```bash\n     uv run python -m unidic download\n     ```\n   - `--tts pocket` and `--tts kokoro` are also valid TTS options on macOS.\n\n### Docker Server\n\n#### Install the NVIDIA Container Toolkit\n\nhttps:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html\n\n#### Start the Docker container\n```docker compose up```\n\n\n\n### Recommended usage with CUDA\n\nLeverage Torch Compile for Whisper with Pocket TTS for a simple low-latency setup:\n\n```bash\npython s2s_pipeline.py \\\n\t--lm_model_name microsoft\u002FPhi-3-mini-4k-instruct \\\n\t--stt_compile_mode reduce-overhead \\\n  
--tts pocket \\\n  --recv_host 0.0.0.0 \\\n\t--send_host 0.0.0.0 \n```\n\n### Multi-language Support\n\nThe pipeline currently supports English, French, Spanish, Chinese, Japanese, and Korean.  \nTwo use cases are considered:\n\n- **Single-language conversation**: Enforce the language setting using the `--language` flag, specifying the target language code (default is 'en').\n- **Language switching**: Set `--language` to 'auto'. In this case, Whisper detects the language for each spoken prompt, and the LLM is prompted with \"`Please reply to my message in ...`\" to ensure the response is in the detected language.\n\nPlease note that you must use STT and LLM checkpoints compatible with the target language(s). For multilingual TTS, use Melo (English, French, Spanish, Chinese, Japanese, and Korean) or Chat-TTS.\n\n#### With the server version:\n\nFor automatic language detection:\n\n```bash\npython s2s_pipeline.py \\\n    --stt whisper-mlx \\\n    --stt_model_name large-v3 \\\n    --language auto \\\n    --llm mlx-lm \\\n    --lm_model_name mlx-community\u002FQwen3-4B-Instruct-2507-bf16\n```\n\nOr for one language in particular (Chinese in this example):\n\n```bash\npython s2s_pipeline.py \\\n    --stt whisper-mlx \\\n    --stt_model_name large-v3 \\\n    --language zh \\\n    --llm mlx-lm \\\n    --lm_model_name mlx-community\u002FQwen3-4B-Instruct-2507-bf16\n```\n\n#### Local Mac Setup\n\nFor automatic language detection (note: `--stt whisper-mlx` overrides the default parakeet-tdt from optimal settings, since Whisper `large-v3` has broader language coverage):\n\n```bash\npython s2s_pipeline.py \\\n    --local_mac_optimal_settings \\\n    --stt whisper-mlx \\\n    --stt_model_name large-v3 \\\n    --language auto \\\n    --lm_model_name mlx-community\u002FQwen3-4B-Instruct-2507-bf16\n```\n\nOr for one language in particular (Chinese in this example):\n\n```bash\npython s2s_pipeline.py \\\n    --local_mac_optimal_settings \\\n    --stt whisper-mlx \\\n    
--stt_model_name large-v3 \\\n    --language zh \\\n    --lm_model_name mlx-community\u002FQwen3-4B-Instruct-2507-bf16\n```\n\n### Using Pocket TTS\n\nPocket TTS from Kyutai Labs provides streaming TTS with voice cloning capabilities. To use it:\n\n```bash\npython s2s_pipeline.py \\\n    --tts pocket \\\n    --pocket_tts_voice jean \\\n    --pocket_tts_device cpu\n```\n\nAvailable voice presets: `alba`, `marius`, `javert`, `jean`, `fantine`, `cosette`, `eponine`, `azelma`. You can also use custom voice files or Hugging Face paths.\n\n## Command-line Usage\n\n> **_NOTE:_** References for all the CLI arguments can be found directly in the [arguments classes](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Ftree\u002Fd5e460721e578fef286c7b64e68ad6a57a25cf1b\u002Farguments_classes) or by running `python s2s_pipeline.py -h`.\n\n### Module level Parameters \nSee the [ModuleArguments](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fblob\u002Fd5e460721e578fef286c7b64e68ad6a57a25cf1b\u002Farguments_classes\u002Fmodule_arguments.py) class. It lets you set:\n- a common `--device` (if one wants each part to run on the same device)\n- `--mode` `local` or `server`\n- chosen STT implementation \n- chosen LM implementation\n- chosen TTS implementation\n- logging level\n\n### VAD parameters\nSee the [VADHandlerArguments](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fblob\u002Fd5e460721e578fef286c7b64e68ad6a57a25cf1b\u002Farguments_classes\u002Fvad_arguments.py) class. 
Notably:\n- `--thresh`: Threshold value to trigger voice activity detection.\n- `--min_speech_ms`: Minimum duration of detected voice activity to be considered speech.\n- `--min_silence_ms`: Minimum length of silence intervals for segmenting speech, balancing sentence cutting and latency reduction.\n\n\n### STT, LM and TTS parameters\n\n`model_name`, `torch_dtype`, and `device` are exposed for each implementation of the Speech to Text, Language Model, and Text to Speech. Specify the targeted pipeline part with the corresponding prefix (e.g. `stt`, `lm` or `tts`, check the implementations' [arguments classes](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Ftree\u002Fd5e460721e578fef286c7b64e68ad6a57a25cf1b\u002Farguments_classes) for more details).\n\nFor example:\n```bash\n--lm_model_name google\u002Fgemma-2b-it\n```\n\n### Generation parameters\n\nOther generation parameters of the model's generate method can be set using the part's prefix + `_gen_`, e.g., `--stt_gen_max_new_tokens 128`. These parameters can be added to the pipeline part's arguments class if not already exposed.\n\n## Citations\n\n### Silero VAD\n```bibtex\n@misc{Silero VAD,\n  author = {Silero Team},\n  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},\n  year = {2021},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fsnakers4\u002Fsilero-vad}},\n  commit = {insert_some_commit_here},\n  email = {hello@silero.ai}\n}\n```\n\n### Distil-Whisper\n```bibtex\n@misc{gandhi2023distilwhisper,\n      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},\n      author={Sanchit Gandhi and Patrick von Platen and Alexander M. 
Rush},\n      year={2023},\n      eprint={2311.00430},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n### Parler-TTS\n```bibtex\n@misc{lacombe-etal-2024-parler-tts,\n  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},\n  title = {Parler-TTS},\n  year = {2024},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fparler-tts}}\n}\n```\n","\u003Cdiv align=\"center\">\n  \u003Cdiv>&nbsp;\u003C\u002Fdiv>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_speech-to-speech_readme_b4af2adb8a01.png\" width=\"600\"\u002F> \n\u003C\u002Fdiv>\n\n# 语音到语音：使用开源模型构建本地语音助手\n\n## 📖 快速索引\n* [方法](#approach)\n  - [结构](#structure)\n  - [模块化](#modularity)\n* [设置](#setup)\n* [使用](#usage)\n  - [Docker 服务器方式](#docker-server)\n  - [服务器\u002F客户端方式](#serverclient-approach)\n  - [本地方式](#local-approach-running-on-mac)\n* [命令行使用](#command-line-usage)\n  - [模型参数](#model-parameters)\n  - [生成参数](#generation-parameters)\n  - [重要参数](#notable-parameters)\n\n## 方法\n\n### 结构\n本仓库实现了一个由以下部分组成的级联式语音到语音流水线：\n1. **语音活动检测 (VAD)**\n2. **语音转文本 (STT)**\n3. **语言模型 (LM)**\n4. 
**文本转语音 (TTS)**\n\n### 模块化\n该流水线提供了一种完全开放且模块化的方案，重点利用 Hugging Face 中心库中 Transformers 库提供的模型。代码设计易于修改，并且我们已经支持特定设备和外部库的实现：\n\n**VAD** \n- [Silero VAD v5](https:\u002F\u002Fgithub.com\u002Fsnakers4\u002Fsilero-vad)\n\n**STT**\n- Hugging Face 中心库中通过 Transformers 🤗 提供的任何 [Whisper](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fen\u002Fmodel_doc\u002Fwhisper) 模型检查点，包括 [whisper-large-v3](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fwhisper-large-v3) 和 [distil-large-v3](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v3)\n- [Lightning Whisper MLX](https:\u002F\u002Fgithub.com\u002Fmustafaaljadery\u002Flightning-whisper-mlx?tab=readme-ov-file#lightning-whisper-mlx)\n- [MLX Audio Whisper](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fmlx-audio) - 在 Apple Silicon 上实现快速 Whisper 推理\n- [Parakeet TDT](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fparakeet-tdt-1.1b) - 在 Apple Silicon 上实现亚百毫秒延迟的实时流式 STT（通过 nano-parakeet 使用 CUDA\u002FCPU，无需 NeMo）\n- [Paraformer - FunASR](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FFunASR)\n\n**LLM**\n- Hugging Face 中心库中任何遵循指令的语言模型（可通过 Transformers 🤗 获取），链接为：[Hugging Face Hub](https:\u002F\u002Fhuggingface.co\u002Fmodels?pipeline_tag=text-generation&sort=trending)\n- [mlx-lm](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx-examples\u002Fblob\u002Fmain\u002Fllms\u002FREADME.md)\n- [OpenAI API](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fquickstart)\n\n**TTS**\n- [MeloTTS](https:\u002F\u002Fgithub.com\u002Fmyshell-ai\u002FMeloTTS)\n- [ChatTTS](https:\u002F\u002Fgithub.com\u002F2noise\u002FChatTTS?tab=readme-ov-file)\n- [Pocket TTS](https:\u002F\u002Fgithub.com\u002Fkyutai-labs\u002Fpocket-tts) - 来自 Kyutai Labs 的流式语音克隆 TTS\n- [Kokoro-82M](https:\u002F\u002Fhuggingface.co\u002Fhexgrad\u002FKokoro-82M) - 针对 Apple Silicon 优化的快速高质量 TTS\n\n## 设置\n\n克隆仓库：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech.git\ncd speech-to-speech\n```\n\n使用 
[uv](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) 安装依赖项：\n```bash\nuv sync\n```\n\n该项目现在使用单个带有平台标记的 `pyproject.toml` 文件，因此 macOS 和非 macOS 平台的依赖项会自动从一个文件中解析。\n\n如果您使用 Melo TTS（macOS 上的默认选项），请在安装后运行一次：\n```bash\nuv run python -m unidic download\n```\n\nApple Silicon 上的 MeloTTS 注意事项：\n- 如果 MeloTTS 在 MPS 上因“输出通道数超过 65536 不受 MPS 设备支持”而失败，请先更新 macOS。\n- 我们曾在较旧版本的 macOS 上复现过此问题，并确认在同一环境下升级到 macOS `26.3.1` 后问题得以解决。\n\n**关于 DeepFilterNet 的说明：** DeepFilterNet（用于 VAD 中可选的音频增强）目前由于 numpy 版本限制，与 Pocket TTS 不兼容。DeepFilterNet 需要 `numpy\u003C2`，而 Pocket TTS 则需要 `numpy>=2`。\n\n如果您希望以 DeepFilterNet 为中心进行设置并使用 `pyproject.toml`：\n1. 编辑 [`pyproject.toml`](.\u002Fpyproject.toml)：移除 `pocket-tts` 依赖行。\n2. 将 `deepfilternet>=0.5.6` 和 `numpy\u003C2` 添加到 `project.dependencies`。\n3. 重新同步环境：\n   ```bash\n   uv sync --refresh\n   ```\n\n要切换回 Pocket TTS，只需恢复 `pyproject.toml` 中的更改，并再次运行 `uv sync --refresh`。\n\n\n## 使用\n\n该流水线可以通过三种方式运行：\n- **服务器\u002F客户端方式**：模型在服务器上运行，音频输入输出通过 TCP 套接字从客户端传输。\n- **WebSocket 方式**：模型在服务器上运行，音频输入输出通过 WebSocket 从客户端传输。\n- **本地方式**：在本地运行。\n\n### 推荐设置 \n\n### 服务器\u002F客户端方式\n\n1. 在服务器上运行流水线：\n   ```bash\n   python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0\n   ```\n\n2. 在本地运行客户端以处理麦克风输入并接收生成的音频：\n   ```bash\n   python listen_and_play.py --host \u003C您的服务器 IP 地址>\n   ```\n\n### WebSocket 方式\n\n1. 以 WebSocket 模式运行流水线：\n   ```bash\n   python s2s_pipeline.py --mode websocket --ws_host 0.0.0.0 --ws_port 8765\n   ```\n\n2. 在您的客户端应用中连接到 WebSocket 服务器，地址为 `ws:\u002F\u002F\u003C服务器 IP>:8765`。服务器将处理双向音频流：\n   - 向服务器发送原始音频字节（16kHz，int16，单声道）\n   - 从服务器接收生成的音频字节\n\n### 本地方式（Mac）\n\n1. 
为了在 Mac 上获得最佳设置：\n   ```bash\n   python s2s_pipeline.py --local_mac_optimal_settings\n   ```\n\n   您也可以指定特定的语言模型：\n   ```bash\n   python s2s_pipeline.py \\\n       --local_mac_optimal_settings \\\n       --lm_model_name mlx-community\u002FQwen3-4B-Instruct-2507-bf16\n   ```\n\n此设置：\n   - 添加 `--device mps` 以在所有模型上使用 MPS。\n   - 设置 [Parakeet TDT](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fparakeet-tdt-1.1b) 作为 STT（在 Apple Silicon 上实现快速流式 ASR）。\n   - 设置 MLX LM 作为语言模型（使用 `--lm_model_name` 指定模型）。\n   - 设置 MeloTTS 作为 TTS。\n   - 需要为 MeloTTS 进行一次性 UniDic 设置：\n     ```bash\n     uv run python -m unidic download\n     ```\n   - `--tts pocket` 和 `--tts kokoro` 也是 macOS 上有效的 TTS 选项。\n\n### Docker 服务器\n\n#### 安装 NVIDIA Container Toolkit\n\nhttps:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html\n\n#### 启动 docker 容器\n```docker compose up```\n\n\n\n### 使用 Cuda 的推荐用法\n\n借助 Torch Compile 和 Pocket TTS，结合 Whisper 实现简单低延迟的设置：\n\n```bash\npython s2s_pipeline.py \\\n\t--lm_model_name microsoft\u002FPhi-3-mini-4k-instruct \\\n\t--stt_compile_mode reduce-overhead \\\n  --tts pocket \\\n  --recv_host 0.0.0.0 \\\n\t--send_host 0.0.0.0 \n```\n\n### 多语言支持\n\n该流水线目前支持英语、法语、西班牙语、中文、日语和韩语。考虑了两种使用场景：\n\n- **单语言对话**：通过 `--language` 标志强制设置语言，指定目标语言代码（默认为 'en'）。\n- **语言切换**：将 `--language` 设置为 'auto'。在这种情况下，Whisper 会检测每个语音提示的语言，并向 LLM 提示 \"`请用 ... 
回答我的消息`\"，以确保响应使用检测到的语言。\n\n请注意，您必须使用与目标语言兼容的 STT 和 LLM 检查点。对于多语言 TTS，请使用 Melo（英语、法语、西班牙语、中文、日语和韩语）或 Chat-TTS。\n\n#### 使用服务器版本：\n\n对于自动语言检测：\n\n```bash\npython s2s_pipeline.py \\\n    --stt whisper-mlx \\\n    --stt_model_name large-v3 \\\n    --language auto \\\n    --llm mlx-lm \\\n    --lm_model_name mlx-community\u002FQwen3-4B-Instruct-2507-bf16\n```\n\n或者针对特定语言，例如中文：\n\n```bash\npython s2s_pipeline.py \\\n    --stt whisper-mlx \\\n    --stt_model_name large-v3 \\\n    --language zh \\\n    --llm mlx-lm \\\n    --lm_model_name mlx-community\u002FQwen3-4B-Instruct-2507-bf16\n```\n\n#### 本地 Mac 设置\n\n对于自动语言检测（注意：`--stt whisper-mlx` 会覆盖最佳设置中的默认 parakeet-tdt，因为 Whisper `large-v3` 具有更广泛的语言覆盖范围）：\n\n```bash\npython s2s_pipeline.py \\\n    --local_mac_optimal_settings \\\n    --stt whisper-mlx \\\n    --stt_model_name large-v3 \\\n    --language auto \\\n    --lm_model_name mlx-community\u002FQwen3-4B-Instruct-2507-bf16\n```\n\n或者针对特定语言，例如中文：\n\n```bash\npython s2s_pipeline.py \\\n    --local_mac_optimal_settings \\\n    --stt whisper-mlx \\\n    --stt_model_name large-v3 \\\n    --language zh \\\n    --lm_model_name mlx-community\u002FQwen3-4B-Instruct-2507-bf16\n```\n\n### 使用 Pocket TTS\n\nKyutai Labs 的 Pocket TTS 提供具有语音克隆功能的流式 TTS。要使用它：\n\n```bash\npython s2s_pipeline.py \\\n    --tts pocket \\\n    --pocket_tts_voice jean \\\n    --pocket_tts_device cpu\n```\n\n可用的语音预设包括：`alba`、`marius`、`javert`、`jean`、`fantine`、`cosette`、`eponine`、`azelma`。您也可以使用自定义语音文件或 HuggingFace 路径。\n\n## 命令行使用\n\n> **_注意:_** 所有 CLI 参数的参考可以直接在 [参数类](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Ftree\u002Fd5e460721e578fef286c7b64e68ad6a57a25cf1b\u002Farguments_classes) 中找到，或者通过运行 `python s2s_pipeline.py -h` 获取。\n\n### 模块级参数 \n参见 [ModuleArguments](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fblob\u002Fd5e460721e578fef286c7b64e68ad6a57a25cf1b\u002Farguments_classes\u002Fmodule_arguments.py) 类。允许设置：\n- 一个通用的 `--device`（如果希望各部分在同一设备上运行）\n- 
`--mode` 为 `local` 或 `server`\n- 选择的 STT 实现\n- 选择的 LM 实现\n- 选择的 TTS 实现\n- 日志级别\n\n### VAD 参数\n参见 [VADHandlerArguments](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fblob\u002Fd5e460721e578fef286c7b64e68ad6a57a25cf1b\u002Farguments_classes\u002Fvad_arguments.py) 类。值得注意的是：\n- `--thresh`：触发语音活动检测的阈值。\n- `--min_speech_ms`：被检测到的语音活动的最小持续时间，才被视为语音。\n- `--min_silence_ms`：用于分割语音的静音间隔的最小长度，以平衡句子切割和延迟降低。\n\n\n### STT、LM 和 TTS 参数\n\n`model_name`、`torch_dtype` 和 `device` 针对语音转文本、语言模型和文本转语音的每种实现都公开了。使用相应的前缀指定目标流水线部分（例如 `stt`、`lm` 或 `tts`，更多细节请查看各实现的 [参数类](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Ftree\u002Fd5e460721e578fef286c7b64e68ad6a57a25cf1b\u002Farguments_classes)）。\n\n例如：\n```bash\n--lm_model_name google\u002Fgemma-2b-it\n```\n\n### 生成参数\n\n模型生成方法的其他生成参数可以使用部分前缀加 `_gen_` 来设置，例如 `--stt_gen_max_new_tokens 128`。如果尚未公开，这些参数可以添加到流水线部分的参数类中。\n\n## 引用\n\n### Silero VAD\n```bibtex\n@misc{Silero VAD,\n  author = {Silero Team},\n  title = {Silero VAD: 预训练的企业级语音活动检测器 (VAD)、数字检测器和语言分类器},\n  year = {2021},\n  publisher = {GitHub},\n  journal = {GitHub 仓库},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fsnakers4\u002Fsilero-vad}},\n  commit = {insert_some_commit_here},\n  email = {hello@silero.ai}\n}\n```\n\n### Distil-Whisper\n```bibtex\n@misc{gandhi2023distilwhisper,\n      title={Distil-Whisper：通过大规模伪标签进行稳健的知识蒸馏},\n      author={Sanchit Gandhi 和 Patrick von Platen 和 Alexander M. 
Rush},\n      year={2023},\n      eprint={2311.00430},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n### Parler-TTS\n```bibtex\n@misc{lacombe-etal-2024-parler-tts,\n  author = {Yoach Lacombe 和 Vaibhav Srivastav 和 Sanchit Gandhi},\n  title = {Parler-TTS},\n  year = {2024},\n  publisher = {GitHub},\n  journal = {GitHub 仓库},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fparler-tts}}\n}\n```","# Speech-to-Speech 快速上手指南\n\nSpeech-to-Speech 是一个基于开源模型构建本地语音智能体（Voice Agents）的级联管道工具。它支持语音活动检测 (VAD)、语音转文本 (STT)、大语言模型 (LLM) 和文本转语音 (TTS) 的全流程处理，具有高度模块化特性。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: macOS (推荐 Apple Silicon) 或 Linux (支持 NVIDIA CUDA)。\n- **Python 版本**: 建议使用 Python 3.10+。\n- **硬件加速**:\n  - **macOS**: 利用 MPS (Metal Performance Shaders) 加速。\n  - **Linux**: 需安装 [NVIDIA Container Toolkit](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html) 以支持 GPU 加速。\n\n### 前置依赖\n- **包管理器**: 项目推荐使用 [`uv`](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) 进行依赖管理，安装速度更快且能自动处理平台差异。\n  ```bash\n  # 安装 uv (如果尚未安装)\n  curl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n  ```\n- **网络环境**: 由于模型主要从 Hugging Face Hub 下载，国内用户建议配置镜像源或使用代理以确保下载顺利。\n\n## 安装步骤\n\n1. **克隆仓库**\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech.git\n   cd speech-to-speech\n   ```\n\n2. **同步依赖**\n   使用 `uv` 自动解析并安装适用于当前平台（macOS 或 Linux）的依赖：\n   ```bash\n   uv sync\n   ```\n\n3. **额外配置 (仅 macOS 用户)**\n   如果使用的是默认的 MeloTTS (macOS 默认)，需要下载 UniDic 词典：\n   ```bash\n   uv run python -m unidic download\n   ```\n   > **注意**: 如果在 Apple Silicon 上遇到 MPS 错误 (`Output channels > 65536 not supported`)，请尝试更新 macOS 系统至最新版本。\n\n4. 
**可选配置 (DeepFilterNet)**\n   如果需要音频增强功能且不使用 Pocket TTS，需手动修改 `pyproject.toml` 移除 `pocket-tts` 并添加 `deepfilternet>=0.5.6` 和 `numpy\u003C2`，然后运行 `uv sync --refresh`。\n\n## 基本使用\n\n该工具支持三种运行模式：**本地模式** (Local)、**服务端\u002F客户端模式** (Server\u002FClient) 和 **WebSocket 模式**。以下提供最常用的两种场景示例。\n\n### 场景一：macOS 本地快速体验 (推荐)\n此模式利用 Apple Silicon 优化设置，自动选择 Parakeet TDT (STT)、MLX LM (LLM) 和 MeloTTS (TTS)，适合单机调试。\n\n```bash\npython s2s_pipeline.py --local_mac_optimal_settings\n```\n\n**指定特定大模型运行：**\n```bash\npython s2s_pipeline.py \\\n    --local_mac_optimal_settings \\\n    --lm_model_name mlx-community\u002FQwen3-4B-Instruct-2507-bf16\n```\n\n### 场景二：服务器部署 (CUDA\u002FLinux)\n适用于在服务器上运行模型，通过 TCP  sockets 流式传输音频。\n\n1. **启动服务端** (监听所有接口):\n   ```bash\n   python s2s_pipeline.py \\\n       --lm_model_name microsoft\u002FPhi-3-mini-4k-instruct \\\n       --stt_compile_mode reduce-overhead \\\n       --tts pocket \\\n       --recv_host 0.0.0.0 \\\n       --send_host 0.0.0.0\n   ```\n\n2. **启动客户端** (在另一台机器或同一机器的不同终端运行，替换 `\u003CIP address>` 为服务器 IP):\n   ```bash\n   python listen_and_play.py --host \u003CIP address of your server>\n   ```\n\n### 多语言支持\n默认支持英语、法语、西班牙语、中文、日语和韩语。\n\n- **自动语言检测**:\n  ```bash\n  python s2s_pipeline.py --language auto ...\n  ```\n- **指定语言 (例如中文)**:\n  ```bash\n  python s2s_pipeline.py --language zh ...\n  ```\n  > 注意：确保所选的 STT 和 LLM 模型支持目标语言。对于多语言 TTS，推荐使用 `Melo` 或 `ChatTTS`。\n\n### 命令行参数提示\n所有可用参数可通过以下命令查看详细说明：\n```bash\npython s2s_pipeline.py -h\n```\n关键参数前缀说明：\n- `--stt_...`: 语音转文本相关参数\n- `--lm_...`: 大语言模型相关参数\n- `--tts_...`: 文本转语音相关参数\n- `--vad_...`: 语音活动检测相关参数","一位独立开发者希望在本地 Mac 电脑上构建一个低延迟、隐私安全的语音助手原型，用于实时处理用户指令并反馈自然语音。\n\n### 没有 speech-to-speech 时\n- **集成繁琐**：需要手动拼接 VAD、Whisper、LLM 和 TTS 四个独立模块，代码耦合度高且调试困难。\n- **硬件适配难**：难以充分利用 Apple Silicon 的 MPS 加速，导致推理延迟高，无法实现流畅的实时对话。\n- **隐私风险**：若调用云端 API 处理敏感数据，存在用户语音泄露隐患，不符合本地化部署需求。\n- **模型切换僵化**：更换更轻量的 STT 或更自然的 TTS 模型时，需重写大量底层接口代码，试错成本极高。\n\n### 使用 speech-to-speech 后\n- **开箱即用**：直接运行预置的级联流水线，一键整合 Silero 
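VAD, Whisper-large-v3, an instruction-tuned model, and Kokoro TTS, which greatly shortens the development cycle.\n- **Strong performance**: automatically uses MLX Audio Whisper and the Kokoro-82M model optimized for Apple Silicon, targeting sub-second end-to-end latency for a smooth conversation.\n- **Fully local**: all computation happens on the local machine, so once the models are downloaded it runs without a network connection and no user audio leaves the device.\n- **Flexible and modular**: with a small configuration change you can swap the model at any stage for another from the Hugging Face Hub (for example, switch to Parakeet TDT) and quickly compare combinations.\n\nspeech-to-speech lets developers build high-performance, fully local, open-source voice agents in a modular way, shifting the work from “piecing components together” to “focusing on the application”.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_speech-to-speech_b4af2adb.png","huggingface","Hugging Face","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhuggingface_90da21a4.png","The AI community building the future.",null,"https:\u002F\u002Fhuggingface.co\u002F","https:\u002F\u002Fgithub.com\u002Fhuggingface",[81,85],{"name":82,"color":83,"percentage":84},"Python","#3572A5",99.9,{"name":86,"color":87,"percentage":88},"Dockerfile","#384d54",0.1,4656,546,"2026-04-14T22:11:59","Apache-2.0","Linux, macOS","Not required. On Linux, an NVIDIA GPU with CUDA is recommended (Docker use requires the NVIDIA Container Toolkit); on macOS, Apple Silicon (M1\u002FM2\u002FM3) with MPS acceleration is recommended. No specific VRAM requirement is stated, though the project mentions running large models.","Not specified",{"notes":97,"python":98,"dependencies":99},"1. The project uses 'uv' to sync dependencies and resolves them per platform (macOS\u002Fnon-macOS). 2. macOS users who use MeloTTS and hit an MPS error need to upgrade macOS to the latest version. 3. DeepFilterNet and Pocket TTS have conflicting numpy requirements (the former needs \u003C2, the latter >=2), so you must edit pyproject.toml by hand and choose one. 4. The first time you use MeloTTS, run the command that downloads the UniDic dictionary. 5. 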
Multiple run modes are supported: local, Server\u002FClient over TCP, WebSocket, and Docker deployment.","Not specified (managed via uv and pyproject.toml)",[100,101,102,103,104,105,106,107,108,109],"torch","transformers","mlx","mlx-lm","silero-vad","whisper","MeloTTS","Pocket TTS","numpy","unidic",[35,111,13,15,14],"音频",[113,114,115,116,117,118,119,120,121],"ai","assistant","language-model","machine-learning","python","speech","speech-synthesis","speech-to-text","speech-translation","2026-03-27T02:49:30.150509","2026-04-15T12:56:17.898657",[125,130,135,140,145,150],{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},34288,"How do I run the project on macOS?","On macOS, pass the `--local_mac_optimal_settings` flag, which automatically picks the device and model configuration best suited to a Mac. Also make sure you are on the latest main branch with the relevant PR merged, and install with `uv pip install -r requirements.txt`. If you want to use Melo TTS, you may also need to run `python -m unidic download` manually (this adds roughly 550MB to the install, so it is treated as an optional step).","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fissues\u002F26",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},34289,"What should I do about the NLTK version error (nltk==3.8.2 not found) when installing requirements.txt?","The error occurs because nltk 3.8.2 is not currently available on PyPI. There are two fixes:\n1. Install that version directly from GitHub: `pip install git+https:\u002F\u002Fgithub.com\u002Fnltk\u002Fnltk.git@3.8.2`\n2. 
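Or edit the requirements file to drop the version pin, leaving just `nltk`, and let pip resolve the dependency conflict. The main branch has since fixed this issue.","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fissues\u002F3",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},34290,"How do I resolve the mlx package version conflict during installation?","The error is usually caused by `lightning-whisper-mlx` and `mlx-lm` depending on conflicting versions of `mlx`. The maintainers have added a warning about this to the project. If you run into it, check that you are on the latest main branch, or try loosening the package version ranges. The maintainers say such conflicts have been addressed by updating the README and the dependency handling.","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fissues\u002F34",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},34291,"What should I do if the client gets no response from the server after starting?","If you are using the old `localhost` connection method, note that it has been deprecated, which may be why you get no response. Update to the latest version and use the new connection mechanism. Also make sure the server started correctly and that the port is not already in use. The maintainers confirmed the issue was resolved once the localhost approach was deprecated.","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fissues\u002F57",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},34292,"How do I integrate the project with a frontend and expose it to the internet?","The built-in server\u002Fclient architecture is designed mainly for local use. To deploy publicly or integrate a frontend, see the community project CleanS2S, which has deployed this project and exposed an API. You can look at the part of its backend code that configures an arbitrary Hugging Face LLM (usually in s2s_server_pipeline.py), or build on its architecture directly. Using `--recv_host` and `--send_host` directly may need extra network configuration in more complex setups (for example behind ngrok).","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fissues\u002F132",{"id":151,"question_zh":152,"answer_zh":153,"source_url":144},34293,"Does the project support multiple languages (such as Chinese)?","The native implementation does not yet fully build in multi-language support, but the community is actively working on it (there is already a PR for Chinese TTS). For other languages, users may need to modify the code themselves to fit different language models. The maintainers plan to tackle multi-language support further in the future.",[155],{"id":156,"version":157,"summary_zh":158,"released_at":159},271626,"2025","## What's Changed\n* By @Vaibhavs10 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F2: minor documentation fixes.\n* By @AlexHayton in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F7: fixed the missing sounddevice module.\n* By @RodriMora in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F23: updated README.md.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F29: fixed an nltk issue.\n* By 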
@codearranger in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F22: Dockerized the project.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F20: added MPS support.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F31: added the Apache license.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F32: refactored the arguments folder and ran ruff.\n* By @RonanKMcGovern in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F40: allowed selecting the language model and added MLX Gemma.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F41: improved the MLX pipeline.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F43: restructured the files for all handlers.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F49: added a minimum new tokens setting.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F51: added a warning about installing flash attn.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F52: improved logging.\n* By @wuhongsheng in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F48: added paraformer_zh ASR.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F58: passed the minimum new tokens setting to the compiled Whisper graph in the thread.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F53: added Paraformer for Chinese STT.\n* By @wuhongsheng in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F55: added ChatTTS.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F59: added Chinese support for ChatTTS.\n* By @wuhongsheng in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F61: added DeepFilterNet speech enhancement for clearer speech.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F77: improved the documentation.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F60: added multi-language support.\n* By @AgainstEntropy in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F78: updated module_arguments.py.\n* By @rchan26 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F84: added a language argument to the Lightning Whisper handler.\n* By @rchan26 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F85: fixed relative links in the README.\n* By @andimarafioti in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F87: miscellaneous fixes.\n* By @BrutalCoding in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fspeech-to-speech\u002Fpull\u002F91: changed [True] to [False] in the audio_enhancement help text to match the actual default.\n* 上","2026-02-06T12:26:33"]