[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-FluidInference--FluidAudio":3,"tool-FluidInference--FluidAudio":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150037,2,"2026-04-10T23:33:47",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":76,"owner_website":78,"owner_url":79,"languages":80,"stars":104,"forks":105,"last_commit_at":106,"license":107,"difficulty_score":32,"env_os":108,"env_gpu":109,"env_ram":110,"env_deps":111,"category_tags":119,"github_topics":121,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":141,"updated_at":142,"faqs":143,"releases":177},5829,"FluidInference\u002FFluidAudio","FluidAudio","Frontier CoreML audio models in your apps — text-to-speech, speech-to-text, voice activity detection, and speaker diarization. In Swift, powered by SOTA open source. 
","FluidAudio 是一款专为 Apple 设备打造的 Swift SDK，旨在让开发者轻松在 macOS 和 iOS 应用中集成前沿的本地音频 AI 能力。它支持语音转文字、文字转语音、语音活动检测以及说话人区分等多种核心功能。\n\n针对传统音频处理依赖云端服务器导致的延迟高、隐私风险大及流量成本高等痛点，FluidAudio 通过将推理任务完全卸载至苹果神经引擎（ANE），实现了全本地化、低延迟且低功耗的运行模式。这不仅大幅减少了内存占用，还避免了占用 CPU 或 GPU 资源，非常适合后台处理、环境计算及“始终在线”的应用场景。\n\n该工具主要面向 iOS 和 macOS 应用开发者，尤其是那些希望在不依赖网络的情况下为用户提供实时语音交互功能的团队。其技术亮点在于集成了多个开源最先进模型（如 Parakeet 和 Kokoro），支持包括中文、日文在内的多种语言，并具备端到端 utterance 检测和语音克隆等高级特性。只需几行代码，开发者即可将这些强大的模型融入自己的项目中，同时保持用户数据的完全私密。","![https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_de63c6ec5a63.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_de63c6ec5a63.png)\n\n# FluidAudio - Transcription, Text-to-speech, VAD, Speaker diarization with CoreML Models\n\n[![Swift](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSwift-6.0+-orange.svg)](https:\u002F\u002Fswift.org)\n[![Platform](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPlatform-macOS%20%7C%20iOS-blue.svg)](https:\u002F\u002Fdeveloper.apple.com)\n[![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocumentation-docs.fluidinference.com-008574.svg)](https:\u002F\u002Fdocs.fluidinference.com\u002Fintroduction)\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-Join%20Chat-7289da.svg)](https:\u002F\u002Fdiscord.gg\u002FWNsvaCtmDe)\n[![Hugging Face Models](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face%20Models-800k%2B%20downloads-brightgreen?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FFluidInference)\n[![Ask DeepWiki](https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg)](https:\u002F\u002Fdeepwiki.com\u002FFluidInference\u002FFluidAudio)\n\nFluidAudio is a Swift SDK for fully local, low-latency audio AI on Apple devices, with inference offloaded to the Apple Neural Engine (ANE), resulting in less memory and generally faster inference.\n\nThe SDK includes state-of-the-art speaker diarization, transcription, and voice activity detection via open-source models (MIT\u002FApache 2.0) that can be integrated with just a few lines of code. Models are optimized for background processing, ambient computing and always on workloads by running inference on the ANE, minimizing CPU usage and avoiding GPU\u002FMPS entirely.\n\nFor custom use cases, feedback, additional model support, or platform requests, join our [Discord](https:\u002F\u002Fdiscord.gg\u002FWNsvaCtmDe). 
We're also bringing visual, language, and TTS models to device and will share updates there.\n\nBelow are some featured local AI apps using Fluid Audio models on macOS and iOS:\n\n\u003Cp align=\"left\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FBeingpax\u002FVoiceInk\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_11448ffcd106.png\" height=\"40\" alt=\"Voice Ink\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fspokenly.app\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_8f49214036ee.png\" height=\"40\" alt=\"Spokenly\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fslipbox.ai\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_4ff51848e56a.png\" height=\"40\" alt=\"Slipbox\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhex.kitlangton.com\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_baca799affef.png\" height=\"40\" alt=\"Hex\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fboltai.com\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_9158346dcc25.png\" height=\"40\" alt=\"BoltAI\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fparaspeech.com\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_f180f436c2bd.png\" height=\"40\" alt=\"Paraspeech\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Faltic.dev\u002Ffluid\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_fb36cf7b00bd.png\" height=\"40\" alt=\"Fluid Voice\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fsnaply.ai\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_a1a791729f62.png\" height=\"40\" alt=\"Snaply\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fyazinsai\u002FOpenOats\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_2f9dd891491a.png\" height=\"40\" alt=\"OpenOats\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Ftalat.app\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_dc813ca8a54e.png\" height=\"40\" alt=\"Talat\">\u003C\u002Fa>\n\u003C!-- Add your app: submit logo via PR. The Fluid Inference team works to curate this and add new apps to the showcase section every couple of weeks. We appreciate your patience. -->\n\u003C\u002Fp>\n\nWant to convert your own model? Check [möbius](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Fmobius)\n\n## Highlights\n\n- **Automatic Speech Recognition (ASR)**: [Parakeet TDT v3](Documentation\u002FModels.md#batch-transcription-near-real-time) (0.6b) and other TDT\u002FCTC models for batch transcription supporting 25 European languages, Japanese, and Chinese; [Parakeet EOU](Documentation\u002FModels.md#streaming-transcription-true-real-time) (120m) for streaming ASR with end-of-utterance detection (English only). See all [ASR models](Documentation\u002FModels.md#asr-models).\n- **Inverse Text Normalization (ITN)**: Post-process ASR output to convert spoken-form to written-form (\"two hundred\" → \"200\"). 
See [text-processing-rs](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Ftext-processing-rs)\n- **Text-to-Speech (TTS)**: Kokoro (82m) for parallel synthesis with SSML and pronunciation control across 9 languages (EN, ES, FR, HI, IT, JA, PT, ZH); PocketTTS for streaming TTS with voice cloning support (English only)\n- **Speaker Diarization (Online + Offline)**: Speaker separation and identification across audio streams. Streaming pipeline for real-time processing and offline batch pipeline with advanced clustering.\n- **Speaker Embedding Extraction**: Generate speaker embeddings for voice comparison and clustering, you can use this for speaker identification\n- **Voice Activity Detection (VAD)**: Voice activity detection with Silero models\n- **Apple Neural Engine**: Models run efficiently on Apple's ANE for maximum performance with minimal power consumption\n- **Open-Source Models**: All models are publicly available on HuggingFace — converted and optimized by our team; permissive licenses. See [full model catalog](Documentation\u002FModels.md).\n\n## Video Demos\n\n| Link | Description |\n| --- | --- |\n| **[Spokenly Real-time ASR](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=9fXKKkyL8JE)** | Video demonstration of FluidAudio's transcription accuracy and speed |\n| **[Senko Integration](https:\u002F\u002Fx.com\u002Fhamza_q_\u002Fstatus\u002F1970228971657928995)** | Python Speaker diarization on Mac using FluidAudio's segmentation model |\n| **[Kokoro TTS](https:\u002F\u002Fx.com\u002Fsach1n\u002Fstatus\u002F1977817056507793521)** | Text-to-speech demo using FluidAudio's Kokoro and Silero models on iOS |\n| **[Parakeet Realtime EOU](https:\u002F\u002Fx.com\u002Fsach1n\u002Fstatus\u002F2003210626659680762)** | Parakeet streaming ASR with end-of-utterance detection on iOS |\n| **[Sortformer Diarization](https:\u002F\u002Fx.com\u002FAlex_tra_memory\u002Fstatus\u002F2010530705667661843)** | Sortformer for speaker diarization with overlapping speech on iOS |\n| **[PocketTTS](https:\u002F\u002Fx.com\u002Fsach1n\u002Fstatus\u002F2017627657006158296)** | Streaming text-to-speech using PocketTTS on iOS |\n| **[Parakeet EOU Ultra-Low Latency](https:\u002F\u002Fx.com\u002Fy_earu\u002Fstatus\u002F2038654262608064967)** | Real-time Parakeet EOU transcription on iOS demonstrating ultra-low latency speech-to-text |\n| **[Action Phrase Live Production Control](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=ykcvdTHHmrk)** | Voice-controlled live production workflow using FluidAudio's ASR and speaker diarization to trigger cameras, graphics, and layouts with natural voice commands |\n\n## Showcase\n\nMake a PR if you want to add your app, please keep it in chronological order.\n\n| App | Description |\n| --- | --- |\n| **[Voice Ink](https:\u002F\u002Ftryvoiceink.com\u002F)** | Local AI for instant, private transcription with near-perfect accuracy. Uses Parakeet ASR. |\n| **[Spokenly](https:\u002F\u002Fspokenly.app\u002F)** | Mac dictation app for fast, accurate voice-to-text; supports real-time dictation and file transcription. Uses Parakeet ASR and speaker diarization. |\n| **[Senko](https:\u002F\u002Fgithub.com\u002Fnarcotic-sh\u002Fsenko)** | A very fast and accurate speaker diarization pipeline. 
A [good example](https:\u002F\u002Fgithub.com\u002Fnarcotic-sh\u002Fsenko\u002Fcommit\u002F51dbd8bde764c3c6648dbbae57d6aff66c5ca15c) for how to integrate FluidAudio into a Python app |\n| **[Slipbox](https:\u002F\u002Fslipbox.ai\u002F)** | Privacy-first meeting assistant for real-time conversation intelligence. Uses Parakeet ASR (iOS) and speaker diarization across platforms. |\n| **[Whisper Mate](https:\u002F\u002Fwhisper.marksdo.com)** | Transcribes movies and audio locally; records and transcribes in real time from speakers or system apps. Uses speaker diarization. |\n| **[Altic\u002FFluid Voice](https:\u002F\u002Fgithub.com\u002Faltic-dev\u002FFluid-oss)** | Lightweight Fully free and Open Source Voice to Text dictation for macOS built using FluidAudio. Never pay for dictation apps |\n| **[Paraspeech](https:\u002F\u002Fparaspeech.com)** | AI powered voice to text. Fully offline. No subscriptions. |\n| **[mac-whisper-speedtest](https:\u002F\u002Fgithub.com\u002Fanvanvan\u002Fmac-whisper-speedtest)** | Comparison of different local ASR, including one of the first versions of FluidAudio's ASR models |\n| **[Starling](https:\u002F\u002Fgithub.com\u002FRyandonofrio3\u002FStarling)** | Open Source, fully local voice-to-text transcription with auto-paste at your cursor. |\n| **[BoltAI](https:\u002F\u002Fboltai.com\u002F)** | Write content 10x faster using parakeet models |\n| **[Voxeoflow](https:\u002F\u002Fwww.voxeoflow.app)** | Mac dictation app with real-time translation. Lightning-fast transcription in over 100 languages, instantly translated to your target language. |\n| **[Speakmac](https:\u002F\u002Fspeakmac.app)** | Mac app that lets you type anywhere on your Mac using your voice. Fully local, private dictation built on FluidAudio. |\n| **[SamScribe](https:\u002F\u002Fgithub.com\u002FSteven-Weng\u002FSamScribe)** | An open-source macOS app that captures and transcribes audio from your microphone and meeting applications (Zoom, Teams, Chrome) in real-time, with cross-session speaker recognition. |\n| **[WhisKey](https:\u002F\u002Fwhiskey.asktobuild.app\u002F)** | Privacy-first voice dictation keyboard for iOS and macOS. On-device transcription with 12+ languages, AI meeting summaries, and mindmap generation. Great for daily use and vibe-coding. Uses speaker diarization. |\n| **[Dictate Anywhere](https:\u002F\u002Fgithub.com\u002Fhoomanaskari\u002Fmac-dictate-anywhere)** | Native macOS dictation app with global Fn key activation. Dictate into any app with 25 language support. Uses Parakeet ASR. |\n| **[hongbomiao.com](https:\u002F\u002Fgithub.com\u002Fhongbo-miao\u002Fhongbomiao.com)** | A personal R&D lab that facilitates knowledge sharing. Uses Parakeet ASR. |\n| **[Hex](https:\u002F\u002Fgithub.com\u002Fkitlangton\u002FHex)** | macOS app that lets you press-and-hold a hotkey to record your voice, transcribe it, and paste into any application. Uses Parakeet ASR. |\n| **[Super Voice Assistant](https:\u002F\u002Fgithub.com\u002Fykdojo\u002Fsuper-voice-assistant)** | Open-source macOS voice assistant with local transcription. Uses Parakeet ASR. |\n| **[VoiceTypr](https:\u002F\u002Fgithub.com\u002Fmoinulmoin\u002Fvoicetypr)** | Open-source voice-to-text dictation for macOS and Windows. Uses Parakeet ASR. |\n| **[Summit AI Notes](https:\u002F\u002Fsummitnotes.app\u002F)** | Local meeting transcription and summarization with speaker identification. Supports 100+ languages. 
|\n| **[Ora](https:\u002F\u002Ffuturelab.studio\u002Fora)** | Local voice assistant for macOS with speech recognition and text-to-speech. |\n| **[Flowstay](https:\u002F\u002Fflowstay.app)** | Easy text-to-speech, local post-processing and Claude Code integration for macOS. Free forever. |\n| **[macos-speech-server](https:\u002F\u002Fgithub.com\u002Fdokterbob\u002Fmacos-speech-server)** | OpenAI compatible STT\u002Ftranscription and TTS\u002Fspeech API server. |\n| **[Snaply](https:\u002F\u002Fsnaply.ai)** |Free, Fast, 100% local AI dictation for Mac. |\n| **[OpenOats](https:\u002F\u002Fgithub.com\u002Fyazinsai\u002FOpenOats)** | Open-source meeting note-taker that transcribes conversations in real time and surfaces relevant notes from your knowledge base. Uses FluidAudio for local transcription. |\n| **[Enconvo](https:\u002F\u002Fenconvo.com)** | AI Agent Launcher for macOS with voice input, live captions, and text-to-speech. Uses Parakeet ASR for local speech recognition. |\n| **[Meeting Transcriber](https:\u002F\u002Fgithub.com\u002Fpasrom\u002Fmeeting-transcriber)** | macOS menu bar app that auto-detects, records, and transcribes meetings (Teams, Zoom, Webex) with dual-track speaker diarization. Uses Parakeet ASR, Qwen3-ASR, and speaker diarization. |\n| **[Hitoku Draft](https:\u002F\u002Fhitoku.me\u002Fdraft)** | A local, private, voice writing assistant on your macOS menu bar. Uses Parakeet ASR. |\n| **[Audite](https:\u002F\u002Fgithub.com\u002Fzachatrocity\u002Faudite)** | macOS menu-bar app that records meetings and transcribes them locally into Markdown notes for Obsidian. Uses Parakeet ASR via FluidAudio on the Apple Neural Engine. |\n| **[Muesli](https:\u002F\u002Fgithub.com\u002FpHequals7\u002Fmuesli)** | Native macOS dictation and meeting transcription with ~0.13s latency. Captures microphone and system audio with automatic speaker diarization. Uses Parakeet TDT and Qwen3 ASR. |\n| **[NanoVoice](https:\u002F\u002Fapps.apple.com\u002Fkz\u002Fapp\u002Fnanovoice\u002Fid6760539688)** | Free iOS voice keyboard for fast, private dictation in any app. Uses Parakeet ASR. |\n| **[MiniWhisper](https:\u002F\u002Fgithub.com\u002Fandyhtran\u002FMiniWhisper)** | Open-source macOS menu bar for quick local voice-to-text with minimal setup. Pick a shortcut, start talking. Uses Parakeet ASR. |\n| **[Talat](https:\u002F\u002Ftalat.app)** | Privacy-focused AI meeting notes app. Records and transcribes meetings locally on your Mac with speaker identification and LLM-powered summaries. Featured in [TechCrunch](https:\u002F\u002Ftechcrunch.com\u002F2026\u002F03\u002F24\u002Ftalats-ai-meeting-notes-stay-on-your-machine-not-in-the-cloud\u002F). Uses Parakeet ASR. |\n| **[Volocal](https:\u002F\u002Fgithub.com\u002Ffikrikarim\u002Fvolocal)** | Fully local voice AI on iOS. Uses streaming Parakeet EOU ASR and streaming PocketTTS. |\n| **[VivaDicta](https:\u002F\u002Fgithub.com\u002Fn0an\u002FVivaDicta)** | Open-source iOS voice-to-text app with system-wide AI voice keyboard — dictate and AI-process text in any app. 15+ AI providers, 40+ AI presets. Uses Parakeet ASR. |\n| **[MimicScribe](https:\u002F\u002Fmimicscribe.app\u002F)** | macOS menu bar app combining Parakeet TDT streaming ASR, PyanNote Community 1 speaker diarization, and cloud LLMs to provide AI-generated talking points during meetings, derived from the live transcript and user-provided instructions. Features meeting summarization, natural language search, an MCP server for agent integration, and a keyboard- and voice-forward UI. 
|\n| **[Action Phrase](https:\u002F\u002Factionphrase.com\u002F)** | Voice-controlled live production app for iOS, iPadOS, and macOS. Control cameras, graphics, layouts, and production workflows with natural voice commands. Integrates with popular tools including OBS, vMix, ProPresenter, Bitfocus Companion, and more. Uses Parakeet TDT ASR and Sortformer diarization. |\n\n## Installation\n\nAdd FluidAudio to your project using Swift Package Manager:\n\n```swift\ndependencies: [\n    .package(url: \"https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio.git\", from: \"0.12.4\"),\n],\n```\n\n**In Xcode:**\n1. Add the FluidAudio package to your project\n2. In the \"Add Package\" dialog, select `FluidAudio`\n3. Add it to your app target\n\n**In Package.swift:**\n```swift\n.product(name: \"FluidAudio\", package: \"FluidAudio\")\n```\n\n**CocoaPods:** We recommend using [cocoapods-spm](https:\u002F\u002Fgithub.com\u002Ftrinhngocthuyen\u002Fcocoapods-spm) for better SPM integration, but if needed, you can also use our podspec: `pod 'FluidAudio', '~> 0.12.4'`\n\n### Other Frameworks\n\nBuilding with a different framework? Use one of our official wrappers:\n\n| Platform | Package | Install |\n|----------|---------|---------|\n| **React Native \u002F Expo** | [@fluidinference\u002Freact-native-fluidaudio](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Freact-native-fluidaudio) | `npm install @fluidinference\u002Freact-native-fluidaudio` |\n| **Rust \u002F Tauri** | [fluidaudio-rs](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Ffluidaudio-rs) | `cargo add fluidaudio-rs` |\n\n### Post-Processing Tools\n\nEnhance ASR output with post-processing:\n\n| Tool | Description | Language |\n|------|-------------|----------|\n| **[text-processing-rs](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Ftext-processing-rs)** | Inverse Text Normalization (ITN) and Text Normalization (TN) across 7 languages (EN, DE, ES, FR, HI, JA, ZH). 100% NeMo test compatibility (3,011 tests). Converts spoken-form ASR output to written form (\"two hundred\" → \"200\", \"five dollars\" → \"$5\"). Rust port of [NVIDIA NeMo Text Processing](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-text-processing) with Swift wrapper. | Rust, Swift |\n\n## Configuration\n\n### Quick Reference\n\nBoth solve the same problem: **\"I can't reach HuggingFace directly.\"** They're alternative approaches - pick whichever matches your setup:\n\n| Scenario | Solution | Configuration |\n|----------|----------|---|\n| You have a **local mirror or internal model server** | Use Registry URL override | `REGISTRY_URL=https:\u002F\u002Fyour-mirror.com` |\n| You're **behind a corporate firewall** with a proxy that can reach HuggingFace | Use Proxy configuration | `https_proxy=http:\u002F\u002Fproxy.company.com:8080` |\n\n**How they work:**\n- **Registry URL** - App requests from `your-mirror.com` instead of `huggingface.co`\n- **Proxy** - App still requests `huggingface.co`, but traffic routes through proxy to reach it\n\nIn most cases, you only need one. (You'd use both only if your mirror is behind the proxy and unreachable without it.)\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Model Registry URL\u003C\u002Fb> - Change download destination\u003C\u002Fsummary>\n\nBy default, FluidAudio downloads models from HuggingFace. 
You can override this to use a mirror, local server, or air-gapped environment.\n\n**Programmatic override (recommended for apps):**\n```swift\nimport FluidAudio\n\n\u002F\u002F Set custom registry before using any managers\nModelRegistry.baseURL = \"https:\u002F\u002Fyour-mirror.example.com\"\n\n\u002F\u002F Models will now download from the custom registry\nlet diarizer = DiarizerManager()\n```\n\n**Environment Variables (recommended for CLI\u002Ftesting):**\n```bash\n# Use custom registry\nexport REGISTRY_URL=https:\u002F\u002Fyour-mirror.example.com\nswift run fluidaudiocli transcribe audio.wav\n\n# Or use the MODEL_REGISTRY_URL alias\nexport MODEL_REGISTRY_URL=https:\u002F\u002Fmodels.internal.corp\nswift run fluidaudiocli diarization-benchmark --auto-download\n```\n\n**Xcode Scheme Configuration:**\n1. Edit Scheme → Run → Arguments\n2. Go to **Environment Variables** tab\n3. Click `+` and add: `REGISTRY_URL` = `https:\u002F\u002Fyour-mirror.example.com`\n4. The custom registry will apply to all debug runs\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Proxy Configuration\u003C\u002Fb> - Route downloads through a proxy server\u003C\u002Fsummary>\n\nIf you're behind a corporate firewall and cannot reach HuggingFace (or your registry) directly, configure a proxy to forward requests:\n\nSet the `https_proxy` environment variable:\n\n```bash\nexport https_proxy=http:\u002F\u002Fproxy.company.com:8080\n# or for authenticated proxies:\nexport https_proxy=http:\u002F\u002Fuser:password@proxy.company.com:8080\n\nswift run fluidaudiocli transcribe audio.wav\n```\n\n**Xcode Scheme Configuration for Proxy:**\n1. Edit Scheme → Run → Arguments\n2. Go to **Environment Variables** tab\n3. Click `+` and add: `https_proxy` = `http:\u002F\u002Fproxy.company.com:8080`\n4. FluidAudio will automatically route downloads through the proxy\n\n\u003C\u002Fdetails>\n\n## Documentation\n\n**[DeepWiki](https:\u002F\u002Fdeepwiki.com\u002FFluidInference\u002FFluidAudio)** for auto-generated docs for this repo.\n\n### Documentation Index\n\n- Guides\n  - [Audio Conversion for Inference](Documentation\u002FGuides\u002FAudioConversion.md)\n  - Manual model download & loading options: [ASR](Documentation\u002FASR\u002FManualModelLoading.md), [Diarizer](Documentation\u002FDiarization\u002FGettingStarted.md#manual-model-loading), [VAD](Documentation\u002FVAD\u002FGettingStarted.md#manual-model-loading)\n  - Routing Hugging Face (or compatible) requests through a proxy? 
Set `https_proxy` before running the download helpers (see [Documentation\u002FAPI.md](Documentation\u002FAPI.md:9)).\n- Models\n  - Automatic Speech Recognition\u002FTranscription\n    - [Getting Started](Documentation\u002FASR\u002FGettingStarted.md)\n    - [Last Chunk Handling](Documentation\u002FASR\u002FLastChunkHandling.md)\n  - Speaker Diarization\n    - [Speaker Diarization Guide](Documentation\u002FDiarization\u002FGettingStarted.md)\n  - VAD: [Getting Started](Documentation\u002FVAD\u002FGettingStarted.md)\n    - [Segmentation](Documentation\u002FVAD\u002FSegmentation.md)\n    - [Model Conversion Code](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Fmobius)\n- [Benchmarks](Documentation\u002FBenchmarks.md)\n- [API Reference](Documentation\u002FAPI.md)\n- [Command Line Guide](Documentation\u002FCLI.md)\n\n### MCP Server\n\nThe repo is indexed by the DeepWiki MCP server, so your coding tool can access the docs:\n\n```json\n{\n  \"mcpServers\": {\n    \"deepwiki\": {\n      \"url\": \"https:\u002F\u002Fmcp.deepwiki.com\u002Fmcp\"\n    }\n  }\n}\n```\n\nFor Claude Code:\n\n```bash\nclaude mcp add -s user -t http deepwiki https:\u002F\u002Fmcp.deepwiki.com\u002Fmcp\n```\n\n## Automatic Speech Recognition (ASR) \u002F Transcription\n\n- **Models**:\n  - `FluidInference\u002Fparakeet-tdt-0.6b-v3-coreml` (multilingual, 25 European languages)\n  - `FluidInference\u002Fparakeet-tdt-0.6b-v2-coreml` (English-only, highest recall)\n- **Processing Mode**: Batch transcription for complete audio files\n- **Real-time Factor**: ~190x on M4 Pro (processes 1 hour of audio in ~19 seconds)\n- **Streaming Support**: Real-time streaming via `SlidingWindowAsrManager` with sliding window processing and cancellation support\n- **Backend**: The same Parakeet TDT v3 model powers our backend ASR\n\n### ASR Quick Start\n\n```swift\nimport FluidAudio\n\n\u002F\u002F Batch transcription from an audio file\nTask {\n    \u002F\u002F 1) Initialize the ASR manager and load models\n    let models = try await AsrModels.downloadAndLoad(version: .v3)  \u002F\u002F Switch to .v2 for English-only work\n    let asrManager = AsrManager(config: .default)\n    try await asrManager.loadModels(models)\n\n    \u002F\u002F 2) Transcribe 16 kHz mono samples ([Float]) that you have already converted\n    \u002F\u002F    (see the Audio Conversion guide)\n    let result = try await asrManager.transcribe(samples)\n\n    \u002F\u002F Alternatively, transcribe a file by URL\n    \u002F\u002F let url = URL(fileURLWithPath: sample.audioPath)\n\n    \u002F\u002F Or transcribe an AVAudioPCMBuffer directly\n    \u002F\u002F let result = try await asrManager.transcribe(audioBuffer)\n    print(\"Transcription: \\(result.text)\")\n}\n```\n\n```bash\n# Transcribe an audio file (batch)\nswift run fluidaudiocli transcribe audio.wav\n\n# English-only run with higher recall\nswift run fluidaudiocli transcribe audio.wav --model-version v2\n```\n\n
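## Speaker Diarization\n\nSome snippets in this section call a `loadSamples16kMono` helper that is not an SDK symbol. Here is a minimal sketch of one way to write it, assuming the `AudioConverter` API used in the offline example below returns 16 kHz mono `[Float]` samples:\n\n```swift\nimport FluidAudio\n\n\u002F\u002F Hypothetical helper used by the diarization snippets in this README;\n\u002F\u002F wraps AudioConverter, which resamples an audio file for inference.\nfunc loadSamples16kMono(path: String) async throws -> [Float] {\n    try AudioConverter().resampleAudioFile(path: path)\n}\n```\n\n### Offline Speaker Diarization Pipeline\n\nPyannote Community-1 pipeline (powerset segmentation + WeSpeaker + VBx) for offline speaker diarization. 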
Use this for most use cases, see Benchmarks.md for benchmarks.\n\n```swift\nimport FluidAudio\n\nlet config = OfflineDiarizerConfig()\nlet manager = OfflineDiarizerManager(config: config)\ntry await manager.prepareModels()  \u002F\u002F Downloads + compiles Core ML bundles if they are missing\n\nlet samples = try AudioConverter().resampleAudioFile(path: \"meeting.wav\")\nlet result = try await manager.process(audio: samples)\n\nfor segment in result.segments {\n    print(\"\\(segment.speakerId) \\(segment.startTimeSeconds)s → \\(segment.endTimeSeconds)s\")\n}\n```\n\nFor processing audio files, use the file-based API which automatically uses memory-mapped streaming for efficiency:\n\n```swift\nlet url = URL(fileURLWithPath: \"meeting.wav\")\nlet result = try await manager.process(url)\n\nfor segment in result.segments {\n    print(\"\\(segment.speakerId) \\(segment.startTimeSeconds)s → \\(segment.endTimeSeconds)s\")\n}\n```\n\n```bash\n# Process a meeting with full VBx clustering\nswift run fluidaudiocli process ~\u002FFluidAudioDatasets\u002Fami_official\u002Fsdm\u002FES2004a.Mix-Headset.wav \\\n  --mode offline --threshold 0.6 --output es2004a_offline.json\n\n# Run the AMI single-file benchmark with automatic downloads\nswift run fluidaudiocli diarization-benchmark --mode offline --auto-download \\\n  --single-file ES2004a --threshold 0.6 --output offline_results.json\n```\n\n`offline_results.json` contains DER\u002FJER\u002FRTFx along with timing breakdowns for segmentation, embedding extraction, and VBx clustering. CI now runs this workflow on every PR to ensure the offline models stay healthy and the Hugging Face assets remain accessible.\n\n### LS-EEND (LongForm Streaming End-to-End Neural Diarization)\n\nEnd-to-end streaming diarization with CoreML inference. Default choice for online diarization — single model, no clustering pipeline, up to 10 speakers, 100ms frame updates with 900ms tentative preview. Supports both streaming and complete-buffer processing. See [Documentation\u002FDiarization\u002FGettingStarted.md](Documentation\u002FDiarization\u002FGettingStarted.md) for details.\n\n```swift\nimport FluidAudio\n\nTask {\n    let diarizer = LSEENDDiarizer()\n    try await diarizer.initialize(variant: .dihard3)\n\n    let samples = try await loadSamples16kMono(path: \"path\u002Fto\u002Fmeeting.wav\")\n    let timeline = try diarizer.processComplete(samples, sourceSampleRate: 16_000)\n\n    for segment in timeline.segments {\n        print(\"Speaker \\(segment.speakerId): \\(segment.startTimeSeconds)s - \\(segment.endTimeSeconds)s\")\n    }\n}\n```\n\n### Sortformer (End-to-End Neural Diarization)\n\nEnd-to-end neural diarization using [NVIDIA's Sortformer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06656). Secondary streaming diarizer — trades LS-EEND's higher speaker capacity and benchmark results for better speaker identity stability. Limited to 4 speakers. No separate VAD, segmentation, or clustering needed. Licensed under NVIDIA Open Model License.\n\nBoth LS-EEND and Sortformer emit results into a `DiarizerTimeline` with ultra-low-latency updates. See [Documentation\u002FDiarization\u002FSortformer.md](Documentation\u002FDiarization\u002FSortformer.md) for usage and comparison.\n\n### Streaming\u002FOnline Speaker Diarization (Pyannote)\n\nThis pipeline uses segmentation plus speaker embeddings and is the third choice behind LS-EEND and Sortformer. 
It can be useful if you specifically want the classic multi-stage pipeline, but it is much slower than LS-EEND or Sortformer for live diarization.\n\nWhy use the WeSpeaker\u002FPyannote pipeline:\n- More modular pipeline if you want separate segmentation and embedding stages\n- Better fit when you need to integrate external speaker identification or clustering logic\n- Speaker pre-enrollment is reliable\n- Speaker database management is much easier\n- Purging or updating individual speakers is straightforward\n- Not recommended when low-latency live diarization is the priority\n\nIn most applications:\n- Use LS-EEND as the default online diarizer\n- Use Sortformer as the second choice when its stronger identity stability and participant focus matter more than the 4-speaker limit\n- Use the WeSpeaker\u002FPyannote pipeline only when you specifically need its modular design despite the speed cost\n\nTradeoffs:\n- Slower in both inference time and practical latency than LS-EEND or Sortformer\n- Needs larger chunks, with at least 5 seconds usually required for decent results\n- On the upside, and unlike LS-EEND and Sortformer, speaker state can be manipulated explicitly\n\n```swift\nimport FluidAudio\n\n\u002F\u002F Diarize an audio file\nTask {\n    let models = try await DiarizerModels.downloadIfNeeded()\n    let diarizer = DiarizerManager()\n    diarizer.initialize(models: models)\n\n    \u002F\u002F Prepare 16 kHz mono samples (see: Audio Conversion)\n    let samples = try await loadSamples16kMono(path: \"path\u002Fto\u002Fmeeting.wav\")\n\n    \u002F\u002F Run diarization\n    let result = try diarizer.performCompleteDiarization(samples)\n    for segment in result.segments {\n        print(\"Speaker \\(segment.speakerId): \\(segment.startTimeSeconds)s - \\(segment.endTimeSeconds)s\")\n    }\n}\n```\n\nFor diarization streaming, see [Documentation\u002FDiarization\u002FGettingStarted.md](Documentation\u002FDiarization\u002FGettingStarted.md).\n\n```bash\nswift run fluidaudiocli diarization-benchmark --single-file ES2004a \\\n  --chunk-seconds 3 --overlap-seconds 2\n```\n\n### CLI\n\n```bash\n# Process an individual file and save JSON\nswift run fluidaudiocli process meeting.wav --output results.json --threshold 0.6\n```\n\n
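If you integrate external speaker identification or clustering on top of this pipeline, the usual building block is a similarity measure over the extracted speaker embeddings. A minimal sketch, assuming embeddings are plain `[Float]` vectors of equal length (this function is illustrative, not an SDK API):\n\n```swift\n\u002F\u002F Cosine similarity between two speaker embeddings.\n\u002F\u002F Values near 1.0 suggest the same speaker; pick a threshold empirically.\nfunc cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {\n    precondition(a.count == b.count, \"Embedding dimensions must match\")\n    var dot: Float = 0, normA: Float = 0, normB: Float = 0\n    for i in a.indices {\n        dot += a[i] * b[i]\n        normA += a[i] * a[i]\n        normB += b[i] * b[i]\n    }\n    let denom = normA.squareRoot() * normB.squareRoot()\n    return denom > 0 ? dot \u002F denom : 0\n}\n```\n\n## Voice Activity Detection (VAD)\n\nSilero VAD powers our on-device detector. The latest release surfaces the same\ntimestamp extraction and streaming heuristics as the upstream PyTorch\nimplementation. 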
Ping us on Discord if you need help tuning it for your\nenvironment.\n\n### VAD Quick Start (Offline Segmentation)\n\nA single call returns chunk-level probabilities at each 256 ms hop:\n\n```swift\n\u002F\u002F Assumes `manager` is a VadManager and `samples` is 16 kHz mono [Float];\n\u002F\u002F both are set up in the fuller example below.\nlet results = try await manager.process(samples)\nfor (index, chunk) in results.enumerated() {\n    print(\n        String(\n            format: \"Chunk %02d: prob=%.3f, inference=%.4fs\",\n            index,\n            chunk.probability,\n            chunk.processingTime\n        )\n    )\n}\n```\n\nThe following higher-level APIs are better suited to integration with other systems:\n\n```swift\nimport FluidAudio\n\nTask {\n    let manager = try await VadManager(\n        config: VadConfig(defaultThreshold: 0.75)\n    )\n\n    let audioURL = URL(fileURLWithPath: \"path\u002Fto\u002Faudio.wav\")\n    let samples = try AudioConverter().resampleAudioFile(audioURL)\n\n    var segmentation = VadSegmentationConfig.default\n    segmentation.minSpeechDuration = 0.25\n    segmentation.minSilenceDuration = 0.4\n\n    let segments = try await manager.segmentSpeech(samples, config: segmentation)\n    for segment in segments {\n        print(\n            String(format: \"Speech %.2f–%.2fs\", segment.startTime, segment.endTime)\n        )\n    }\n}\n```\n\n### Streaming\n\n```swift\nimport FluidAudio\n\nTask {\n    let manager = try await VadManager()\n    var state = await manager.makeStreamState()\n\n    for chunk in microphoneChunks {\n        let result = try await manager.processStreamingChunk(\n            chunk,\n            state: state,\n            config: .default,\n            returnSeconds: true,\n            timeResolution: 2\n        )\n\n        state = result.state\n\n        \u002F\u002F Access the raw probability (0.0-1.0) for custom logic\n        print(String(format: \"Probability: %.3f\", result.probability))\n\n        if let event = result.event {\n            let label = event.kind == .speechStart ? \"Start\" : \"End\"\n            print(\"\\(label) @ \\(event.time ?? 0)s\")\n        }\n    }\n}\n```\n\n### CLI\n\nStart with the general-purpose `process` command, which runs the diarization\npipeline (and therefore VAD) end-to-end on a single file:\n\n```bash\nswift run fluidaudiocli process path\u002Fto\u002Faudio.wav\n```\n\nOnce you need to experiment with VAD-specific knobs directly, reach for:\n\n```bash\n# Inspect offline segments (default mode)\nswift run fluidaudiocli vad-analyze path\u002Fto\u002Faudio.wav\n\n# Streaming simulation only (timestamps printed in seconds by default)\nswift run fluidaudiocli vad-analyze path\u002Fto\u002Faudio.wav --streaming\n\n# Benchmark accuracy\u002Fprecision trade-offs\nswift run fluidaudiocli vad-benchmark --num-files 50 --threshold 0.3\n```\n\n`swift run fluidaudiocli vad-analyze --help` lists every tuning option, including\nnegative-threshold overrides, max-speech splitting, padding, and chunk size.\nOffline mode also reports RTFx using the model's per-chunk processing time.\n\n
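A common pattern is to gate transcription with VAD: segment speech first, then run ASR only on the speech regions. A minimal sketch combining `VadManager` and `AsrManager` from the examples above (the sample-index arithmetic assumes 16 kHz mono input; the wiring is illustrative rather than an official recipe):\n\n```swift\nimport FluidAudio\n\n\u002F\u002F Illustrative only: feed VAD speech segments into ASR.\n\u002F\u002F Assumes `samples` is 16 kHz mono [Float], as elsewhere in this README.\nTask {\n    let vad = try await VadManager()\n    let asrModels = try await AsrModels.downloadAndLoad(version: .v3)\n    let asr = AsrManager(config: .default)\n    try await asr.loadModels(asrModels)\n\n    let segments = try await vad.segmentSpeech(samples, config: .default)\n    for segment in segments {\n        \u002F\u002F Convert segment times (seconds) to sample indices at 16 kHz\n        let start = max(0, Int(segment.startTime * 16_000))\n        let end = min(samples.count, Int(segment.endTime * 16_000))\n        let text = try await asr.transcribe(Array(samples[start..\u003Cend])).text\n        print(String(format: \"%.2f–%.2fs: %@\", segment.startTime, segment.endTime, text))\n    }\n}\n```\n\n## Text‑To‑Speech (TTS)\n\n> **⚠️ Beta:** TTS currently supports American English only. 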
Additional language support is planned.\n\nFluidAudio ships two TTS backends:\n\n| | PocketTTS | Kokoro |\n|---|---|---|\n| **GPL dependencies** | None | None |\n| **Tokenizer** | SentencePiece | CoreML G2P → IPA phonemes |\n| **Generation** | Frame-by-frame autoregressive (80ms) | Parallel (all frames at once) |\n| **Streaming** | Yes | No |\n| **Voice cloning** | Yes (1–30s audio sample) | No |\n| **Pronunciation control** | No | Yes (SSML, custom lexicon) |\n| **Output** | 24 kHz mono WAV | 24 kHz mono WAV |\n\n### PocketTTS\n\nStreaming-friendly TTS with voice cloning support from short audio samples.\n\n```swift\nimport FluidAudio\n\nTask {\n    let manager = try await PocketTtsManager()\n    let audioData = try await manager.synthesize(\"Hello from FluidAudio.\")\n    try audioData.write(to: URL(fileURLWithPath: \"out.wav\"))\n}\n```\n\n```bash\n# Synthesize with default voice\nswift run fluidaudiocli tts \"Hello from FluidAudio.\" --output out.wav --backend pocket\n\n# Clone a voice from an audio sample\nswift run fluidaudiocli tts \"Hello world.\" --output out.wav --backend pocket --clone-voice speaker.wav\n```\n\n### Kokoro\n\nHigh-quality parallel TTS with SSML and phoneme-level pronunciation control. Uses a CoreML G2P (grapheme-to-phoneme) model for out-of-vocabulary words — no external dependencies required.\n\n```swift\nimport FluidAudio\n\nTask {\n    let manager = KokoroTtsManager()\n    try await manager.initialize()\n    let data = try await manager.synthesize(text: \"Hello from FluidAudio.\")\n    try data.write(to: URL(fileURLWithPath: \"out.wav\"))\n}\n```\n\n```bash\nswift run fluidaudiocli tts \"Hello from FluidAudio.\" --auto-download --output out.wav\n```\n\nDictionary and model assets are cached under `~\u002F.cache\u002Ffluidaudio\u002FModels\u002Fkokoro`.\n\n## Continuous Integration\n\n- `tests.yml`: Default build matrix covering SwiftPM tests and an iOS archive smoke test.\n- `diarizer-benchmark.yml`: Runs the streaming diarization benchmark on ES2004a for regression tracking.\n- `offline-pipeline.yml`: Executes the VBx offline pipeline end-to-end (`fluidaudio diarization-benchmark --mode offline`) and fails if DER\u002FJER drift beyond guardrails or if models fail to download. Use this workflow as a reference for provisioning model caches in your own CI.\n\n## Everything Else\n\n### FAQs\n\n- CLI is available on macOS only. For iOS, use the library programmatically.\n- Models auto-download on first use. 
If your network restricts Hugging Face access, set an HTTPS proxy: `export https_proxy=http:\u002F\u002F127.0.0.1:7890`.\n- Windows alternative in development: [fluid-server](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Ffluid-server)\n- If you're looking to get the system audio on a Mac, take a look at this repo for reference [AudioCap](https:\u002F\u002Fgithub.com\u002Finsidegui\u002FAudioCap\u002Ftree\u002Fmain)\n\n### License\n\nApache 2.0 — see `LICENSE` for details.\n\n### Acknowledgments\n\nThis project builds upon the excellent work of the [sherpa-onnx](https:\u002F\u002Fgithub.com\u002Fk2-fsa\u002Fsherpa-onnx) project for speaker diarization algorithms and techniques.\n\nPyannote: \u003Chttps:\u002F\u002Fgithub.com\u002Fpyannote\u002Fpyannote-audio>\n\nWeSpeaker: \u003Chttps:\u002F\u002Fgithub.com\u002Fwenet-e2e\u002Fwespeaker>\n\nParakeet-mlx: \u003Chttps:\u002F\u002Fgithub.com\u002Fsenstella\u002Fparakeet-mlx>\n\nsilero-vad: \u003Chttps:\u002F\u002Fgithub.com\u002Fsnakers4\u002Fsilero-vad>\n\nKokoro-82M: \u003Chttps:\u002F\u002Fhuggingface.co\u002Fhexgrad\u002FKokoro-82M>\n\n### Citation\n\nIf you use FluidAudio in your work, please cite:\n\nFluidInference Team. (2025). FluidAudio: Local Speaker Diarization, ASR, and VAD for Apple Platforms (Version 0.12.4) [Computer software]. GitHub. \u003Chttps:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio>\n\n```bibtex\n@software{FluidInferenceTeam_FluidAudio_2025,\n  author = {{FluidInference Team}},\n  title = {{FluidAudio: Local Speaker Diarization, ASR, and VAD for Apple Platforms}},\n  year = {2025},\n  month = {3},\n  version = {0.12.4},\n  url = {https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio},\n  note = {Computer software}\n}\n```\n\n---\n\n## Show Your Support\n\nHelp the Fluid Inference community grow by adding a \"Powered by Fluid Inference\" badge to your project!\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Ffluidinference.com\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_84a39d3f1e87.png\" alt=\"Powered by Fluid Inference\" height=\"80\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\nCopy and paste this prompt to your coding agent where you host your homepage:\n\n```text\nAdd a centered 'Powered by Fluid Inference' badge to the footer linking to fluidinference.com. Image: https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_84a39d3f1e87.png. 
Whitelist the image hostname in your framework's config.\n```\n\nOr use one of these code snippets:\n\n\u003Cdetails>\n\u003Csummary>React\u002FNext.js\u003C\u002Fsummary>\n\n```jsx\n\u003Cdiv className=\"flex justify-center py-8\">\n  \u003Ca href=\"https:\u002F\u002Ffluidinference.com\">\n    \u003Cimg\n      src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_84a39d3f1e87.png\"\n      alt=\"Powered by Fluid Inference\"\n      height={80}\n    \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>HTML\u003C\u002Fsummary>\n\n```html\n\u003Cdiv style=\"text-align: center; padding: 20px;\">\n  \u003Ca href=\"https:\u002F\u002Ffluidinference.com\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_84a39d3f1e87.png\" alt=\"Powered by Fluid Inference\" height=\"80\">\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Markdown\u003C\u002Fsummary>\n\n```markdown\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Ffluidinference.com\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_84a39d3f1e87.png\" alt=\"Powered by Fluid Inference\" height=\"80\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n```\n\n\u003C\u002Fdetails>\n","![https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_de63c6ec5a63.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_de63c6ec5a63.png)\n\n# FluidAudio - 使用 CoreML 模型进行转录、文本转语音、VAD 和说话人分离\n\n[![Swift](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSwift-6.0+-orange.svg)](https:\u002F\u002Fswift.org)\n[![平台](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPlatform-macOS%20%7C%20iOS-blue.svg)](https:\u002F\u002Fdeveloper.apple.com)\n[![文档](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocumentation-docs.fluidinference.com-008574.svg)](https:\u002F\u002Fdocs.fluidinference.com\u002Fintroduction)\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-Join%20Chat-7289da.svg)](https:\u002F\u002Fdiscord.gg\u002FWNsvaCtmDe)\n[![Hugging Face 模型](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face%20Models-800k%2B%20downloads-brightgreen?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FFluidInference)\n[![Ask DeepWiki](https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg)](https:\u002F\u002Fdeepwiki.com\u002FFluidInference\u002FFluidAudio)\n\nFluidAudio 是一个适用于 Apple 设备的 Swift SDK，可在本地实现低延迟的音频 AI 处理，并将推理任务卸载到 Apple 神经引擎（ANE），从而减少内存占用并显著提升推理速度。\n\n该 SDK 包含最先进的说话人分离、转录和语音活动检测功能，基于开源模型（MIT\u002FApache 2.0 许可）开发，只需几行代码即可轻松集成。这些模型针对后台处理、环境计算和始终在线的工作负载进行了优化，通过在 ANE 上运行推理来最大限度地降低 CPU 使用率，并完全避免使用 GPU\u002FMPS。\n\n如需定制用例、反馈、额外的模型支持或平台请求，请加入我们的 [Discord](https:\u002F\u002Fdiscord.gg\u002FWNsvaCtmDe) 社区。我们还将推出用于设备端的视觉、语言和 TTS 模型，并会在那里分享最新进展。\n\n以下是几款在 macOS 和 iOS 上使用 Fluid Audio 模型的优秀本地 AI 应用：\n\n\u003Cp align=\"left\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FBeingpax\u002FVoiceInk\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_11448ffcd106.png\" height=\"40\" alt=\"Voice Ink\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fspokenly.app\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_8f49214036ee.png\" height=\"40\" alt=\"Spokenly\">\u003C\u002Fa>\n  \u003Ca 
href=\"https:\u002F\u002Fslipbox.ai\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_4ff51848e56a.png\" height=\"40\" alt=\"Slipbox\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhex.kitlangton.com\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_baca799affef.png\" height=\"40\" alt=\"Hex\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fboltai.com\u002F\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_9158346dcc25.png\" height=\"40\" alt=\"BoltAI\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fparaspeech.com\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_f180f436c2bd.png\" height=\"40\" alt=\"Paraspeech\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Faltic.dev\u002Ffluid\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_fb36cf7b00bd.png\" height=\"40\" alt=\"Fluid Voice\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fsnaply.ai\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_a1a791729f62.png\" height=\"40\" alt=\"Snaply\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fyazinsai\u002FOpenOats\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_2f9dd891491a.png\" height=\"40\" alt=\"OpenOats\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Ftalat.app\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_dc813ca8a54e.png\" height=\"40\" alt=\"Talat\">\u003C\u002Fa>\n\u003C!-- 添加您的应用：通过 PR 提交 logo。Fluid Inference 团队会定期整理此列表，并每两周更新一次展示部分。感谢您的耐心等待。 -->\n\u003C\u002Fp>\n\n想转换您自己的模型吗？请查看 [möbius](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Fmobius)。\n\n## 亮点\n\n- **自动语音识别 (ASR)**：[Parakeet TDT v3](Documentation\u002FModels.md#batch-transcription-near-real-time)（0.6b）及其他 TDT\u002FCTC 模型，支持批量转录，涵盖 25 种欧洲语言、日语和中文；[Parakeet EOU](Documentation\u002FModels.md#streaming-transcription-true-real-time)（120m），用于流式 ASR 并具备话语结束检测功能（仅支持英语）。更多 ASR 模型请参阅 [Documentation\u002FModels.md#asr-models]。\n- **逆向文本规范化 (ITN)**：对 ASR 输出进行后处理，将口语形式转换为书面形式（“two hundred” → “200”）。详情请参阅 [text-processing-rs](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Ftext-processing-rs)。\n- **文本转语音 (TTS)**：Kokoro（82m），支持 SSML 和发音控制，可并行合成 9 种语言（EN、ES、FR、HI、IT、JA、PT、ZH）；PocketTTS 则支持流式 TTS 及语音克隆功能（仅限英语）。\n- **说话人分离（在线 + 离线）**：能够从音频流中分离并识别不同说话人。提供实时处理的流式管道以及具备高级聚类功能的离线批处理管道。\n- **说话人嵌入提取**：生成说话人嵌入，用于语音比较和聚类，也可用于说话人身份识别。\n- **语音活动检测 (VAD)**：采用 Silero 模型进行语音活动检测。\n- **Apple 神经引擎**：模型高效运行于 Apple 的 ANE 上，以最小的功耗实现最佳性能。\n- **开源模型**：所有模型均在 HuggingFace 上公开可用——由我们的团队转换并优化；采用宽松许可协议。完整模型目录请参阅 [Documentation\u002FModels.md]。\n\n## 视频演示\n\n| 链接 | 描述 |\n| --- | --- |\n| **[Spokenly 实时 ASR](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=9fXKKkyL8JE)** | FluidAudio 转录准确性和速度的视频演示 |\n| **[Senko 集成](https:\u002F\u002Fx.com\u002Fhamza_q_\u002Fstatus\u002F1970228971657928995)** | 使用 FluidAudio 分割模型在 Mac 上进行 Python 说话人分离 |\n| **[Kokoro TTS](https:\u002F\u002Fx.com\u002Fsach1n\u002Fstatus\u002F1977817056507793521)** | 在 iOS 上使用 FluidAudio 的 Kokoro 和 Silero 模型进行文本转语音演示 |\n| **[Parakeet 实时 EOU](https:\u002F\u002Fx.com\u002Fsach1n\u002Fstatus\u002F2003210626659680762)** | 在 iOS 上使用 Parakeet 进行流式 ASR，并具备话语结束检测功能 |\n| **[Sortformer 
说话人分离](https:\u002F\u002Fx.com\u002FAlex_tra_memory\u002Fstatus\u002F2010530705667661843)** | 使用 Sortformer 在 iOS 上处理重叠语音的说话人分离 |\n| **[PocketTTS](https:\u002F\u002Fx.com\u002Fsach1n\u002Fstatus\u002F2017627657006158296)** | 在 iOS 上使用 PocketTTS 进行流式文本转语音 |\n| **[Parakeet EOU 超低延迟](https:\u002F\u002Fx.com\u002Fy_earu\u002Fstatus\u002F2038654262608064967)** | 在 iOS 上演示实时 Parakeet EOU 转录，展现超低延迟的语音转文字能力 |\n| **[动作短语直播制作控制](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=ykcvdTHHmrk)** | 基于 FluidAudio 的 ASR 和说话人分离技术，通过自然语音命令控制摄像机、图形和布局的直播制作流程 |\n\n## 展示区\n\n如果您希望将自己的应用加入展示列表，请提交 PR，并请按时间顺序排列。\n\n| 应用 | 描述 |\n| --- | --- |\n| **[Voice Ink](https:\u002F\u002Ftryvoiceink.com\u002F)** | 本地AI，实现即时、私密的转录，准确率接近完美。采用Parakeet ASR技术。 |\n| **[Spokenly](https:\u002F\u002Fspokenly.app\u002F)** | Mac上的语音输入应用，支持快速、精准的语音转文字；同时支持实时听写和文件转录。使用Parakeet ASR及说话人分离技术。 |\n| **[Senko](https:\u002F\u002Fgithub.com\u002Fnarcotic-sh\u002Fsenko)** | 高速且精确的说话人分离流水线。提供了一个很好的示例，展示如何将FluidAudio集成到Python应用中：[链接](https:\u002F\u002Fgithub.com\u002Fnarcotic-sh\u002Fsenko\u002Fcommit\u002F51dbd8bde764c3c6648dbbae57d6aff66c5ca15c)。 |\n| **[Slipbox](https:\u002F\u002Fslipbox.ai\u002F)** | 注重隐私的会议助手，提供实时对话智能分析。跨平台使用Parakeet ASR和说话人分离技术。 |\n| **[Whisper Mate](https:\u002F\u002Fwhisper.marksdo.com)** | 在本地转录电影和音频；可从扬声器或系统应用中实时录音并转录。使用说话人分离技术。 |\n| **[Altic\u002FFluid Voice](https:\u002F\u002Fgithub.com\u002Faltic-dev\u002FFluid-oss)** | 基于FluidAudio构建的轻量级、完全免费且开源的macOS语音转文字听写工具。再也不必为听写应用付费。 |\n| **[Paraspeech](https:\u002F\u002Fparaspeech.com)** | AI驱动的语音转文字服务。完全离线运行，无需订阅。 |\n| **[mac-whisper-speedtest](https:\u002F\u002Fgithub.com\u002Fanvanvan\u002Fmac-whisper-speedtest)** | 对不同本地ASR的比较，其中包括FluidAudio早期版本的ASR模型之一。 |\n| **[Starling](https:\u002F\u002Fgithub.com\u002FRyandonofrio3\u002FStarling)** | 开源、完全本地化的语音转文字转录工具，在光标处自动粘贴文本。 |\n| **[BoltAI](https:\u002F\u002Fboltai.com\u002F)** | 使用Parakeet模型，让内容创作速度提升10倍。 |\n| **[Voxeoflow](https:\u002F\u002Fwww.voxeoflow.app)** | Mac上的语音输入应用，具备实时翻译功能。以闪电般的速度转录超过100种语言，并即时翻译为目标语言。 |\n| **[Speakmac](https:\u002F\u002Fspeakmac.app)** | 一款Mac应用，允许用户通过语音在Mac的任何位置进行输入。基于FluidAudio构建的完全本地化、私密的语音输入工具。 |\n| **[SamScribe](https:\u002F\u002Fgithub.com\u002FSteven-Weng\u002FSamScribe)** | 一款开源的macOS应用，能够实时捕获并转录来自麦克风以及会议软件（Zoom、Teams、Chrome）的音频，并支持跨会话的说话人识别。 |\n| **[WhisKey](https:\u002F\u002Fwhiskey.asktobuild.app\u002F)** | 注重隐私的iOS和macOS语音输入键盘。设备端转录支持12种以上语言，提供AI会议摘要和思维导图生成功能。非常适合日常使用和氛围编码。使用说话人分离技术。 |\n| **[Dictate Anywhere](https:\u002F\u002Fgithub.com\u002Fhoomanaskari\u002Fmac-dictate-anywhere)** | 原生macOS语音输入应用，可通过全局Fn键激活。支持25种语言，可在任意应用中进行语音输入。使用Parakeet ASR技术。 |\n| **[hongbomiao.com](https:\u002F\u002Fgithub.com\u002Fhongbo-miao\u002Fhongbomiao.com)** | 一个个人研发实验室，致力于知识共享。使用Parakeet ASR技术。 |\n| **[Hex](https:\u002F\u002Fgithub.com\u002Fkitlangton\u002FHex)** | 一款macOS应用，按下并长按快捷键即可录制语音、转录并粘贴到任何应用程序中。使用Parakeet ASR技术。 |\n| **[Super Voice Assistant](https:\u002F\u002Fgithub.com\u002Fykdojo\u002Fsuper-voice-assistant)** | 开源的macOS语音助手，具备本地转录功能。使用Parakeet ASR技术。 |\n| **[VoiceTypr](https:\u002F\u002Fgithub.com\u002Fmoinulmoin\u002Fvoicetypr)** | 开源的macOS和Windows语音转文字听写工具。使用Parakeet ASR技术。 |\n| **[Summit AI Notes](https:\u002F\u002Fsummitnotes.app\u002F)** | 本地会议转录与总结，附带说话人识别功能。支持100多种语言。 |\n| **[Ora](https:\u002F\u002Ffuturelab.studio\u002Fora)** | macOS上的本地语音助手，具备语音识别和文本转语音功能。 |\n| **[Flowstay](https:\u002F\u002Fflowstay.app)** | 简单易用的文本转语音功能，结合本地后处理及Claude Code集成，适用于macOS。永久免费。 |\n| **[macos-speech-server](https:\u002F\u002Fgithub.com\u002Fdokterbob\u002Fmacos-speech-server)** | 
兼容OpenAI的STT\u002F转录与TTS\u002F语音API服务器。 |\n| **[Snaply](https:\u002F\u002Fsnaply.ai)** | 免费、快速、100%本地化的Mac端AI语音输入工具。 |\n| **[OpenOats](https:\u002F\u002Fgithub.com\u002Fyazinsai\u002FOpenOats)** | 开源的会议记录工具，能够实时转录对话，并从知识库中提取相关笔记。使用FluidAudio进行本地转录。 |\n| **[Enconvo](https:\u002F\u002Fenconvo.com)** | macOS上的AI智能体启动器，支持语音输入、实时字幕和文本转语音功能。采用Parakeet ASR进行本地语音识别。 |\n| **[Meeting Transcriber](https:\u002F\u002Fgithub.com\u002Fpasrom\u002Fmeeting-transcriber)** | 一款macOS菜单栏应用，可自动检测、录制并转录Teams、Zoom、Webex等会议，配备双轨说话人分离功能。使用Parakeet ASR、Qwen3-ASR及说话人分离技术。 |\n| **[Hitoku Draft](https:\u002F\u002Fhitoku.me\u002Fdraft)** | 一款位于macOS菜单栏的本地、私密的语音写作助手。使用Parakeet ASR技术。 |\n| **[Audite](https:\u002F\u002Fgithub.com\u002Fzachatrocity\u002Faudite)** | 一款macOS菜单栏应用，用于录制会议并将内容本地转录为Markdown格式的笔记，导入Obsidian使用。通过FluidAudio在Apple Neural Engine上运行Parakeet ASR。 |\n| **[Muesli](https:\u002F\u002Fgithub.com\u002FpHequals7\u002Fmuesli)** | 原生macOS语音输入与会议转录工具，延迟约0.13秒。可捕获麦克风和系统音频，并自动进行说话人分离。使用Parakeet TDT和Qwen3 ASR技术。 |\n| **[NanoVoice](https:\u002F\u002Fapps.apple.com\u002Fkz\u002Fapp\u002Fnanovoice\u002Fid6760539688)** | 免费的iOS语音键盘，适用于任何应用中的快速、私密语音输入。使用Parakeet ASR技术。 |\n| **[MiniWhisper](https:\u002F\u002Fgithub.com\u002Fandyhtran\u002FMiniWhisper)** | 开源的macOS菜单栏工具，只需简单设置即可实现快速的本地语音转文字。选择快捷键，开始说话。使用Parakeet ASR技术。 |\n| **[Talat](https:\u002F\u002Ftalat.app)** | 注重隐私的AI会议笔记应用。在Mac本地录制并转录会议，具备说话人识别和LLM驱动的摘要功能。曾被[TechCrunch](https:\u002F\u002Ftechcrunch.com\u002F2026\u002F03\u002F24\u002Ftalats-ai-meeting-notes-stay-on-your-machine-not-in-the-cloud\u002F)报道。使用Parakeet ASR技术。 |\n| **[Volocal](https:\u002F\u002Fgithub.com\u002Ffikrikarim\u002Fvolocal)** | iOS上的全本地语音AI。采用流式Parakeet EOU ASR和流式PocketTTS技术。 |\n| **[VivaDicta](https:\u002F\u002Fgithub.com\u002Fn0an\u002FVivaDicta)** | 开源的iOS语音转文字应用，内置全系统AI语音键盘——可在任何应用中进行语音输入并由AI处理文本。支持15+家AI服务商、40+种AI预设。使用Parakeet ASR技术。 |\n| **[MimicScribe](https:\u002F\u002Fmimicscribe.app\u002F)** | 一款macOS菜单栏应用，结合Parakeet TDT流式ASR、PyanNote Community的单说话人分离技术以及云端LLMs，根据实时转录内容和用户指令生成会议讨论要点。具备会议总结、自然语言搜索、用于智能体集成的MCP服务器，以及以键盘和语音为主的用户界面。 |\n| **[Action Phrase](https:\u002F\u002Factionphrase.com\u002F)** | 一款面向iOS、iPadOS和macOS的语音控制直播制作应用。可通过自然语音命令控制摄像机、图形、布局和制作流程。可与OBS、vMix、ProPresenter、Bitfocus Companion等主流工具无缝集成。使用Parakeet TDT ASR和Sortformer说话人分离技术。 |\n\n## 安装\n\n使用 Swift 包管理器将 FluidAudio 添加到您的项目中：\n\n```swift\ndependencies: [\n    .package(url: \"https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio.git\", from: \"0.12.4\"),\n],\n```\n\n**在 Xcode 中：**\n1. 将 FluidAudio 包添加到您的项目中。\n2. 在“添加包”对话框中，选择 `FluidAudio`。\n3. 
将其添加到您的应用目标。\n\n**在 Package.swift 中：**\n```swift\n.product(name: \"FluidAudio\", package: \"FluidAudio\")\n```\n\n**CocoaPods：** 我们建议使用 [cocoapods-spm](https:\u002F\u002Fgithub.com\u002Ftrinhngocthuyen\u002Fcocoapods-spm) 以获得更好的 SPM 集成，但如有需要，您也可以使用我们的 podspec：`pod 'FluidAudio', '~> 0.12.4'`。\n\n### 其他框架\n\n如果您使用的是其他框架，请使用我们的官方封装库：\n\n| 平台 | 包 | 安装 |\n|----------|---------|---------|\n| **React Native \u002F Expo** | [@fluidinference\u002Freact-native-fluidaudio](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Freact-native-fluidaudio) | `npm install @fluidinference\u002Freact-native-fluidaudio` |\n| **Rust \u002F Tauri** | [fluidaudio-rs](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Ffluidaudio-rs) | `cargo add fluidaudio-rs` |\n\n### 后处理工具\n\n使用后处理工具增强 ASR 输出：\n\n| 工具 | 描述 | 语言 |\n|------|-------------|----------|\n| **[text-processing-rs](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Ftext-processing-rs)** | 反向文本规范化 (ITN) 和文本规范化 (TN)，支持 7 种语言（EN、DE、ES、FR、HI、JA、ZH）。100% 兼容 NeMo 测试（3,011 个测试）。将口语形式的 ASR 输出转换为书面形式（“two hundred” → “200”，“five dollars” → “$5”）。基于 [NVIDIA NeMo Text Processing](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-text-processing) 的 Rust 实现，并提供 Swift 封装。 | Rust、Swift |\n\n## 配置\n\n### 快速参考\n\n两种方案解决的是同一个问题：“我无法直接访问 HuggingFace。”它们是不同的方法，您可以根据自己的网络环境选择适合的一种：\n\n| 场景 | 解决方案 | 配置 |\n|----------|----------|---|\n| 您有一个 **本地镜像或内部模型服务器** | 使用注册表 URL 替换 | `REGISTRY_URL=https:\u002F\u002Fyour-mirror.com` |\n| 您位于 **公司防火墙之后**，且代理可以访问 HuggingFace | 使用代理配置 | `https_proxy=http:\u002F\u002Fproxy.company.com:8080` |\n\n**工作原理：**\n- **注册表 URL**：应用程序从 `your-mirror.com` 而不是 `huggingface.co` 请求。\n- **代理**：应用程序仍然请求 `huggingface.co`，但流量通过代理路由以到达该地址。\n\n在大多数情况下，您只需要其中一种。只有当您的镜像本身也位于代理之后、不经过代理便无法访问时，才需要同时使用两者。\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>模型注册表 URL\u003C\u002Fb> - 更改下载目标\u003C\u002Fsummary>\n\n默认情况下，FluidAudio 会从 HuggingFace 下载模型。您可以将其覆盖为镜像、本地服务器，或用于隔离网络（air-gapped）环境。\n\n**程序化覆盖（推荐用于应用程序）：**\n```swift\nimport FluidAudio\n\n\u002F\u002F 在使用任何管理器之前设置自定义注册表\nModelRegistry.baseURL = \"https:\u002F\u002Fyour-mirror.example.com\"\n\n\u002F\u002F 现在模型将从自定义注册表下载\nlet diarizer = DiarizerManager()\n```\n\n**环境变量（推荐用于 CLI\u002F测试）：**\n```bash\n# 使用自定义注册表\nexport REGISTRY_URL=https:\u002F\u002Fyour-mirror.example.com\nswift run fluidaudiocli transcribe audio.wav\n\n# 或者使用 MODEL_REGISTRY_URL 别名\nexport MODEL_REGISTRY_URL=https:\u002F\u002Fmodels.internal.corp\nswift run fluidaudiocli diarization-benchmark --auto-download\n```\n\n**Xcode 方案配置：**\n1. 编辑方案 → 运行 → 参数\n2. 转到 **环境变量** 选项卡\n3. 点击 `+` 并添加：`REGISTRY_URL` = `https:\u002F\u002Fyour-mirror.example.com`\n4. 自定义注册表将应用于所有调试运行\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>代理配置\u003C\u002Fb> - 通过代理服务器路由下载\u003C\u002Fsummary>\n\n如果您位于公司防火墙之后，无法直接访问 HuggingFace（或您的注册表），请配置代理以转发请求：\n\n设置 `https_proxy` 环境变量：\n\n```bash\nexport https_proxy=http:\u002F\u002Fproxy.company.com:8080\n# 或对于需要身份验证的代理：\nexport https_proxy=http:\u002F\u002Fuser:password@proxy.company.com:8080\n\nswift run fluidaudiocli transcribe audio.wav\n```\n\n**Xcode 方案中的代理配置：**\n1. 编辑方案 → 运行 → 参数\n2. 转到 **环境变量** 选项卡\n3. 点击 `+` 并添加：`https_proxy` = `http:\u002F\u002Fproxy.company.com:8080`\n4. FluidAudio 将自动通过代理路由下载\n\n\u003C\u002Fdetails>
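\n\n如果希望应用在运行时也识别 `REGISTRY_URL` 环境变量（与 CLI 的行为保持一致），可以在启动时将其映射到上文的 `ModelRegistry.baseURL`。下面是一个最小示意：`ModelRegistry.baseURL` 取自上文 README，读取环境变量使用 Foundation 的 `ProcessInfo`；回退顺序是本示例自拟的假设，并非官方行为：\n\n```swift\nimport Foundation\nimport FluidAudio\n\n\u002F\u002F 启动时调用一次：优先使用 REGISTRY_URL，其次是 MODEL_REGISTRY_URL 别名（假设的约定）\nfunc configureModelRegistryFromEnvironment() {\n    let env = ProcessInfo.processInfo.environment\n    if let mirror = env[\"REGISTRY_URL\"] ?? env[\"MODEL_REGISTRY_URL\"] {\n        \u002F\u002F 与上文“程序化覆盖”等效，只是改为从环境变量读取\n        ModelRegistry.baseURL = mirror\n    }\n    \u002F\u002F 未设置时保持默认的 HuggingFace 下载源\n}\n```\n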
\n## 文档\n\n**[DeepWiki](https:\u002F\u002Fdeepwiki.com\u002FFluidInference\u002FFluidAudio)** 提供此仓库的自动生成文档。\n\n### 文档索引\n\n- 指南\n  - [推理用音频转换](Documentation\u002FGuides\u002FAudioConversion.md)\n  - 手动下载和加载模型的选项：[ASR](Documentation\u002FASR\u002FManualModelLoading.md)、[Diarizer](Documentation\u002FDiarization\u002FGettingStarted.md#manual-model-loading)、[VAD](Documentation\u002FVAD\u002FGettingStarted.md#manual-model-loading)\n  - 如果需要通过代理路由 Hugging Face（或兼容）请求，请在运行下载助手之前设置 `https_proxy`（参见 [Documentation\u002FAPI.md](Documentation\u002FAPI.md:9)）。\n- 模型\n  - 自动语音识别\u002F转录\n    - [入门指南](Documentation\u002FASR\u002FGettingStarted.md)\n    - [最后片段处理](Documentation\u002FASR\u002FLastChunkHandling.md)\n  - 说话人分离\n    - [说话人分离指南](Documentation\u002FDiarization\u002FGettingStarted.md)\n  - VAD：[入门指南](Documentation\u002FVAD\u002FGettingStarted.md)\n    - [分割](Documentation\u002FVAD\u002FSegmentation.md)\n    - [模型转换代码](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Fmobius)\n- [基准测试](Documentation\u002FBenchmarks.md)\n- [API 参考](Documentation\u002FAPI.md)\n- [命令行指南](Documentation\u002FCLI.md)\n\n### MCP 服务器\n\n该仓库已编入 DeepWiki MCP 服务器索引，因此您的编码工具可以访问文档：\n\n```json\n{\n  \"mcpServers\": {\n    \"deepwiki\": {\n      \"url\": \"https:\u002F\u002Fmcp.deepwiki.com\u002Fmcp\"\n    }\n  }\n}\n```\n\n对于 Claude Code：\n\n```bash\nclaude mcp add -s user -t http deepwiki https:\u002F\u002Fmcp.deepwiki.com\u002Fmcp\n```\n\n## 自动语音识别 (ASR) \u002F 转录\n\n- **模型**：\n  - `FluidInference\u002Fparakeet-tdt-0.6b-v3-coreml`（多语言，涵盖 25 种欧洲语言）\n  - `FluidInference\u002Fparakeet-tdt-0.6b-v2-coreml`（仅英语，召回率最高）\n- **处理模式**：批量转录完整音频文件\n- **实时因子**：在 M4 Pro 上约为 190 倍（约 19 秒内处理 1 小时音频）\n- **流式支持**：通过 `SlidingWindowAsrManager` 支持实时流式处理和取消操作\n- **服务端**：我们的服务端（后端）ASR 使用相同的 Parakeet TDT v3 模型。\n\n### ASR 快速入门\n\n```swift\nimport FluidAudio\n\n\u002F\u002F 从音频文件进行批量转录\nTask {\n    \u002F\u002F 1) 初始化 ASR 管理器并加载模型\n    let models = try await AsrModels.downloadAndLoad(version: .v3)  \u002F\u002F 切换到 .v2 以仅支持英语\n    let asrManager = AsrManager(config: .default)\n    try await asrManager.loadModels(models)\n\n    \u002F\u002F 2) 转录音频样本（需已重采样为 16 kHz 单声道）\n    let result = try await asrManager.transcribe(samples)\n\n    \u002F\u002F 3) 或者转录文件\n    \u002F\u002F let url = URL(fileURLWithPath: sample.audioPath)\n\n    \u002F\u002F 4) 或者转录 AVAudioPCMBuffer\n    \u002F\u002F let result = try await asrManager.transcribe(audioBuffer)\n    print(\"转录结果：\\(result.text)\")\n}\n```\n\n```bash\n# 批量转录音频文件\nswift run fluidaudiocli transcribe audio.wav\n\n# 使用仅限英语、召回率更高的 v2 模型\nswift run fluidaudiocli transcribe audio.wav --model-version v2\n```\n
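\n下面把上述片段串成一个端到端的小示例：先用 `AudioConverter` 把任意格式的音频重采样为 16 kHz 单声道样本（该用法取自下文“说话人分离”一节的示例），再交给 `AsrManager` 批量转录。这只是依据本文已有 API 的示意，参数与错误处理从简：\n\n```swift\nimport FluidAudio\n\nTask {\n    \u002F\u002F 加载并初始化 v3 多语言模型（与上文快速入门一致）\n    let models = try await AsrModels.downloadAndLoad(version: .v3)\n    let asrManager = AsrManager(config: .default)\n    try await asrManager.loadModels(models)\n\n    \u002F\u002F 将任意采样率\u002F声道数的音频文件重采样为 16 kHz 单声道样本\n    let samples = try AudioConverter().resampleAudioFile(path: \"interview.m4a\")\n\n    \u002F\u002F 批量转录并输出结果\n    let result = try await asrManager.transcribe(samples)\n    print(\"转录结果：\\(result.text)\")\n}\n```\n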
\n## 说话人分离\n\n### 离线说话人分离流程\n\nPyannote Community-1 流程（幂集分割 + WeSpeaker + VBx）用于离线说话人分离，适用于大多数用例；基准测试请参阅 Benchmarks.md。\n\n```swift\nimport FluidAudio\n\nlet config = OfflineDiarizerConfig()\nlet manager = OfflineDiarizerManager(config: config)\ntry await manager.prepareModels()  \u002F\u002F 如果缺少 Core ML 包，则下载并编译\n\nlet samples = try AudioConverter().resampleAudioFile(path: \"meeting.wav\")\nlet result = try await manager.process(audio: samples)\n\nfor segment in result.segments {\n    print(\"\\(segment.speakerId) \\(segment.startTimeSeconds)s → \\(segment.endTimeSeconds)s\")\n}\n```\n\n对于音频文件处理，使用基于文件的 API，它会自动采用内存映射流式传输以提高效率：\n\n```swift\nlet url = URL(fileURLWithPath: \"meeting.wav\")\nlet result = try await manager.process(url)\n\nfor segment in result.segments {\n    print(\"\\(segment.speakerId) \\(segment.startTimeSeconds)s → \\(segment.endTimeSeconds)s\")\n}\n```\n\n```bash\n# 使用完整的 VBx 聚类处理会议\nswift run fluidaudiocli process ~\u002FFluidAudioDatasets\u002Fami_official\u002Fsdm\u002FES2004a.Mix-Headset.wav \\\n  --mode offline --threshold 0.6 --output es2004a_offline.json\n\n# 运行 AMI 单文件基准测试并自动下载\nswift run fluidaudiocli diarization-benchmark --mode offline --auto-download \\\n  --single-file ES2004a --threshold 0.6 --output offline_results.json\n```\n\n`offline_results.json` 包含 DER\u002FJER\u002FRTFx，以及分割、嵌入提取和 VBx 聚类的耗时分解。CI 现在会在每个 PR 上运行此工作流，以确保离线模型保持健康，并且 Hugging Face 资产始终可访问。\n\n### LS-EEND（长时流式端到端神经网络说话人分离）\n\n使用 CoreML 推理的端到端流式说话人分离，是在线说话人分离的默认选择——单个模型、无需聚类流程，最多支持 10 名说话人，每 100 毫秒更新一帧，并提供 900 毫秒的暂定预览。支持流式处理和完整缓冲区处理。详细信息请参阅 [Documentation\u002FDiarization\u002FGettingStarted.md](Documentation\u002FDiarization\u002FGettingStarted.md)。\n\n```swift\nimport FluidAudio\n\nTask {\n    let diarizer = LSEENDDiarizer()\n    try await diarizer.initialize(variant: .dihard3)\n\n    let samples = try await loadSamples16kMono(path: \"path\u002Fto\u002Fmeeting.wav\")\n    let timeline = try diarizer.processComplete(samples, sourceSampleRate: 16_000)\n\n    for segment in timeline.segments {\n        print(\"说话人 \\(segment.speakerId): \\(segment.startTimeSeconds)s - \\(segment.endTimeSeconds)s\")\n    }\n}\n```\n\n### Sortformer（端到端神经网络说话人分离）\n\n使用 [NVIDIA 的 Sortformer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06656) 的端到端神经网络说话人分离，作为次选的流式说话人分离工具——虽然其说话人身份更稳定、对参与者的关注更聚焦，但相比 LS-EEND 在说话人容量和基准测试结果上有所牺牲。最多支持 4 名说话人，无需单独的 VAD、分割或聚类，受 NVIDIA 开放模型许可证约束。\n\nLS-EEND 和 Sortformer 都会以超低延迟更新的方式输出 `DiarizerTimeline`。使用方法及对比请参阅 [Documentation\u002FDiarization\u002FSortformer.md](Documentation\u002FDiarization\u002FSortformer.md)。\n\n### 流式\u002F在线说话人分离（Pyannote）\n\n该流程使用“分割 + 说话人嵌入”的经典组合，是继 LS-EEND 和 Sortformer 之后的第三选择。如果您特别需要经典的多阶段流程，它可以派上用场，但在实时说话人分离方面，它的速度远不及 LS-EEND 或 Sortformer。\n\n为何使用 WeSpeaker\u002FPyannote 流程：\n- 如果您希望分离分割和嵌入阶段，此流程更具模块化\n- 当您需要集成外部说话人识别或聚类逻辑时，此流程更合适\n- 说话人预先注册可靠（示例见本节末尾）\n- 说话人数据库管理更加容易\n- 清除或更新个别说话人非常简单\n- 不建议在优先考虑低延迟实时说话人分离时使用\n\n在大多数应用中：\n- 将 LS-EEND 作为默认的在线说话人分离工具\n- 当 Sortformer 更强的身份稳定性和更聚焦的参与者关注比 4 名说话人的上限更重要时，将其作为第二选择\n- 仅当您特别需要其模块化设计、并且可以接受速度损失时，才使用 WeSpeaker\u002FPyannote 流程\n\n权衡：\n- 在推理时间和实际延迟方面都比 LS-EEND 或 Sortformer 慢\n- 需要较大的数据块，通常至少需要 5 秒才能获得较好的结果\n- 与 LS-EEND 和 Sortformer 不同，说话人状态可以被显式操控\n\n```swift\nimport FluidAudio\n\n\u002F\u002F 对音频文件进行说话人分离\nTask {\n    let models = try await DiarizerModels.downloadIfNeeded()\n    let diarizer = DiarizerManager()\n    diarizer.initialize(models: models)\n\n    \u002F\u002F 准备 16 kHz 单声道样本（参见：音频转换）\n    let samples = try await loadSamples16kMono(path: \"path\u002Fto\u002Fmeeting.wav\")\n\n    \u002F\u002F 进行说话人分离\n    let result = try diarizer.performCompleteDiarization(samples)\n    for segment in result.segments {\n        print(\"说话人 \\(segment.speakerId): \\(segment.startTimeSeconds)s - \\(segment.endTimeSeconds)s\")\n    }\n}\n```\n\n关于流式说话人分离，请参阅 [Documentation\u002FDiarization\u002FGettingStarted.md](Documentation\u002FDiarization\u002FGettingStarted.md)。\n\n```bash\nswift run fluidaudiocli diarization-benchmark --single-file ES2004a \\\n  --chunk-seconds 3 --overlap-seconds 2\n```\n\n### CLI\n\n```bash\n# 处理单个文件并保存 JSON\nswift run fluidaudiocli process meeting.wav --output results.json --threshold 0.6\n```\n
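\n上文提到 WeSpeaker\u002FPyannote 流程支持说话人预先注册。下面是一个示意性的预注册草图：`extractSpeakerEmbedding(from:)` 与 `primeWithAudio(_:)` 这两个 API 名称来自本页 v0.12.3 的发行说明（“用于以已知说话人音频预热说话人分离器”），但此处的具体调用方式和参数形态是本示例的假设，请以官方 API 文档为准：\n\n```swift\nimport FluidAudio\n\nTask {\n    let models = try await DiarizerModels.downloadIfNeeded()\n    let diarizer = DiarizerManager()\n    diarizer.initialize(models: models)\n\n    \u002F\u002F 假设：用 Alice 的一段已知录音预热说话人状态（API 形态见上方说明）\n    let aliceSamples = try AudioConverter().resampleAudioFile(path: \"alice_sample.wav\")\n    try await diarizer.primeWithAudio(aliceSamples)\n\n    \u002F\u002F 之后的完整说话人分离会倾向于匹配已预注册的说话人\n    let meeting = try AudioConverter().resampleAudioFile(path: \"meeting.wav\")\n    let result = try diarizer.performCompleteDiarization(meeting)\n    for segment in result.segments {\n        print(\"说话人 \\(segment.speakerId): \\(segment.startTimeSeconds)s - \\(segment.endTimeSeconds)s\")\n    }\n}\n```\n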
\n## 语音活动检测（VAD）\n\nSilero VAD 为我们设备端的检测器提供支持。最新版本沿用了上游 PyTorch 实现中的时间戳提取与流式启发式方法。如果您需要针对自己的环境调整设置，请在 Discord 上联系我们。\n\n### VAD 快速入门（离线分割）\n\n最简单的调用方式：以 256 毫秒为步长，逐块返回语音概率：\n\n```swift\n\u002F\u002F manager 为已初始化的 VadManager（初始化方式见下一示例）\nlet results = try await manager.process(samples)\nfor (index, chunk) in results.enumerated() {\n    print(\n        String(\n            format: \"分块 %02d: 概率=%.3f, 推理时间=%.4fs\",\n            index,\n            chunk.probability,\n            chunk.processingTime\n        )\n    )\n}\n```\n\n以下是更适合与其他系统集成的高级 API：\n\n```swift\nimport FluidAudio\n\nTask {\n    let manager = try await VadManager(\n        config: VadConfig(defaultThreshold: 0.75)\n    )\n\n    let audioURL = URL(fileURLWithPath: \"path\u002Fto\u002Faudio.wav\")\n    let samples = try AudioConverter().resampleAudioFile(audioURL)\n\n    var segmentation = VadSegmentationConfig.default\n    segmentation.minSpeechDuration = 0.25\n    segmentation.minSilenceDuration = 0.4\n\n    let segments = try await manager.segmentSpeech(samples, config: segmentation)\n    for segment in segments {\n        print(\n            String(format: \"语音 %.2f–%.2fs\", segment.startTime, segment.endTime)\n        )\n    }\n}\n```\n\n### 流式处理\n\n```swift\nimport FluidAudio\n\nTask {\n    let manager = try await VadManager()\n    var state = await manager.makeStreamState()\n\n    for chunk in microphoneChunks {\n        let result = try await manager.processStreamingChunk(\n            chunk,\n            state: state,\n            config: .default,\n            returnSeconds: true,\n            timeResolution: 2\n        )\n\n        state = result.state\n\n        \u002F\u002F 访问原始概率（0.0-1.0）以用于自定义逻辑\n        print(String(format: \"概率: %.3f\", result.probability))\n\n        if let event = result.event {\n            let label = event.kind == .speechStart ? \"开始\" : \"结束\"\n            print(\"\\(label) @ \\(event.time ?? 0)s\")\n        }\n    }\n}\n```\n\n### 命令行界面\n\n从通用的 `process` 命令开始，该命令会端到端地对单个文件运行说话人分离管道（因此也包含 VAD）：\n\n```bash\nswift run fluidaudiocli process path\u002Fto\u002Faudio.wav\n```\n\n当您需要直接试验 VAD 相关参数时，可以使用以下命令：\n\n```bash\n# 检查离线分割结果（默认模式）\nswift run fluidaudiocli vad-analyze path\u002Fto\u002Faudio.wav\n\n# 仅进行流式模拟（默认按秒打印时间戳）\nswift run fluidaudiocli vad-analyze path\u002Fto\u002Faudio.wav --streaming\n\n# 基准测试准确率与精度之间的权衡\nswift run fluidaudiocli vad-benchmark --num-files 50 --threshold 0.3\n```\n\n`swift run fluidaudiocli vad-analyze --help` 列出了所有可调参数，包括负阈值覆盖、最大语音分割、填充以及分块大小。离线模式还会根据模型的每分块处理时间报告 RTFx。\n\n## 文本转语音（TTS）\n\n> **⚠️ 测试版：** TTS 目前仅支持美式英语。计划增加更多语言支持。\n\nFluidAudio 提供两种 TTS 后端：\n\n| | PocketTTS | Kokoro |\n|---|---|---|\n| **GPL 依赖项** | 无 | 无 |\n| **分词器** | SentencePiece | CoreML G2P → IPA 音素 |\n| **生成方式** | 逐帧自回归（每帧 80 毫秒） | 并行（一次性生成所有帧） |\n| **流式传输** | 是 | 否 |\n| **语音克隆** | 是（1–30 秒音频样本） | 否 |\n| **发音控制** | 否 | 是（SSML、自定义词典） |\n| **输出** | 24 kHz 单声道 WAV | 24 kHz 单声道 WAV |\n\n### PocketTTS\n\n适合流式传输的 TTS，支持从短音频样本中克隆语音。\n\n```swift\nimport FluidAudio\n\nTask {\n    let manager = try await PocketTtsManager()\n    let audioData = try await manager.synthesize(\"Hello from FluidAudio.\")\n    try audioData.write(to: URL(fileURLWithPath: \"out.wav\"))\n}\n```\n\n```bash\n# 使用默认语音合成\nswift run fluidaudiocli tts \"Hello from FluidAudio.\" --output out.wav --backend pocket\n\n# 从音频样本克隆语音\nswift run fluidaudiocli tts \"Hello world.\" --output out.wav --backend pocket --clone-voice speaker.wav\n```\n\n### Kokoro\n\n高质量并行 TTS，支持 SSML 和音素级发音控制。对于词汇表中未收录的单词，使用 CoreML G2P（字素到音素）模型——无需外部依赖。\n\n```swift\nimport FluidAudio\n\nTask {\n    let manager = KokoroTtsManager()\n    try await manager.initialize()\n    let data = try await manager.synthesize(text: \"Hello from FluidAudio.\")\n    try data.write(to: URL(fileURLWithPath: \"out.wav\"))\n}\n```\n\n```bash\nswift run fluidaudiocli tts \"Hello from FluidAudio.\" --auto-download --output out.wav\n```\n\n词典和模型资产会缓存在 `~\u002F.cache\u002Ffluidaudio\u002FModels\u002Fkokoro` 下。\n
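\n上面的 `synthesize` 调用返回的是完整的 WAV 音频数据（见上表“输出”一行）。如果不想先写入磁盘，可以直接用 AVFoundation 的 `AVAudioPlayer` 在内存中播放。下面是一个小示意：`AVAudioPlayer` 为系统 API；“返回的数据可直接作为 WAV 容器播放”是基于上表得出的假设：\n\n```swift\nimport AVFoundation\nimport FluidAudio\n\nTask {\n    \u002F\u002F 合成一段语音，得到 WAV 格式的 Data（与上文 PocketTTS 示例一致）\n    let manager = try await PocketTtsManager()\n    let wavData = try await manager.synthesize(\"Hello from FluidAudio.\")\n\n    \u002F\u002F 直接从内存数据构造播放器并播放\n    let player = try AVAudioPlayer(data: wavData)\n    player.prepareToPlay()\n    player.play()\n\n    \u002F\u002F 示例中简单地等待播放结束；实际应用请自行持有 player 的引用\n    try await Task.sleep(nanoseconds: UInt64(player.duration * 1_000_000_000))\n}\n```\n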
FluidAudio.\" --auto-download --output out.wav\n```\n\n词典和模型资产会缓存在 `~\u002F.cache\u002Ffluidaudio\u002FModels\u002Fkokoro` 下。\n\n## 持续集成\n\n- `tests.yml`：默认构建矩阵，涵盖 SwiftPM 测试和 iOS 归档冒烟测试。\n- `diarizer-benchmark.yml`：在 ES2004a 上运行流式说话人分离基准测试，用于跟踪回归。\n- `offline-pipeline.yml`：端到端执行 VBx 离线管道（`fluidaudio diarization-benchmark --mode offline`），如果 DER\u002FJER 超出限制或模型下载失败，则构建失败。可将此工作流作为在您自己的 CI 中预置模型缓存的参考。\n\n## 其他信息\n\n### 常见问题解答\n\n- 命令行界面仅适用于 macOS。对于 iOS，请以编程方式使用库。\n- 模型会在首次使用时自动下载。如果您的网络限制了对 Hugging Face 的访问，请设置 HTTPS 代理：`export https_proxy=http:\u002F\u002F127.0.0.1:7890`。\n- Windows 替代方案正在开发中：[fluid-server](https:\u002F\u002Fgithub.com\u002FFluidInference\u002Ffluid-server)。\n- 如果您想在 Mac 上获取系统音频，可以参考这个仓库 [AudioCap](https:\u002F\u002Fgithub.com\u002Finsidegui\u002FAudioCap\u002Ftree\u002Fmain)。\n\n### 许可证\n\nApache 2.0 — 详情请参阅 `LICENSE` 文件。\n\n### 致谢\n\n本项目基于 [sherpa-onnx](https:\u002F\u002Fgithub.com\u002Fk2-fsa\u002Fsherpa-onnx) 项目的优秀工作，该团队在说话人分离算法和技术方面做出了卓越贡献。\n\nPyannote：\u003Chttps:\u002F\u002Fgithub.com\u002Fpyannote\u002Fpyannote-audio>\n\nWeSpeaker：\u003Chttps:\u002F\u002Fgithub.com\u002Fwenet-e2e\u002Fwespeaker>\n\nParakeet-mlx：\u003Chttps:\u002F\u002Fgithub.com\u002Fsenstella\u002Fparakeet-mlx>\n\nsilero-vad：\u003Chttps:\u002F\u002Fgithub.com\u002Fsnakers4\u002Fsilero-vad>\n\nKokoro-82M：\u003Chttps:\u002F\u002Fhuggingface.co\u002Fhexgrad\u002FKokoro-82M>\n\n### 引用\n\n如果您在工作中使用了 FluidAudio，请引用以下内容：\n\nFluidInference 团队. (2025). FluidAudio：适用于 Apple 平台的本地说话人分离、ASR 和 VAD（版本 0.12.4）[计算机软件]. GitHub. \u003Chttps:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio>\n\n```bibtex\n@software{FluidInferenceTeam_FluidAudio_2025,\n  author = {{FluidInference Team}},\n  title = {{FluidAudio: 本地说话人分离、ASR 和 VAD，适用于 Apple 平台}},\n  year = {2025},\n  month = {3},\n  version = {0.12.4},\n  url = {https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio},\n  note = {计算机软件}\n}\n```\n\n---\n\n## 表达您的支持\n\n通过在您的项目中添加“由 Fluid Inference 提供支持”徽章，帮助 Fluid Inference 社区发展壮大吧！\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Ffluidinference.com\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_84a39d3f1e87.png\" alt=\"由 Fluid Inference 提供支持\" height=\"80\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n将以下提示复制并粘贴到您托管主页的代码代理中：\n\n```text\n在页脚中添加一个居中的“由 Fluid Inference 提供支持”徽章，链接至 fluidinference.com。图片地址：https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_84a39d3f1e87.png。请在您的框架配置中将该图片的主机名加入白名单。\n```\n\n或者使用以下代码片段之一：\n\n\u003Cdetails>\n\u003Csummary>React\u002FNext.js\u003C\u002Fsummary>\n\n```jsx\n\u003Cdiv className=\"flex justify-center py-8\">\n  \u003Ca href=\"https:\u002F\u002Ffluidinference.com\">\n    \u003Cimg\n      src=\"https.\u002F\u002Fassets.inference.plus\u002Ffi-badge.png\"\n      alt=\"由 Fluid Inference 提供支持\"\n      height={80}\n    \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>HTML\u003C\u002Fsummary>\n\n```html\n\u003Cdiv style=\"text-align: center; padding: 20px;\">\n  \u003Ca href=\"https:\u002F\u002Ffluidinference.com\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_84a39d3f1e87.png\" alt=\"由 Fluid Inference 提供支持\" height=\"80\">\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Markdown\u003C\u002Fsummary>\n\n```markdown\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Ffluidinference.com\">\n 
   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_readme_84a39d3f1e87.png\" alt=\"由 Fluid Inference 提供支持\" height=\"80\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n```\n\n\u003C\u002Fdetails>","# FluidAudio 快速上手指南\n\nFluidAudio 是一个专为 Apple 设备（macOS\u002FiOS）设计的 Swift SDK，支持完全本地化、低延迟的音频 AI 处理。它利用 Apple Neural Engine (ANE) 进行推理加速，提供语音识别 (ASR)、说话人分离 (Diarization)、语音活动检测 (VAD) 和文本转语音 (TTS) 等功能。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**：macOS 14.0+ 或 iOS 17.0+\n*   **开发工具**：Xcode 15.0+ (包含 Swift 6.0+)\n*   **硬件要求**：配备 Apple Silicon 芯片 (M1\u002FM2\u002FM3 等) 的设备，以利用 ANE 加速。\n*   **依赖管理**：推荐使用 Swift Package Manager (SPM)。\n\n> **注意**：本项目主要针对 Apple 生态优化，不支持 Windows 或 Linux 服务器环境。目前暂无官方国内镜像源，若访问 Hugging Face 下载模型受阻，请自行配置网络代理或使用国内 Hugging Face 镜像站。\n\n## 安装步骤\n\n### 1. 通过 Swift Package Manager 集成\n\n在 Xcode 项目中添加 FluidAudio：\n\n1.  打开 Xcode 项目，点击菜单栏 `File` > `Add Package Dependencies...`。\n2.  输入仓库地址：\n    ```text\n    https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\n    ```\n3.  选择版本规则（建议 `Up to Next Major Version`），点击 `Add Package`。\n\n或者，直接在 `Package.swift` 文件中添加依赖：\n\n```swift\ndependencies: [\n    .package(url: \"https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\", from: \"1.0.0\")\n]\n```\n\n然后在目标依赖中添加 `\"FluidAudio\"`。\n\n### 2. 下载模型\n\nFluidAudio 的模型托管在 Hugging Face 上。首次运行时，SDK 通常会自动下载所需模型到本地缓存。如需手动预下载或自定义模型路径，请访问 [FluidInference Hugging Face Organization](https:\u002F\u002Fhuggingface.co\u002FFluidInference)。\n\n常用模型包括：\n*   **ASR**: `parakeet-tdt-v3` (批量转录), `parakeet-eou` (流式转录)\n*   **TTS**: `kokoro`, `PocketTTS`\n*   **Diarization**: `pyannote`, `sortformer`\n\n## 基本使用\n\n以下示例展示如何使用 FluidAudio 进行最简单的**语音转文字 (ASR)** 和 **说话人分离**。\n\n### 1. 导入模块\n\n```swift\nimport FluidAudio\nimport AVFoundation\n```\n\n### 2. 初始化引擎\n\n配置并启动音频处理引擎。默认情况下，FluidAudio 会自动尝试将计算任务卸载到 ANE。\n\n```swift\n\u002F\u002F 创建配置\nvar config = FluidAudioConfig()\nconfig.device = .ane \u002F\u002F 强制使用 Apple Neural Engine\n\n\u002F\u002F 初始化引擎\nlet engine = try await FluidAudioEngine(configuration: config)\n```\n\n### 3. 执行批量语音转录 (ASR)\n\n假设你有一个本地的音频文件 URL：\n\n```swift\nlet audioURL = URL(fileURLWithPath: \"\u002Fpath\u002Fto\u002Fyour\u002Faudio.m4a\")\n\ndo {\n    \u002F\u002F 执行转录\n    let result = try await engine.transcribe(\n        url: audioURL,\n        model: .parakeetTDTv3 \u002F\u002F 指定使用 Parakeet TDT v3 模型\n    )\n    \n    \u002F\u002F 输出结果\n    print(\"转录文本：\\(result.text)\")\n    \n    \u002F\u002F 如果启用了说话人分离，可以遍历片段\n    for segment in result.segments {\n        print(\"说话人 \\(segment.speakerID ?? \"未知\"): \\(segment.text)\")\n    }\n} catch {\n    print(\"转录失败：\\(error)\")\n}\n```\n\n### 4. 流式转录 (实时监听)\n\n对于实时麦克风输入，可以使用流式管道：\n\n```swift\n\u002F\u002F 设置流式处理器\nlet streamer = engine.createStreamingTranscriber(model: .parakeetEOU)\n\n\u002F\u002F 开始监听 (需配合 AVAudioEngine 或类似音频输入源)\ntry streamer.start()\n\n\u002F\u002F 设置回调处理实时结果\nstreamer.onResult = { partialResult in\n    print(\"实时识别：\\(partialResult.text) [结束标记：\\(partialResult.isFinal)]\")\n}\n\n\u002F\u002F 停止监听\n\u002F\u002F try streamer.stop()\n```\n\n### 5. 
文本转语音 (TTS)\n\n使用 Kokoro 模型生成语音（注意：按照上游 README，TTS 测试版目前仅支持美式英语）：\n\n```swift\nlet text = \"Hello from FluidAudio.\"  \u002F\u002F TTS 测试版目前仅支持美式英语\nlet outputURL = URL(fileURLWithPath: \"\u002Ftmp\u002Foutput.wav\")\n\ndo {\n    try await engine.synthesize(\n        text: text,\n        outputURL: outputURL,\n        model: .kokoro,\n        language: .en \u002F\u002F 指定语言（当前仅支持英语）\n    )\n    print(\"语音生成完毕：\\(outputURL.path)\")\n} catch {\n    print(\"TTS 失败：\\(error)\")\n}\n```\n\n---\n**提示**：更多高级功能（如自定义说话人嵌入提取、SSML 控制、逆文本标准化 ITN）请参考官方文档 [docs.fluidinference.com](https:\u002F\u002Fdocs.fluidinference.com\u002Fintroduction)。","一位 iOS 开发者正在构建一款面向记者的离线会议记录应用，需要在设备上实时将多人对话转为文字并区分发言人。\n\n### 没有 FluidAudio 时\n- **依赖云端导致延迟高**：音频必须上传至服务器处理，网络波动会导致转写结果严重滞后，无法实现“边说边记”。\n- **隐私合规风险大**：敏感的会议录音需传出设备，难以满足企业对数据不出域的严格隐私要求。\n- **电量与发热失控**：传统本地模型占用大量 CPU\u002FGPU 资源，导致手机迅速发烫且电量在半小时会议中耗尽。\n- **开发集成复杂**：自行移植和优化开源语音模型（如 Whisper）到 CoreML 耗时数周，且难以实现说话人分离功能。\n\n### 使用 FluidAudio 后\n- **端侧实时响应**：利用苹果神经网络引擎（ANE）加速，实现毫秒级低延迟流式转写，说完即现文字。\n- **数据完全本地化**：所有语音识别、说话人日志（Diarization）均在设备本地完成，彻底消除隐私泄露隐患。\n- **极致能效表现**：推理任务卸载至 ANE，几乎不占用主处理器资源，长时间录音仅消耗微量电量且机身清凉。\n- **快速落地多语种**：通过几行 Swift 代码即可集成支持中、日、英等 25 种语言的 SOTA 模型及说话人区分功能。\n\nFluidAudio 让开发者能以极低的代码成本，在 Apple 设备上构建出兼具隐私安全、实时响应与超长续航的专业级音频 AI 应用。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFluidInference_FluidAudio_cf983dce.png","FluidInference","Fluid Inference","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FFluidInference_cf07ed57.png","intelligence everywhere ",null,"dev@fluidinference.com","https:\u002F\u002Ffluidinference.com","https:\u002F\u002Fgithub.com\u002FFluidInference",[81,85,89,93,97,101],{"name":82,"color":83,"percentage":84},"Swift","#F05138",95.9,{"name":86,"color":87,"percentage":88},"C++","#f34b7d",1.7,{"name":90,"color":91,"percentage":92},"Python","#3572A5",1.2,{"name":94,"color":95,"percentage":96},"Shell","#89e051",1,{"name":98,"color":99,"percentage":100},"Ruby","#701516",0.1,{"name":102,"color":103,"percentage":100},"C","#555555",1841,248,"2026-04-08T20:00:04","Apache-2.0","macOS, iOS","不需要独立 GPU 或 CUDA。模型专门优化以在 Apple Neural Engine (ANE) 上运行，完全避免使用 GPU\u002FMPS，以降低功耗并减少内存占用。","未说明（强调低内存占用，适合后台处理和始终在线的工作负载）",{"notes":112,"python":113,"dependencies":114},"该工具是专为苹果生态设计的 Swift SDK，不支持 Linux 或 Windows。其核心优势是利用 Apple Neural Engine (ANE) 进行本地推理，无需 NVIDIA GPU 或 CUDA 环境。支持的功能包括语音识别 (ASR)、文本转语音 (TTS)、说话人日志和语音活动检测 (VAD)。若需在非 Swift 环境（如 Python）中使用，需参考第三方集成示例（如 Senko），但原生支持仅限于 macOS 和 iOS。","不适用（主要提供 Swift SDK；虽提及可通过 Senko 等工具在 Python 中集成，但未指定具体 Python 版本要求）",[115,116,117,118],"Swift 6.0+","Xcode (用于 macOS\u002FiOS 开发)","CoreML","Apple Neural Engine (ANE)",[14,120],"音频",[122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140],"coreml","ios","macos","speaker-diarization","speaker-embedding","speaker-identification","speaker-recognition","swift","audio","avfoundation","real-time","vad","voice-activity-detection","asr","automatic-speech-recognition","speech-to-text","parakeet","ane","nvidia","2026-03-27T02:49:30.150509","2026-04-11T18:30:59.878697",[144,149,154,159,164,168,173],{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},26422,"在 Xcode 26.4 (Swift 6) 下编译时遇到 'sending risks causing data races' 并发错误怎么办？","这是由于 Swift 6 更严格的并发检查导致的。如果无法升级到 FluidAudio 0.12.5+（例如受限于 WhisperKit 对 swift-transformers 版本的依赖），可以在 0.12.4 版本中手动修复。找到 `Sources\u002FFluidAudio\u002FASR\u002FAsrManager.swift` 文件，将 `AsrManager` 类声明修改为遵循 `@unchecked Sendable` 协议：\n\npublic final class AsrManager: @unchecked Sendable {\n\n这样做是安全的，因为 `AsrManager` 仅在 `StreamingAsrManager` 的 Actor
隔离域内被访问。","https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fissues\u002F448",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},26423,"如何为本地音频任务选择合适的 Ollama 模型？","根据任务类型推荐以下模型配置：\n1. 通用任务：使用 llama3.1:8b。\n2. 代码相关任务：使用 codellama 或 deepseek-coder-v2（特别适合 Swift\u002FRust 互操作代码生成）。\n3. 音频辅助工作（如推理转录输出或提取音频元数据）：qwen2.5:7b 在结构化提取方面表现惊人。\n\n7B-8B 参数量级是在 Mac 上保持快速推理同时获得有用输出的最佳平衡点。","https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fissues\u002F415",{"id":155,"question_zh":156,"answer_zh":157,"source_url":158},26424,"FluidAudio 是否支持 Kokoro-82M-v1.1-zh 中文语音合成模型？","目前 FluidAudio 主要构建用于 CoreML，而 Kokoro 等 TTS 模型通常大量使用 GPU 运算而非 ANE，导致 CoreML 支持不佳且可能损失质量。\n\n建议方案：该模型已在 `mlx-audio` 项目中得到支持（基于 MLX 框架），MLX 能在不损失质量的情况下维持性能。请前往 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-audio\u002Fpull\u002F341 查看相关合并请求，使用 mlx-audio 可能是更合适的选择。","https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fissues\u002F163",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},26425,"如何实现流式说话人日志（Streaming Speaker Diarization）以减少内存复制？","FluidAudio 已通过 PR #81 支持通用的 `RandomAccessCollection` 进行说话人日志处理。现在 Diarization Manager 接受 `ArraySlice`、`ContiguousArray` 和其他集合类型，并支持零拷贝内存操作。\n\n对于需要频繁实时更新的应用（如滑动缓冲区），您可以直接传入 `ArraySlice\u003CFloat>` 而不是完整的数组，从而避免在处理重叠片段时产生多余的内存复制。","https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fissues\u002F33",{"id":165,"question_zh":166,"answer_zh":167,"source_url":153},26426,"如何在 FluidAudio 中精细控制语音识别中的停顿和间隙处理？","虽然 Apple 的 DictationTranscriber 引擎配置较为封闭，但可以通过配置四个关键的时间阈值（单位为秒）来逐步处理不同阶段的间隙，以获得最大灵活性：\n\nfragmentIgnoreGap \u003C restartGap \u003C continueGap \u003C idleFlushDelay\n\n这些参数分别对应：忽略片段间隙、重启决策、继续决策和空闲刷新延迟。目前的调优主要基于启发式方法，针对不同录音环境和语言可能需要自动化调整。",{"id":169,"question_zh":170,"answer_zh":171,"source_url":172},26427,"将基于 FluidAudio 的 App 提交到 App Store 时遇到 ESpeakNG 签名错误（ITMS-90238\u002FITMS-90385）如何解决？","该错误通常是因为嵌套的 ESpeakNG.framework 签名资源不匹配或使用了过时的 v1 签名。\n\n主要问题包括：\n1. 代码没有资源但签名指示必须存在资源。\n2. 
Bundle 标识符 'com.kokoro.espeakng' 缺少 macOS 10.9 及以后版本所需的 v2 签名。\n\n解决方案：确保在 macOS 10.9 或更高版本上使用 Xcode 重新签署代码，生成 v2 签名。检查 Framework 是否正确包含资源文件或调整签名脚本以匹配实际内容。","https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fissues\u002F159",{"id":174,"question_zh":175,"answer_zh":176,"source_url":148},26428,"在将新模型（如 Cohere）转换为 CoreML 之前需要注意什么？","在投入时间进行模型转换之前，关键问题是确认模型架构是否与 CoreML 转换兼容。\n\n某些在 GPU 上表现良好的注意力机制模式可能无法干净地转换为 CoreML。建议先检查是否有人已经完成了该模型的 ONNX 或 CoreML 导出工作，以避免无效的转换尝试。",[178,183,188,193,198,203,208,213,218,223,228,233,238,243,248,253,258,263,268,273],{"id":179,"version":180,"summary_zh":181,"released_at":182},171658,"v0.13.6","## v0.13.6 新增内容\n\n### 功能\n- 增加日语 ASR 支持，使用 JSUT 和 Common Voice 数据集 (#478)\n- 为离线说话人分离流水线添加可选的嵌入跳过策略 (#480)\n- 为 Kokoro 模型添加可配置的 computeUnits 参数 (#482)\n  - 解决 iOS 26 ANE 编译器回归问题\n  - 在需要时可通过 `.cpuAndGPU` 跳过 ANE 使用\n  - 向后兼容（默认仍为 `.all`）\n\n### 改进\n- 在用户主动取消时跳过错误恢复 (#481)\n\n### 文档\n- 在 README 中添加 Parakeet EOU 超低延迟演示视频\n\n---\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.13.5...v0.13.6","2026-04-04T17:48:00",{"id":184,"version":185,"summary_zh":186,"released_at":187},171659,"v0.13.5","## v0.13.5 新增内容\n\n### 功能\n- 添加实验性 CTC 中文普通话 ASR（在 THCHS-30 数据集上 CER 为 8.23%）(#476)\n- 添加带有持久化 KV 缓存的 PocketTTS 会话 (#471)\n- 添加用于标点符号感知流式 ASR 的 PunctuationCommitLayer (#466)\n\n### 改进\n- 重构 TDT 解码器：提取可重用组件 (#474)\n- ASR 架构清理：命名规范、移除死代码、优化文件组织 (#468)\n- 澄清自定义词汇表模型的兼容性及方案选择 (#469)\n\n### 错误修复\n- 修复 SlidingWindowAsrManager 中的 Swift 6 并发错误 (#472, #476)\n- 修复麦克风和系统转录同时运行时的使用后释放问题 (#473)\n- 修复 levenshteinDistance 在空数组情况下出现的致命错误 (#476)\n\n### 文档\n- 修复 ASR 文档中的过时引用 (#462)\n- 更新文档索引，移除 espeak-ng 许可证 (#461)\n- 清理 CI 工作流 (#463, #464)\n\n---\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.13.4...v0.13.5\n\n**注意**：CTC 中文普通话功能目前处于实验阶段。API 可能在未来的版本中发生变化。","2026-04-03T03:25:59",{"id":189,"version":190,"summary_zh":191,"released_at":192},171660,"v0.13.4","## 自 v0.13.3 以来的变更\n\n- 添加用于自定义词汇表的独立 CTC 头部 (#435, #450)\n- 使 parakeetTdtCtc110m 的文件夹名称与其他 Parakeet 模型保持一致 (#453)\n- 将 swift-transformers 替换为极简 BPE 分词器 (#449)\n- 在所有基准测试工作流中添加 RTFx 跟踪和验证 (#458)","2026-03-29T03:45:41",{"id":194,"version":195,"summary_zh":196,"released_at":197},171661,"v0.13.2.6","## 变更内容\n\n### 文档\n- 新增了全面的 ASR 目录结构文档，详细说明了旧布局与新布局的区别、SlidingWindow 与 Streaming 的区分，以及设计决策。\n- 为 PR #440 的回归测试添加了主分支基线基准测试结果（在 M2 16GB 上测试了 6 个模型）。\n- 添加了 PR 分支的基准测试结果，显示所有 6 个模型均未出现性能下降：\n  - TDT v3：2.6% WER（保持不变）\n  - TDT v2：3.8% WER（保持不变）\n  - CTC-TDT 110M：3.6% WER（保持不变）\n  - EOU 320ms：7.11% WER（保持不变）\n  - Nemotron 1120ms：1.99% WER（保持不变）\n  - CTC Earnings：16.51% WER（与 16.54% 的噪声水平相当）\n\n### 重构\n- 在 TdtDecoderV3 中去除了解码器投影归一化的重复代码，将 prepareDecoderProjection 和 populatePreparedDecoderProjection 合并为一个 normalizeDecoderProjection 方法（行为无变化，2.6% WER 保持不变）。\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.13.2.5...v0.13.2.6\n\n","2026-03-28T06:02:16",{"id":199,"version":200,"summary_zh":201,"released_at":202},171662,"v0.13.2.5","## 变更内容\n\n### 目录结构重构\n- 按模型家族重新组织 ASR 目录（Parakeet\u002F、Qwen3\u002F）\n- 将 Streaming\u002F 目录拆分为 EOU\u002F 和 Nemotron\u002F 子目录\n- 移除 Parakeet\u002FShared\u002F 子目录，将文件移至 Parakeet\u002F 根目录\n- 添加 StreamingAsrEngine 协议，用于统一的流式接口\n\n### 错误修复\n- 通过将梅尔谱填充至预期帧数，修复了 EOU 对于短音频的形状不匹配问题（#444）\n- 修正 EOU 分块样本数，使其与 computeFlat 帧公式一致\n- 修复分块大小初始化中的竞态条件\n- 修复 Kokoro v2 的源噪声数据类型和分布问题（#447）\n\n### 依赖更新\n- 将 swift-transformers 从 1.2.0 更新至 1.3.0（依赖项从 28 个减少至 11 个）（#439）\n\n### 已移除\n- 移除不支持的 
Nemotron 80ms\u002F160ms 流式变体\n- 将 KittenTTS 和 Qwen3-TTS 标记为不支持（#437）\n\n### 文档更新\n- 更新文档中的文件路径，以匹配新的 ASR 结构\n- 新增 MimicScribe 展示案例（#446）\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.13.2...v0.13.2.5","2026-03-28T03:40:57",{"id":204,"version":205,"summary_zh":206,"released_at":207},171663,"v0.13.2","## 自动语音识别\n\n- **Parakeet-TDT-CTC-110M 混合模型** (#433) - 融合的预处理模块+编码器，命令行参数：`--model-version tdt-ctc-110m`。关闭 #383\n- **带有 ARPA 语言模型的 CTC 解码器** (#436) - 贪心搜索\u002F束搜索解码，在领域专用语言模型下 WER 为 9.4%。关闭 #384\n\n## 文本转语音\n\n- **修复 Kokoro TTS 归档构建失败问题** (#426) - 将 Float16.bitPattern 替换为 vImage 转换。关闭 #423\n\n---\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.13.1...v0.13.2","2026-03-26T21:40:15",{"id":209,"version":210,"summary_zh":211,"released_at":212},171664,"v0.13.1","## 自动语音识别\n\n- **Nemotron 语音流式处理 0.6B** (#432) - 带 vDSP 优化的流式 ASR，字错率 2.12%，实时性提升 6.4 倍。关闭 #389\n\n## 说话人分离\n\n- **时间线同步与 LS-EEND 最终化** (#421) - 统一了离线\u002F流式处理的最终化流程，并改进了 Sortformer 的刷新行为\n\n---\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.13.0...v0.13.1","2026-03-26T19:39:43",{"id":214,"version":215,"summary_zh":216,"released_at":217},171665,"v0.13.0","**完整更新日志**: https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.12.6...v0.13.0","2026-03-26T19:35:01",{"id":219,"version":220,"summary_zh":221,"released_at":222},171666,"v0.12.6","## 变更内容\n\n### Swift 6 并发安全性 🔒\n- 将 `AsrManager` 转换为 actor 类型 (#419)，以实现正确的 Swift 6 并发安全性\n- 移除 `StreamingAsrManager` 中的 `nonisolated(unsafe)` 临时解决方案\n- 在所有调用 `AsrManager` 的地方添加适当的 `await`\n- 修复了在 Xcode 16.4 RC 更严格的并发检查下出现的数据竞争警告\n\n### 性能优化 ⚡\n- **Qwen3 ASR ANE 优化** (#410)：将音频编码器转换为 fp16 格式，以利用 Apple Neural Engine\n- **Kokoro TTS ANE 优化** (#411)：将模型转换为 fp16 格式，以提升 Neural Engine 的性能\n\n### 错误修复 🐛\n- **修复 iOS 上 `KokoroTtsManager.initialize()` 卡死问题** (#418)：解决了 iOS 设备上的初始化死锁问题\n- **修复 Kokoro TTS 缺少 `source_noise` 输入的问题** (#412)：修正了模型输入配置\n\n### 破坏性变更 ⚠️\n\n现在对外部调用 `AsrManager` 方法必须使用 `await`：\n\n```swift\n\u002F\u002F 修改前\nlet manager = AsrManager()\nmanager.cleanup()\n\n\u002F\u002F 修改后\nlet manager = AsrManager()\nawait manager.cleanup()\n```\n\n### 测试\n\n✅ 所有 CI 测试均已通过（共 13 项测试，无失败）\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.12.5...v0.12.6","2026-03-24T21:31:59",{"id":224,"version":225,"summary_zh":226,"released_at":227},171667,"v0.12.5","## 新增功能\n\n- **LS-EEND 说话人分离器** (#376)：端到端流式说话人分离，支持最多 10 名说话人\n  - 针对不同场景优化的五种变体（AMI、CALLHOME、DIHARD2\u002F3、VoxConverse）\n  - 100 毫秒帧更新，提供 900 毫秒暂定预览\n  - 统一的 `DiarizerTimeline` API，与 Sortformer 共享\n  - CLI 命令：`lseend` 和 `lseend-benchmark`\n  - 完整文档位于 `Documentation\u002FDiarization\u002FLSEEND.md`\n\n- **Parakeet EOU 1280ms** (#388)：新增对 1280 毫秒流式分块大小的支持\n\n## API 改进\n\n- **DiarizerTimeline** (#402)：使说话人属性公开可变，便于自定义说话人管理\n- **TDT 解码器** (#382)：填充 tokenDurations，以获得准确的词结束时间\n\n## 修复\n\n- **G2P 多语言** (#400)：修复多语言路径解析问题\n- **EmbeddingExtractor** (#398)：限制 numMasksInChunk 的值，防止堆缓冲区溢出\n- **Swift Transformers** (#378)：将最低版本提升至 1.2.0（修复尾随逗号问题）\n\n## 文档\n\n- **LS-EEND 与 Sortformer** (#397)：添加集成测试中的注册反馈\n- **模型转换指南** (#391、#392)：新增包含现有基准数据集的指南\n- **PocketTTS 架构** (#380)：添加流水线架构注释\n- **展示内容更新**：新增 Audite (#396)、Hitoku Draft (#385)，并更新 OpenOats (#399)\n- **Hugging Face 徽章**：更新为超过 80 万次下载","2026-03-21T01:29:21",{"id":229,"version":230,"summary_zh":231,"released_at":232},171668,"v0.12.4","## What's New\n- **PocketTTS Streaming API** (#369): Added streaming synthesis support for PocketTTS\n- **All 21 PocketTTS Voices** (#375): Support for all upstream PocketTTS voices with on-demand download from HuggingFace\n- **Multilingual G2P Model** (#367): Added grapheme-to-phoneme functionality supporting French, German, Spanish, Japanese, Chinese, Hindi with a FLEURS benchmark CLI command\n\n## Fixes\n- Fixed SDK guard for `MLMultiArrayDataType.int8` to use proper availability checks (#364)\n- Corrected licensing documentation from MIT to Apache 2.0 in PocketTTS (#361)\n\n## Docs\n- Updated README highlights with TTS and streaming ASR info (#374)\n- Added Kokoro English voice quality report (#373)\n- Fixed CLI name references to fluidaudiocli (#372)\n- Fixed stale references, versions, and missing CLI commands (#370)\n- Enhanced text-processing-rs description to reflect multilingual capabilities (#365)\n- Updated Kokoro TTS documentation, clarifying it remains actively maintained (#360)\n- Added Meeting Transcriber to project showcase (#362)\n- Added Enconvo to project showcase (#359)","2026-03-15T01:07:15",{"id":234,"version":235,"summary_zh":236,"released_at":237},171669,"v0.12.3","## What's New\n\n- **CoreML G2P Model for TTS** (#350): Replace eSpeak with a CoreML grapheme-to-phoneme model, rename product to FluidAudioTTS\n- **Speaker Pre-Enrollment APIs** (#355): Add `extractSpeakerEmbedding(from:)` and `primeWithAudio(_:)` for priming diarizers with known speaker audio\n- **Download Progress Callbacks** (#354): Byte-level progress reporting for model downloads\n\n## Fixes\n\n- Fix `CustomVocabularyContext.minSimilarity` not being respected in rescoring (#349)\n- Fix iOS build and CI benchmark failures (#353)\n- Fix release build data race and currency number spelling (#352)\n\n## Other\n\n- Remove ESpeakNG framework and update docs (#351)\n- Update README with current versions and product names (#348)\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.12.2...v0.12.3","2026-03-08T08:31:29",{"id":239,"version":240,"summary_zh":241,"released_at":242},171670,"v0.12.2","## What's New\n\n- **Qwen3-ASR int8 Support** (#312): Add int8 quantized variant for Qwen3-ASR, reducing model size\n- **ITN Post-Processor** (#308): Inverse text normalization with NLTagger context spotting\n- **clearAllModelCaches()** (#309): New API to clear all cached models\n- **Directory Override** (#327): Added directory override for model storage\n- **Task Cancellation** (#340): Add Task cancellation support to ASR transcription\n\n## Fixes\n\n- Fix ASR MLMultiArray cache lifecycle in chunked transcription (#321)\n- Fix Float16 build failure in Qwen3-ASR for Archive builds (#305)\n- Replace force-unwraps with guard in tdtDecodeWithTimings (#324)\n- Add type annotation to withUnsafeBytes for Swift 6.2 compatibility (#341)\n- Use 2-model pipeline for Qwen3-ASR download validation (#307)\n\n## Other\n\n- Rename FluidAudioTTS to FluidAudioEspeak (#302)\n- Add PocketTTS smoke test CI workflow (#313)\n- Showcase additions: Flowstay, Snaply, macos-speech-server\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.12.1...v0.12.2","2026-03-06T01:13:59",{"id":244,"version":245,"summary_zh":246,"released_at":247},171671,"v0.12.1","## What's New\n\n- **Qwen3-ASR Beta** (#281): Add Qwen3-ASR-0.6B CoreML speech recognition with multilingual support\n- **PocketTTS GPL Separation** (#301): Move PocketTTS to core FluidAudio, now available without GPL dependencies for closed-source apps\n- **Voice Cloning for PocketTTS** (#289): Clone any voice from 1-30 second audio samples using Mimi encoder\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.12.0...v0.12.1","2026-02-12T02:09:22",{"id":249,"version":250,"summary_zh":251,"released_at":252},171672,"v0.12.0","## New TTS model converted from [Pocket TTS](https:\u002F\u002Fhuggingface.co\u002FFluidInference\u002Fpocket-tts-coreml)\r\n- simpler chunking algorithm while supporting longer token counts\r\n- EOS detection stops generation naturally instead of hitting a fixed token wall\r\n- real-time support\r\n- no espeak dependency\r\n- iOS RAM friendly\r\n\r\n## Bug Fixes\r\n\r\n- #282 — Custom vocabulary rescoring now applies to chunked long audio (was silently skipped on >15s audio) — thanks @Beingpax\r\n\r\n## Note\r\n- we plan to deprecate Kokoro TTS in a future release","2026-02-03T05:09:28",{"id":254,"version":255,"summary_zh":256,"released_at":257},171673,"v0.11.0","## What’s New\r\n- Custom Vocabulary Support (#251): Major feature for custom vocabulary with recognized domain terms\r\n- Vocabulary Pipeline Refactor (#276): Restructure, dead-code removal, and pure Swift dataset download\r\n- Float16 Xcode GUI Build Failure (#270): Add architecture checks to avoid build breaks, by @schmatz\r\n- Documentation Refresh (#280): Reorganize docs and remove stale content\r\n- README Updates: Link and description cleanup\r\n\r\nFull Changelog: https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.10.1...v0.11.0","2026-01-31T02:40:05",{"id":259,"version":260,"summary_zh":261,"released_at":262},171674,"v0.10.1","## What's New\r\n\r\n- **Streaming Audio Processing** (#257): Memory-efficient transcription for large files — 99.5% reduction (230MB → ~1.2MB constant)\r\n- **TTS De-esser** (#267): Reduces harsh sibilant sounds in Kokoro TTS output, on by default\r\n\r\n## Bug Fixes\r\n\r\n- **macOS 26 Sortformer compatibility** (#266): Switch to V2 models to fix BNNS compiler error\r\n- **Chunk boundary transcription loss** (#264): Fix speech truncation by prepending mel context to non-first chunks\r\n- **Legacy FileHandle.write** (#262): Replace deprecated API with throwing `write(contentsOf:)`\r\n\r\n## New Contributors\r\n\r\n- @starkdmi — chunk boundary transcription fix (#264)\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.10.0...v0.10.1","2026-01-28T19:07:45",{"id":264,"version":265,"summary_zh":266,"released_at":267},171675,"v0.10.0","## Sortformer: Real-Time Speaker Diarization\r\n\r\nCoreML version of NVIDIA's [Sortformer](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fdiar_streaming_sortformer_4spk-v2)\r\n\r\n- **Real-time streaming** - Speaker labels as audio comes in\r\n- **Noisy environment support** - Works where traditional pipelines fail\r\n- **Overlapping speech** - Scores all 4 speakers independently per frame, multiple can be active simultaneously\r\n- **Single neural model** - No complex pipeline, just one model\r\n\r\n---\r\n\r\nCredit to @SGD2718 for the Sortformer implementation & model conversion.","2026-01-12T01:10:04",{"id":269,"version":270,"summary_zh":271,"released_at":272},171676,"v0.9.1","## What's Changed\n\n### Bug Fixes\n- fix: Swift 6 Sendable errors with macOS 26.2 SDK (#245) by @tacshi\n- fix: Swift 6 concurrency errors in audio conversion (#239) by @Alex-Wengg\n- fix: rename CLI executable to fluidaudiocli to avoid Xcode name collision by @Alex-Wengg\n- fix(diarizer): use K-Means centroids when speaker count constraint is applied (#236) by @beshkenadze\n- Preventing loops with non-blank tokens (#244) by @Steven-Weng\n\n## New Contributors\n- @tacshi made their first contribution in #245\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.9.0...v0.9.1","2026-01-03T19:09:20",{"id":274,"version":275,"summary_zh":276,"released_at":277},171677,"v0.9.0","## What's New\r\n\r\n### Swift 6 Support\r\n- Full Swift 6 compatibility\r\n- Updated swift-tools-version\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FFluidInference\u002FFluidAudio\u002Fcompare\u002Fv0.8.2...v0.9.0","2025-12-31T20:20:59"]