[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-argmaxinc--WhisperKit":3,"tool-argmaxinc--WhisperKit":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",145895,2,"2026-04-08T11:32:59",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108111,"2026-04-08T11:23:26",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 
Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":72,"owner_website":78,"owner_url":79,"languages":80,"stars":97,"forks":98,"last_commit_at":99,"license":100,"difficulty_score":101,"env_os":102,"env_gpu":103,"env_ram":104,"env_deps":105,"category_tags":113,"github_topics":115,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":125,"updated_at":126,"faqs":127,"releases":157},5564,"argmaxinc\u002FWhisperKit","WhisperKit","On-device Speech Recognition for Apple Silicon","WhisperKit 是由 Argmax 推出的开源框架，专为在 Apple Silicon 设备（如 iPhone、iPad 和 Mac）上高效运行顶级语音识别模型（如 Whisper）而设计。它成功解决了传统语音转文字方案依赖云端服务器、存在延迟高、隐私泄露风险以及联网成本高等痛点，让高质量的语音识别完全在本地离线完成。\n\n这款工具非常适合 iOS\u002FmacOS 开发者、人工智能研究人员以及注重数据隐私的应用构建者使用。通过 Swift 包管理器即可轻松集成，开发者能快速为应用添加实时流式转录、精确的字词时间戳、语音活动检测甚至说话人区分等高级功能。\n\nWhisperKit 的核心亮点在于其深度的端侧优化，充分利用 Apple 神经引擎加速推理，在保障低延迟的同时维持了极高的识别准确率。除了基础的语音转文字，它还支持构建本地服务器以便非 Swift 项目调用，并提供了配套的 TTSKit 用于文本转语音。无论是想快速验证原型的开发者，还是希望在不牺牲用户隐私前提下提升交互体验的产品团队，WhisperKit 都是一个轻量且强大的入门选择。","\n\u003Cdiv align=\"center\">\n  \n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit#gh-light-mode-only\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargmaxinc_WhisperKit_readme_e6f80a3c47a4.png\" alt=\"WhisperKit\" width=\"20%\" \u002F>\n\u003C\u002Fa>\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit#gh-dark-mode-only\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargmaxinc_WhisperKit_readme_bee3f9f65994.png\" alt=\"WhisperKit\" width=\"20%\" \u002F>\n\u003C\u002Fa>\n\n# WhisperKit\n\n[![Tests](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkit\u002Factions\u002Fworkflows\u002Frelease-tests.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkit\u002Factions\u002Fworkflows\u002Frelease-tests.yml)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fargmaxinc\u002Fwhisperkit?logo=github&logoColor=969da4&label=License&labelColor=353a41&color=32d058)](LICENSE.md)\n[![Supported Swift 
Version](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https%3A%2F%2Fswiftpackageindex.com%2Fapi%2Fpackages%2Fargmaxinc%2FWhisperKit%2Fbadge%3Ftype%3Dswift-versions&labelColor=353a41&color=32d058)](https:\u002F\u002Fswiftpackageindex.com\u002Fargmaxinc\u002FWhisperKit) [![Supported Platforms](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https%3A%2F%2Fswiftpackageindex.com%2Fapi%2Fpackages%2Fargmaxinc%2FWhisperKit%2Fbadge%3Ftype%3Dplatforms&labelColor=353a41&color=32d058)](https:\u002F\u002Fswiftpackageindex.com\u002Fargmaxinc\u002FWhisperKit)\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fdiscord\u002F1171912382512115722?style=flat&logo=discord&logoColor=969da4&label=Discord&labelColor=353a41&color=32d058&link=https%3A%2F%2Fdiscord.gg%2FG5F5GZGecC)](https:\u002F\u002Fdiscord.gg\u002FG5F5GZGecC)\n\n\n\u003C\u002Fdiv>\n\nWhisperKit is an [Argmax](https:\u002F\u002Fwww.takeargmax.com) framework for deploying state-of-the-art speech-to-text systems (e.g. [Whisper](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper)) on device with advanced features such as real-time streaming, word timestamps, voice activity detection, speaker diarization, and more.\n\n[[TestFlight Demo App]](https:\u002F\u002Ftestflight.apple.com\u002Fjoin\u002FQ1cywTJw) [[Python Tools]](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkittools) [[Benchmarks & Device Support]](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fargmaxinc\u002Fwhisperkit-benchmarks) [[WhisperKit Android]](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKitAndroid)\n\n> [!IMPORTANT]\n> WhisperKit is ideal for getting started with on-device speech-to-text. When you are ready to scale your on-device deployment with real-time transcription and speaker diarization, start your [14-day trial](https:\u002F\u002Fapp.argmaxinc.com) for [Argmax Pro SDK](https:\u002F\u002Fwww.argmaxinc.com\u002F#SDK) with 9x faster and higher accuracy models such as Nvidia Parakeet V3, [Nvidia Sortformer](https:\u002F\u002Fwww.argmaxinc.com\u002Fblog\u002Fargmax-sdk-2) streaming speaker diarization model, and a Deepgram-compatible WebSocket [local server](https:\u002F\u002Fwww.argmaxinc.com\u002Fblog\u002Fargmax-local-server) for easy integration into non-Swift projects.\n\n## Table of Contents\n\n- [Installation](#installation)\n  - [Swift Package Manager](#swift-package-manager)\n  - [Prerequisites](#prerequisites)\n  - [Xcode Steps](#xcode-steps)\n  - [Package.swift](#packageswift)\n  - [Homebrew](#homebrew)\n- [Getting Started](#getting-started)\n  - [Quick Example](#quick-example)\n  - [Model Selection](#model-selection)\n  - [Generating Models](#generating-models)\n  - [Swift CLI](#swift-cli)\n  - [WhisperKit Local Server](#whisperkit-local-server)\n    - [Building the Server](#building-the-server)\n    - [Starting the Server](#starting-the-server)\n    - [API Endpoints](#api-endpoints)\n    - [Supported Parameters](#supported-parameters)\n    - [Client Examples](#client-examples)\n    - [Generating the API Specification](#generating-the-api-specification)\n    - [Client Generation](#client-generation)\n    - [API Limitations](#api-limitations)\n    - [Fully Supported Features](#fully-supported-features)\n- [TTSKit](#ttskit)\n  - [Quick Example](#quick-example-1)\n  - [Model Selection](#model-selection-1)\n    - [Custom Voices](#custom-voices)\n    - [Real-Time Streaming Playback](#real-time-streaming-playback)\n  - [Generation Options](#generation-options)\n    - [Style Instructions (1.7B only)](#style-instructions-17b-only)\n  - 
[Saving Audio](#saving-audio)\n  - [Progress Callbacks](#progress-callbacks)\n  - [Swift CLI](#swift-cli-1)\n  - [Demo App](#demo-app)\n- [SpeakerKit](#speakerkit)\n  - [Quick Example](#quick-example-2)\n  - [Diarization Options](#diarization-options)\n  - [Combining with Transcription](#combining-with-transcription)\n  - [RTTM Output](#rttm-output)\n  - [Swift CLI](#swift-cli-2)\n- [Contributing \\& Roadmap](#contributing--roadmap)\n- [License](#license)\n- [Citation](#citation)\n\n## Installation\n\n### Swift Package Manager\n\nWhisperKit, TTSKit, and SpeakerKit are separate library products in the same Swift package. Add the package once and pick the products you need.\n\n### Prerequisites\n\n- macOS 14.0 or later.\n- Xcode 16.0 or later.\n\n### Xcode Steps\n\n1. Open your Swift project in Xcode.\n2. Navigate to `File` > `Add Package Dependencies...`.\n3. Enter the package repository URL: `https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkit`.\n4. Choose the version range or specific version.\n5. When prompted to choose library products, select **WhisperKit**, **TTSKit**, **SpeakerKit**, or any combination.\n\n### Package.swift\n\nIf you're using WhisperKit, TTSKit, or SpeakerKit as part of a swift package, you can include it in your Package.swift dependencies as follows:\n\n```swift\ndependencies: [\n    .package(url: \"https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit.git\", from: \"0.9.0\"),\n],\n```\n\nThen add the products you need as target dependencies:\n\n```swift\n.target(\n    name: \"YourApp\",\n    dependencies: [\n        \"WhisperKit\",   \u002F\u002F speech-to-text\n        \"TTSKit\",       \u002F\u002F text-to-speech\n        \"SpeakerKit\",   \u002F\u002F speaker diarization\n    ]\n),\n```\n\n### Homebrew\n\nYou can install `WhisperKit` command line app using [Homebrew](https:\u002F\u002Fbrew.sh) by running the following command:\n\n```bash\nbrew install whisperkit-cli\n```  \n\n## Getting Started\n\nTo get started with WhisperKit, you need to initialize it in your project.\n\n### Quick Example\n\nThis example demonstrates how to transcribe a local audio file:\n\n```swift\nimport WhisperKit\n\n\u002F\u002F Initialize WhisperKit with default settings\nTask {\n   let pipe = try? await WhisperKit()\n   let transcription = try? await pipe!.transcribe(audioPath: \"path\u002Fto\u002Fyour\u002Faudio.{wav,mp3,m4a,flac}\")?.text\n    print(transcription)\n}\n```\n\n### Model Selection\n\nWhisperKit automatically downloads the recommended model for the device if not specified. You can also select a specific model by passing in the model name:\n\n```swift\nlet pipe = try? await WhisperKit(WhisperKitConfig(model: \"large-v3\"))\n```\n\nThis method also supports glob search, so you can use wildcards to select a model:\n\n```swift\nlet pipe = try? await WhisperKit(WhisperKitConfig(model: \"distil*large-v3\"))\n```\n\nNote that the model search must return a single model from the source repo, otherwise an error will be thrown.\n\nFor a list of available models, see our [HuggingFace repo](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fwhisperkit-coreml).\n\n### Generating Models\n\nWhisperKit also comes with the supporting repo [`whisperkittools`](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkittools) which lets you create and deploy your own fine tuned versions of Whisper in CoreML format to HuggingFace. 
Once generated, they can be loaded by simply changing the repo name to the one used to upload the model:\n\n```swift\nlet config = WhisperKitConfig(model: \"large-v3\", modelRepo: \"username\u002Fyour-model-repo\")\nlet pipe = try? await WhisperKit(config)\n```\n\n### Swift CLI\n\nThe Swift CLI allows for quick testing and debugging outside of an Xcode project. To install it, run the following:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkit.git\ncd whisperkit\n```\n\nThen, set up the environment and download your desired model.\n\n```bash\nmake setup\nmake download-model MODEL=large-v3\n```\n\n**Note**:\n\n1. This will download only the model specified by `MODEL` (see what's available in our [HuggingFace repo](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fwhisperkit-coreml), where we use the prefix `openai_whisper-{MODEL}`)\n2. Before running `download-model`, make sure [git-lfs](https:\u002F\u002Fgit-lfs.com) is installed\n\nIf you would like to download all available models to your local folder, use this command instead:\n\n```bash\nmake download-models\n```\n\nYou can then run them via the CLI with:\n\n```bash\nswift run whisperkit-cli transcribe --model-path \"Models\u002Fwhisperkit-coreml\u002Fopenai_whisper-large-v3\" --audio-path \"path\u002Fto\u002Fyour\u002Faudio.{wav,mp3,m4a,flac}\"\n```\n\nThis should print a transcription of the audio file. If you would like to stream the audio directly from a microphone, use:\n\n```bash\nswift run whisperkit-cli transcribe --model-path \"Models\u002Fwhisperkit-coreml\u002Fopenai_whisper-large-v3\" --stream\n```\n\n### WhisperKit Local Server\n\nWhisperKit includes a local server that implements the OpenAI Audio API, allowing you to use existing OpenAI SDK clients or generate new ones. 
The server supports transcription and translation with **output streaming** capabilities (real-time transcription results as they're generated).\n\n> [!NOTE]\n> **For real-time transcription server with full-duplex streaming capabilities**, check out [WhisperKit Pro Local Server](https:\u002F\u002Fwww.argmaxinc.com\u002Fblog\u002Fargmax-local-server) which provides live audio streaming and real-time transcription for applications requiring continuous audio processing.\n\n#### Building the Server\n\n```bash\n# Build with server support\nmake build-local-server\n\n# Or manually with the build flag\nBUILD_ALL=1 swift build --product whisperkit-cli\n```\n\n#### Starting the Server\n\n```bash\n# Start server with default settings\nBUILD_ALL=1 swift run whisperkit-cli serve\n\n# Custom host and port\nBUILD_ALL=1 swift run whisperkit-cli serve --host 0.0.0.0 --port 8080\n\n# With specific model and verbose logging\nBUILD_ALL=1 swift run whisperkit-cli serve --model tiny --verbose\n\n# See all configurable parameters\nBUILD_ALL=1 swift run whisperkit-cli serve --help\n```\n\n#### API Endpoints\n\n- **POST** `\u002Fv1\u002Faudio\u002Ftranscriptions` - Transcribe audio to text\n- **POST** `\u002Fv1\u002Faudio\u002Ftranslations` - Translate audio to English\n\n#### Supported Parameters\n\n| Parameter | Description | Default |\n|-----------|-------------|---------|\n| `file` | Audio file (wav, mp3, m4a, flac) | Required |\n| `model` | Model identifier | Server default |\n| `language` | Source language code | Auto-detect |\n| `prompt` | Text to guide transcription | None |\n| `response_format` | Output format (json, verbose_json) | verbose_json |\n| `temperature` | Sampling temperature (0.0-1.0) | 0.0 |\n| `timestamp_granularities[]` | Timing detail (word, segment) | segment |\n| `stream` | Enable streaming | false |\n\n#### Client Examples\n\n**Python Client (OpenAI SDK)**\n```bash\ncd Examples\u002FServeCLIClient\u002FPython\nuv sync\npython whisperkit_client.py transcribe --file audio.wav --language en\npython whisperkit_client.py translate --file audio.wav\n```\n\nQuick Python example:\n```python\nfrom openai import OpenAI\nclient = OpenAI(base_url=\"http:\u002F\u002Flocalhost:50060\u002Fv1\")\nresult = client.audio.transcriptions.create(\n    file=open(\"audio.wav\", \"rb\"),\n    model=\"tiny\"  # Model parameter is required\n)\nprint(result.text)\n```\n\n**Swift Client (Generated from OpenAPI Spec, see ServeCLIClient\u002FSwift\u002FupdateClient.sh)**\n```bash\ncd Examples\u002FServeCLIClient\u002FSwift\nswift run whisperkit-client transcribe audio.wav --language en\nswift run whisperkit-client translate audio.wav\n```\n\n**CurlClient (Shell Scripts)**\n```bash\ncd Examples\u002FServeCLIClient\u002FCurl\nchmod +x *.sh\n.\u002Ftranscribe.sh audio.wav --language en\n.\u002Ftranslate.sh audio.wav --language es\n.\u002Ftest.sh  # Run comprehensive test suite\n```\n\n#### Generating the API Specification\n\nThe server's OpenAPI specification and code are generated from the official OpenAI API:\n\n```bash\n# Generate latest spec and server code\nmake generate-server\n```\n\n#### Client Generation\n\nYou can generate clients for any language using the OpenAPI specification, for example:\n\n```bash\n# Generate Python client\nswift run swift-openapi-generator generate scripts\u002Fspecs\u002Flocalserver_openapi.yaml \\\n  --output-directory python-client \\\n  --mode client \\\n  --mode types\n\n# Generate TypeScript client\nnpx @openapitools\u002Fopenapi-generator-cli generate \\\n  -i 
scripts\u002Fspecs\u002Flocalserver_openapi.yaml \\\n  -g typescript-fetch \\\n  -o typescript-client\n```\n\n#### API Limitations\n\nCompared to the official OpenAI API, the local server has these limitations:\n\n- **Response formats**: Only `json` and `verbose_json` supported (no plain text, SRT, VTT formats)\n- **Model selection**: Client must launch server with desired model via `--model` flag\n\n#### Fully Supported Features\n\nThe local server fully supports these OpenAI API features:\n\n- **Include parameters**: `logprobs` parameter for detailed token-level log probabilities\n- **Streaming responses**: Server-Sent Events (SSE) for real-time transcription\n- **Timestamp granularities**: Both `word` and `segment` level timing\n- **Language detection**: Automatic language detection or manual specification\n- **Temperature control**: Sampling temperature for transcription randomness\n- **Prompt text**: Text guidance for transcription style and context\n\n## TTSKit\n\nTTSKit is an on-device text-to-speech framework built on Core ML. It runs [Qwen3 TTS](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-TTS) models entirely on Apple silicon with real-time streaming playback, no server required.\n\n- macOS 15.0 or later.\n- iOS 18.0 or later.\n\n### Quick Example\n\nThis example demonstrates how to generate speech from text:\n\n```swift\nimport TTSKit\n\nTask {\n    let tts = try await TTSKit()\n    let result = try await tts.generate(text: \"Hello from TTSKit!\")\n    print(\"Generated \\(result.audioDuration)s of audio at \\(result.sampleRate)Hz\")\n}\n```\n\n`TTSKit()` automatically downloads the default 0.6B model on first run. The tokenizer and CoreML models are loaded lazily on the first `generate()` call.\n\n### Model Selection\n\nTTSKit ships two model sizes. You can select the model by passing a variant to `TTSKitConfig`:\n\n```swift\n\u002F\u002F Fast, runs on all platforms (~1 GB download)\nlet tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_0_6b))\n\n\u002F\u002F Higher quality, macOS only (~2.2 GB download, supports style instructions)\nlet tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_1_7b))\n```\n\nModels are hosted on [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fttskit-coreml) and cached locally after the first download.\n\n#### Custom Voices\n\nYou can choose from 9 built-in voices and 10 languages:\n\n```swift\nlet result = try await tts.generate(\n    text: \"こんにちは世界\",\n    speaker: .onoAnna,\n    language: .japanese\n)\n```\n\n**Voices:** `.ryan`, `.aiden`, `.onoAnna` (`\"ono-anna\"`), `.sohee`, `.eric`, `.dylan`, `.serena`, `.vivian`, `.uncleFu` (`\"uncle-fu\"`)\n\n**Languages:** `.english`, `.chinese`, `.japanese`, `.korean`, `.german`, `.french`, `.russian`, `.portuguese`, `.spanish`, `.italian`\n\n#### Real-Time Streaming Playback\n\n`play` streams audio to the device speakers frame-by-frame as it is generated:\n\n```swift\ntry await tts.play(text: \"This starts playing before generation finishes.\")\n```\n\nYou can control how much audio is buffered before playback begins. 
The default `.auto` strategy measures the first generation step and pre-buffers just enough to avoid underruns:\n\n```swift\ntry await tts.play(\n    text: \"Long passage...\",\n    playbackStrategy: .auto\n)\n```\n\nOther strategies include `.stream` (immediate, no buffer), `.buffered(seconds:)` (fixed pre-buffer), and `.generateFirst` (generate all audio first, then play).\n\n### Generation Options\n\nYou can customize sampling, chunking, and concurrency via `GenerationOptions`:\n\n```swift\n\u002F\u002F Defaults recommended by Qwen\nvar options = GenerationOptions()\noptions.temperature = 0.9\noptions.topK = 50\noptions.repetitionPenalty = 1.05\noptions.maxNewTokens = 245\n\n\u002F\u002F Long text is automatically split at sentence boundaries\noptions.chunkingStrategy = .sentence\noptions.concurrentWorkerCount = nil  \u002F\u002F nil = all chunks run concurrently with a good default for the device\n\nlet result = try await tts.generate(text: longArticle, options: options)\n```\n\n#### Style Instructions (1.7B only)\n\nThe 1.7B model accepts a natural-language style instruction that controls prosody:\n\n```swift\nvar options = GenerationOptions()\noptions.instruction = \"Speak slowly and warmly, like a storyteller.\"\n\nlet result = try await tts.generate(\n    text: \"Once upon a time...\",\n    speaker: .ryan,\n    options: options\n)\n```\n\n### Saving Audio\n\nGenerated audio can be saved to WAV or M4A:\n\n```swift\nlet result = try await tts.generate(text: \"Save me!\")\nlet outputDir = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]\n\n\u002F\u002F Save as .wav or .m4a (AAC)\ntry await AudioOutput.saveAudio(result.audio, toFolder: outputDir, filename: \"output\", format: .m4a)\n```\n\n### Progress Callbacks\n\nYou can receive per-step audio during generation. Return `false` from the callback to cancel early:\n\n```swift\nlet result = try await tts.generate(text: \"Hello!\") { progress in\n    print(\"Audio chunk: \\(progress.audio.count) samples\")\n    if let stepTime = progress.stepTime {\n        print(\"First step took \\(stepTime)s\")\n    }\n    return true  \u002F\u002F return false to cancel\n}\n```\n\n### Swift CLI\n\nThe TTS command is available through the same `whisperkit-cli` tool. You can generate speech and optionally play it back in real time:\n\n```bash\nswift run whisperkit-cli tts --text \"Hello from the command line\" --play\nswift run whisperkit-cli tts --text \"Save to file\" --output-path output.wav\nswift run whisperkit-cli tts --text \"日本語テスト\" --speaker ono-anna --language japanese\nswift run whisperkit-cli tts --text-file article.txt --model 1.7b --instruction \"Read cheerfully\"\nswift run whisperkit-cli tts --help\n```\n\n### Demo App\n\nThe [TTSKitExample](Examples\u002FTTS\u002FTTSKitExample\u002F) example app showcases real-time streaming, model management, waveform visualization, and generation history on macOS and iOS. See the [TTSKitExample README](Examples\u002FTTS\u002FTTSKitExample\u002FREADME.md) for build instructions.\n\n## SpeakerKit\n\nSpeakerKit is an on-device speaker diarization framework built on Core ML. It runs [Pyannote v4 (community-1)](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fspeakerkit-coreml) segmentation and embedding models on Apple silicon to identify and label speakers in audio. 
Read the [blog post](https:\u002F\u002Fwww.argmaxinc.com\u002Fblog\u002Fspeakerkit) for architecture details and benchmarks.\n\n- macOS 13.0 or later.\n- iOS 16.0 or later.\n\n### Quick Example\n\nThis example demonstrates how to diarize an audio file:\n\n```swift\nimport SpeakerKit\n\nTask {\n    let speakerKit = try await SpeakerKit()\n\n    let audioArray = try AudioProcessor.loadAudioAsFloatArray(fromPath: \"audio.wav\")\n    let result = try await speakerKit.diarize(audioArray: audioArray)\n\n    print(\"Detected \\(result.speakerCount) speakers\")\n    for segment in result.segments {\n        print(segment)\n    }\n}\n```\n\n`SpeakerKit()` uses `PyannoteConfig()` defaults, automatically downloading models from [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fspeakerkit-coreml) on first run. The segmenter and embedder CoreML models are loaded lazily (unless `load` is set on config) on the first `diarize()` call.\n\n### Diarization Options\n\nYou can control speaker detection via `PyannoteDiarizationOptions`:\n\n```swift\nlet audioArray = try AudioProcessor.loadAudioAsFloatArray(fromPath: \"audio.wav\")\nlet options = PyannoteDiarizationOptions(\n    numberOfSpeakers: 2,               \u002F\u002F nil = automatic detection\n    clusterDistanceThreshold: 0.6,     \u002F\u002F clustering threshold\n    useExclusiveReconciliation: false   \u002F\u002F exclusive speaker assignment per frame\n)\nlet result = try await speakerKit.diarize(audioArray: audioArray, options: options)\n```\n\nFor local models, skip the download step:\n\n```swift\nlet config = PyannoteConfig(modelFolder: \"\u002Fpath\u002Fto\u002Fmodels\")\nlet speakerKit = try await SpeakerKit(config)\n```\n\n### Combining with Transcription\n\nSpeakerKit can merge diarization results with WhisperKit transcriptions to produce speaker-attributed segments:\n\n```swift\nimport WhisperKit\nimport SpeakerKit\n\nlet whisperKit = try await WhisperKit()\nlet speakerKit = try await SpeakerKit()\n\nlet audioArray = try AudioProcessor.loadAudioAsFloatArray(fromPath: \"audio.wav\")\nlet transcription = try await whisperKit.transcribe(audioArray: audioArray)\nlet diarization = try await speakerKit.diarize(audioArray: audioArray)\n\nlet speakerSegments = diarization.addSpeakerInfo(to: transcription)\n\nfor group in speakerSegments {\n    for segment in group {\n        print(\"\\(segment.speaker): \\(segment.text)\")\n    }\n}\n```\n\nTwo strategies are available for matching speakers to transcription:\n- `.subsegment` (default) -- splits segments at word gaps, then assigns speakers\n- `.segment` -- assigns a speaker to each transcription segment as a whole\n\n### RTTM Output\n\nGenerate RTTM output:\n\n```swift\nlet speakerKit = try await SpeakerKit()\n\nlet audioArray = try AudioProcessor.loadAudioAsFloatArray(fromPath: \"meeting.wav\")\nlet diarization = try await speakerKit.diarize(audioArray: audioArray)\n\nlet rttmLines = SpeakerKit.generateRTTM(from: diarization, fileName: \"meeting\")\nfor line in rttmLines {\n    print(line)\n}\n```\n\n### Swift CLI\n\nThe diarization commands are available through the `whisperkit-cli` tool:\n\n```bash\n# Standalone diarization\nswift run whisperkit-cli diarize --audio-path audio.wav --verbose\n\n# Save RTTM output\nswift run whisperkit-cli diarize --audio-path audio.wav --rttm-path output.rttm\n\n# Specify number of speakers\nswift run whisperkit-cli diarize --audio-path audio.wav --num-speakers 3\n\n# Transcription with diarization\nswift run whisperkit-cli transcribe --audio-path 
audio.wav --diarization\n\n# See all options\nswift run whisperkit-cli diarize --help\n```\n\n## Contributing & Roadmap\n\nOur goal is to make WhisperKit better and better over time and we'd love your help! Just search the code for \"TODO\" for a variety of features that are yet to be built. Please refer to our [contribution guidelines](CONTRIBUTING.md) for submitting issues, pull requests, and coding standards, where we also have a public roadmap of features we are looking forward to building in the future.\n\n## License\n\nWhisperKit is released under the MIT License. See [LICENSE](LICENSE) for more details.\n\n## Citation\n\nIf you use WhisperKit for something cool or just find it useful, please drop us a note at [info@argmaxinc.com](mailto:info@argmaxinc.com)!\n\nIf you use WhisperKit for academic work, here is the BibTeX:\n\n```bibtex\n@misc{whisperkit-argmax,\n   title = {WhisperKit},\n   author = {Argmax, Inc.},\n   year = {2024},\n   URL = {https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit}\n}\n```\n","\u003Cdiv align=\"center\">\n  \n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit#gh-light-mode-only\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargmaxinc_WhisperKit_readme_e6f80a3c47a4.png\" alt=\"WhisperKit\" width=\"20%\" \u002F>\n\u003C\u002Fa>\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit#gh-dark-mode-only\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargmaxinc_WhisperKit_readme_bee3f9f65994.png\" alt=\"WhisperKit\" width=\"20%\" \u002F>\n\u003C\u002Fa>\n\n# WhisperKit\n\n[![Tests](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkit\u002Factions\u002Fworkflows\u002Frelease-tests.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkit\u002Factions\u002Fworkflows\u002Frelease-tests.yml)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fargmaxinc\u002Fwhisperkit?logo=github&logoColor=969da4&label=License&labelColor=353a41&color=32d058)](LICENSE.md)\n[![Supported Swift Version](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https%3A%2F%2Fswiftpackageindex.com%2Fapi%2Fpackages%2Fargmaxinc%2FWhisperKit%2Fbadge%3Ftype%3Dswift-versions&labelColor=353a41&color=32d058)](https:\u002F\u002Fswiftpackageindex.com\u002Fargmaxinc\u002FWhisperKit) [![Supported Platforms](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https%3A%2F%2Fswiftpackageindex.com%2Fapi%2Fpackages%2Fargmaxinc%2FWhisperKit%2Fbadge%3Ftype%3Dplatforms&labelColor=353a41&color=32d058)](https:\u002F\u002Fswiftpackageindex.com\u002Fargmaxinc\u002FWhisperKit)\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fdiscord\u002F1171912382512115722?style=flat&logo=discord&logoColor=969da4&label=Discord&labelColor=353a41&color=32d058&link=https%3A%2F%2Fdiscord.gg%2FG5F5GZGecC)](https:\u002F\u002Fdiscord.gg\u002FG5F5GZGecC)\n\n\n\u003C\u002Fdiv>\n\nWhisperKit 是一个由 [Argmax](https:\u002F\u002Fwww.takeargmax.com) 开发的框架，用于在设备端部署最先进的语音转文本系统（例如 [Whisper](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper)），并提供实时流式传输、词级时间戳、语音活动检测、说话人分离等高级功能。\n\n[[TestFlight 演示应用]](https:\u002F\u002Ftestflight.apple.com\u002Fjoin\u002FQ1cywTJw) [[Python 工具]](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkittools) [[基准测试与设备支持]](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fargmaxinc\u002Fwhisperkit-benchmarks) [[WhisperKit Android]](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKitAndroid)\n\n> [!IMPORTANT]\n> WhisperKit 
非常适合开始使用设备端语音转文本功能。当您准备好通过实时转录和说话人分离来扩展设备端部署时，请启动您的 [14 天试用](https:\u002F\u002Fapp.argmaxinc.com)，体验 [Argmax Pro SDK](https:\u002F\u002Fwww.argmaxinc.com\u002F#SDK)，其中包含速度提升 9 倍且准确率更高的模型，如 Nvidia Parakeet V3、[Nvidia Sortformer](https:\u002F\u002Fwww.argmaxinc.com\u002Fblog\u002Fargmax-sdk-2) 流式说话人分离模型，以及兼容 Deepgram 的 WebSocket [本地服务器](https:\u002F\u002Fwww.argmaxinc.com\u002Fblog\u002Fargmax-local-server)，便于集成到非 Swift 项目中。\n\n## 目录\n\n- [安装](#installation)\n  - [Swift 包管理器](#swift-package-manager)\n  - [先决条件](#prerequisites)\n  - [Xcode 步骤](#xcode-steps)\n  - [Package.swift](#packageswift)\n  - [Homebrew](#homebrew)\n- [快速入门](#getting-started)\n  - [快速示例](#quick-example)\n  - [模型选择](#model-selection)\n  - [生成模型](#generating-models)\n  - [Swift CLI](#swift-cli)\n  - [WhisperKit 本地服务器](#whisperkit-local-server)\n    - [构建服务器](#building-the-server)\n    - [启动服务器](#starting-the-server)\n    - [API 端点](#api-endpoints)\n    - [支持的参数](#supported-parameters)\n    - [客户端示例](#client-examples)\n    - [生成 API 规范](#generating-the-api-specification)\n    - [客户端生成](#client-generation)\n    - [API 限制](#api-limitations)\n    - [完全支持的功能](#fully-supported-features)\n- [TTSKit](#ttskit)\n  - [快速示例](#quick-example-1)\n  - [模型选择](#model-selection-1)\n    - [自定义声音](#custom-voices)\n    - [实时流式播放](#real-time-streaming-playback)\n  - [生成选项](#generation-options)\n    - [风格指令（仅限 1.7B 模型）](#style-instructions-17b-only)\n  - [保存音频](#saving-audio)\n  - [进度回调](#progress-callbacks)\n  - [Swift CLI](#swift-cli-1)\n  - [演示应用](#demo-app)\n- [SpeakerKit](#speakerkit)\n  - [快速示例](#quick-example-2)\n  - [分离选项](#diarization-options)\n  - [与转录结合](#combining-with-transcription)\n  - [RTTM 输出](#rttm-output)\n  - [Swift CLI](#swift-cli-2)\n- [贡献与路线图](#contributing--roadmap)\n- [许可证](#license)\n- [引用](#citation)\n\n## 安装\n\n### Swift 包管理器\n\nWhisperKit、TTSKit 和 SpeakerKit 是同一个 Swift 包中的独立库产品。只需添加一次包，即可根据需要选择所需的产品。\n\n### 先决条件\n\n- macOS 14.0 或更高版本。\n- Xcode 16.0 或更高版本。\n\n### Xcode 步骤\n\n1. 在 Xcode 中打开您的 Swift 项目。\n2. 导航至 `File` > `Add Package Dependencies...`。\n3. 输入包仓库 URL：`https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkit`。\n4. 选择版本范围或特定版本。\n5. 当提示选择库产品时，选择 **WhisperKit**、**TTSKit**、**SpeakerKit**，或任意组合。\n\n### Package.swift\n\n如果您将 WhisperKit、TTSKit 或 SpeakerKit 作为 Swift 包的一部分使用，可以在您的 Package.swift 依赖项中按如下方式添加：\n\n```swift\ndependencies: [\n    .package(url: \"https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit.git\", from: \"0.9.0\"),\n],\n```\n\n然后将所需的产品作为目标依赖项添加：\n\n```swift\n.target(\n    name: \"YourApp\",\n    dependencies: [\n        \"WhisperKit\",   \u002F\u002F 语音转文本\n        \"TTSKit\",       \u002F\u002F 文本转语音\n        \"SpeakerKit\",   \u002F\u002F 说话人分离\n    ]\n),\n```\n\n### Homebrew\n\n您可以使用 [Homebrew](https:\u002F\u002Fbrew.sh) 安装 `WhisperKit` 命令行应用程序，方法是运行以下命令：\n\n```bash\nbrew install whisperkit-cli\n```  \n\n## 快速入门\n\n要开始使用 WhisperKit，您需要在项目中对其进行初始化。\n\n### 快速示例\n\n此示例演示如何转录本地音频文件：\n\n```swift\nimport WhisperKit\n\n\u002F\u002F 使用默认设置初始化 WhisperKit\nTask {\n   let pipe = try? await WhisperKit()\n   let transcription = try? await pipe!.transcribe(audioPath: \"path\u002Fto\u002Fyour\u002Faudio.{wav,mp3,m4a,flac}\")?.text\n    print(transcription)\n}\n```\n\n### 模型选择\n\n如果未指定，WhisperKit 会自动下载适用于当前设备的推荐模型。您也可以通过传递模型名称来选择特定模型：\n\n```swift\nlet pipe = try? await WhisperKit(WhisperKitConfig(model: \"large-v3\"))\n```\n\n此方法还支持通配符搜索，因此您可以使用通配符来选择模型：\n\n```swift\nlet pipe = try? 
await WhisperKit(WhisperKitConfig(model: \"distil*large-v3\"))\n```\n\n请注意，模型搜索必须从源仓库返回单个模型，否则将抛出错误。\n\n有关可用模型的列表，请参阅我们的 [HuggingFace 仓库](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fwhisperkit-coreml)。\n\n### 生成模型\n\nWhisperKit 还附带了一个支持性仓库 [`whisperkittools`](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkittools)，它允许你创建并以 CoreML 格式将自己微调过的 Whisper 版本部署到 HuggingFace。一旦生成，只需将仓库名称更改为用于上传模型的那个即可加载：\n\n```swift\nlet config = WhisperKitConfig(model: \"large-v3\", modelRepo: \"username\u002Fyour-model-repo\")\nlet pipe = try? await WhisperKit(config)\n```\n\n### Swift CLI\n\nSwift CLI 允许在 Xcode 项目之外快速进行测试和调试。要安装它，请运行以下命令：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkit.git\ncd whisperkit\n```\n\n然后设置环境并下载你想要的模型。\n\n```bash\nmake setup\nmake download-model MODEL=large-v3\n```\n\n**注意**：\n\n1. 这只会下载由 `MODEL` 指定的模型（请参阅我们 [HuggingFace 仓库](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fwhisperkit-coreml) 中可用的模型，我们在那里使用前缀 `openai_whisper-{MODEL}`）。\n2. 在运行 `download-model` 之前，请确保已安装 [git-lfs](https:\u002F\u002Fgit-lfs.com)。\n\n如果你想将所有可用模型下载到本地文件夹，可以使用以下命令代替：\n\n```bash\nmake download-models\n```\n\n之后，你可以通过 CLI 运行它们：\n\n```bash\nswift run whisperkit-cli transcribe --model-path \"Models\u002Fwhisperkit-coreml\u002Fopenai_whisper-large-v3\" --audio-path \"path\u002Fto\u002Fyour\u002Faudio.{wav,mp3,m4a,flac}\"\n```\n\n这应该会打印出音频文件的转录内容。如果你希望直接从麦克风流式传输音频，可以使用：\n\n```bash\nswift run whisperkit-cli transcribe --model-path \"Models\u002Fwhisperkit-coreml\u002Fopenai_whisper-large-v3\" --stream\n```\n\n### WhisperKit 本地服务器\n\nWhisperKit 包含一个实现了 OpenAI 音频 API 的本地服务器，允许你使用现有的 OpenAI SDK 客户端或生成新的客户端。该服务器支持转录和翻译，并具备 **输出流式传输** 功能（即在生成转录结果时实时显示）。\n\n> [!NOTE]\n> **对于具有全双工流式传输功能的实时转录服务器**，请查看 [WhisperKit Pro 本地服务器](https:\u002F\u002Fwww.argmaxinc.com\u002Fblog\u002Fargmax-local-server)，它为需要连续音频处理的应用程序提供实时音频流和转录功能。\n\n#### 构建服务器\n\n```bash\n# 带服务器支持的构建\nmake build-local-server\n\n# 或者手动使用构建标志\nBUILD_ALL=1 swift build --product whisperkit-cli\n```\n\n#### 启动服务器\n\n```bash\n# 使用默认设置启动服务器\nBUILD_ALL=1 swift run whisperkit-cli serve\n\n# 自定义主机和端口\nBUILD_ALL=1 swift run whisperkit-cli serve --host 0.0.0.0 --port 8080\n\n# 使用特定模型并启用详细日志记录\nBUILD_ALL=1 swift run whisperkit-cli serve --model tiny --verbose\n\n# 查看所有可配置参数\nBUILD_ALL=1 swift run whisperkit-cli serve --help\n```\n\n#### API 端点\n\n- **POST** `\u002Fv1\u002Faudio\u002Ftranscriptions` - 将音频转录为文本\n- **POST** `\u002Fv1\u002Faudio\u002Ftranslations` - 将音频翻译成英语\n\n#### 支持的参数\n\n| 参数 | 描述 | 默认值 |\n|-----------|-------------|---------|\n| `file` | 音频文件（wav、mp3、m4a、flac） | 必需 |\n| `model` | 模型标识符 | 服务器默认 |\n| `language` | 源语言代码 | 自动检测 |\n| `prompt` | 用于指导转录的文本 | 无 |\n| `response_format` | 输出格式（json、verbose_json） | verbose_json |\n| `temperature` | 采样温度（0.0–1.0） | 0.0 |\n| `timestamp_granularities[]` | 时间精度（词、片段） | 片段 |\n| `stream` | 启用流式传输 | false |\n\n#### 客户端示例\n\n**Python 客户端（OpenAI SDK）**\n```bash\ncd Examples\u002FServeCLIClient\u002FPython\nuv sync\npython whisperkit_client.py transcribe --file audio.wav --language en\npython whisperkit_client.py translate --file audio.wav\n```\n\n快速 Python 示例：\n```python\nfrom openai import OpenAI\nclient = OpenAI(base_url=\"http:\u002F\u002Flocalhost:50060\u002Fv1\")\nresult = client.audio.transcriptions.create(\n    file=open(\"audio.wav\", \"rb\"),\n    model=\"tiny\"  # 模型参数是必需的\n)\nprint(result.text)\n```\n\n**Swift 客户端（根据 OpenAPI 规范生成，详见 ServeCLIClient\u002FSwift\u002FupdateClient.sh）**\n```bash\ncd Examples\u002FServeCLIClient\u002FSwift\nswift run whisperkit-client 
transcribe audio.wav --language en\nswift run whisperkit-client translate audio.wav\n```\n\n**Curl 客户端（Shell 脚本）**\n```bash\ncd Examples\u002FServeCLIClient\u002FCurl\nchmod +x *.sh\n.\u002Ftranscribe.sh audio.wav --language en\n.\u002Ftranslate.sh audio.wav --language es\n.\u002Ftest.sh  # 运行全面的测试套件\n```\n\n#### 生成 API 规范\n\n服务器的 OpenAPI 规范和代码是从官方 OpenAI API 生成的：\n\n```bash\n# 生成最新的规范和服务器代码\nmake generate-server\n```\n\n#### 客户端生成\n\n你可以使用 OpenAPI 规范为任何语言生成客户端，例如：\n\n```bash\n# 生成 Python 客户端\nswift run swift-openapi-generator generate scripts\u002Fspecs\u002Flocalserver_openapi.yaml \\\n  --output-directory python-client \\\n  --mode client \\\n  --mode types\n\n# 生成 TypeScript 客户端\nnpx @openapitools\u002Fopenapi-generator-cli generate \\\n  -i scripts\u002Fspecs\u002Flocalserver_openapi.yaml \\\n  -g typescript-fetch \\\n  -o typescript-client\n```\n\n#### API 的局限性\n\n与官方 OpenAI API 相比，本地服务器存在以下限制：\n\n- **响应格式**：仅支持 `json` 和 `verbose_json`（不支持纯文本、SRT、VTT 格式）。\n- **模型选择**：客户端必须通过 `--model` 标志指定所需的模型来启动服务器。\n\n#### 完全支持的功能\n\n本地服务器完全支持以下 OpenAI API 功能：\n\n- **包含参数**：`logprobs` 参数可用于获取详细的标记级对数概率。\n- **流式响应**：使用服务器发送事件（SSE）实现实时转录。\n- **时间戳粒度**：支持词级和片段级时间戳。\n- **语言检测**：自动检测语言或手动指定。\n- **温度控制**：用于控制转录随机性的采样温度。\n- **提示文本**：用于指导转录风格和上下文的文本。\n\n## TTSKit\n\nTTSKit 是一个基于 Core ML 的设备端文本到语音框架。它在 Apple 芯片上完全运行 [Qwen3 TTS](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-TTS) 模型，支持实时流式播放，无需服务器。\n\n- macOS 15.0 或更高版本。\n- iOS 18.0 或更高版本。\n\n### 快速示例\n\n此示例演示如何从文本生成语音：\n\n```swift\nimport TTSKit\n\nTask {\n    let tts = try await TTSKit()\n    let result = try await tts.generate(text: \"Hello from TTSKit!\")\n    print(\"生成了 \\(result.audioDuration)s 的音频，采样率为 \\(result.sampleRate)Hz\")\n}\n```\n\n`TTSKit()` 会在首次运行时自动下载默认的 0.6B 模型。分词器和 CoreML 模型会在首次调用 `generate()` 时按需加载。\n\n### 模型选择\n\nTTSKit 提供两种模型尺寸。您可以通过将变体传递给 `TTSKitConfig` 来选择模型：\n\n```swift\n\u002F\u002F 速度快，可在所有平台上运行（约 1 GB 下载）\nlet tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_0_6b))\n\n\u002F\u002F 质量更高，仅限 macOS 平台（约 2.2 GB 下载，支持风格指令）\nlet tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_1_7b))\n```\n\n模型托管在 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fttskit-coreml) 上，并在首次下载后缓存到本地。\n\n#### 自定义语音\n\n您可以从 9 种内置语音和 10 种语言中进行选择：\n\n```swift\nlet result = try await tts.generate(\n    text: \"こんにちは世界\",\n    speaker: .onoAnna,\n    language: .japanese\n)\n```\n\n**语音：** `.ryan`、`.aiden`、`.onoAnna`（`\"ono-anna\"`）、`.sohee`、`.eric`、`.dylan`、`.serena`、`.vivian`、`.uncleFu`（`\"uncle-fu\"`）\n\n**语言：** `.english`、`.chinese`、`.japanese`、`.korean`、`.german`、`.french`、`.russian`、`.portuguese`、`.spanish`、`.italian`\n\n#### 实时流式播放\n\n`play` 会在音频生成的同时逐帧流式传输到设备扬声器：\n\n```swift\ntry await tts.play(text: \"这会在生成完成之前就开始播放。\")\n```\n\n您可以控制在开始播放之前缓冲的音频量。默认的 `.auto` 策略会测量第一个生成步骤，并预缓冲刚好足以避免欠载的音频：\n\n```swift\ntry await tts.play(\n    text: \"长段文字...\",\n    playbackStrategy: .auto\n)\n```\n\n其他策略包括 `.stream`（立即播放，无缓冲）、`.buffered(seconds:)`（固定预缓冲）和 `.generateFirst`（先生成所有音频，再播放）。\n\n### 生成选项\n\n您可以通过 `GenerationOptions` 自定义采样、分块和并发设置：\n\n```swift\n\u002F\u002F Qwen 推荐的默认值\nvar options = GenerationOptions()\noptions.temperature = 0.9\noptions.topK = 50\noptions.repetitionPenalty = 1.05\noptions.maxNewTokens = 245\n\n\u002F\u002F 长文本会自动按句号分割\noptions.chunkingStrategy = .sentence\noptions.concurrentWorkerCount = nil  \u002F\u002F nil 表示所有分块同时运行，为当前设备提供一个良好的默认值\n\nlet result = try await tts.generate(text: longArticle, options: options)\n```\n\n#### 风格指令（仅 1.7B 模型）\n\n1.7B 模型接受自然语言风格指令来控制韵律：\n\n```swift\nvar options = 
GenerationOptions()\noptions.instruction = \"像说书人一样缓慢而温暖地说话。\"\n\nlet result = try await tts.generate(\n    text: \"从前有……\",\n    speaker: .ryan,\n    options: options\n)\n```\n\n### 保存音频\n\n生成的音频可以保存为 WAV 或 M4A 格式：\n\n```swift\nlet result = try await tts.generate(text: \"救救我！\")\nlet outputDir = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]\n\n\u002F\u002F 可保存为 .wav 或 .m4a（AAC）\ntry await AudioOutput.saveAudio(result.audio, toFolder: outputDir, filename: \"output\", format: .m4a)\n```\n\n### 进度回调\n\n您可以在生成过程中接收每一步的音频。如果回调返回 `false`，则会提前取消生成：\n\n```swift\nlet result = try await tts.generate(text: \"你好！\") { progress in\n    print(\"音频片段：\\(progress.audio.count) 个样本\")\n    if let stepTime = progress.stepTime {\n        print(\"第一步耗时 \\(stepTime) 秒\")\n    }\n    return true  \u002F\u002F 返回 false 取消\n}\n```\n\n### Swift 命令行工具\n\nTTS 命令可通过相同的 `whisperkit-cli` 工具使用。您可以生成语音，并可选择实时播放：\n\n```bash\nswift run whisperkit-cli tts --text \"命令行问候语\" --play\nswift run whisperkit-cli tts --text \"保存到文件\" --output-path output.wav\nswift run whisperkit-cli tts --text \"日本語テスト\" --speaker ono-anna --language japanese\nswift run whisperkit-cli tts --text-file article.txt --model 1.7b --instruction \"欢快地朗读\"\nswift run whisperkit-cli tts --help\n```\n\n### 示例应用\n\n[TTSKitExample](Examples\u002FTTS\u002FTTSKitExample\u002F) 示例应用展示了实时流式传输、模型管理、波形可视化以及 macOS 和 iOS 上的生成历史记录。请参阅 [TTSKitExample README](Examples\u002FTTS\u002FTTSKitExample\u002FREADME.md) 以获取构建说明。\n\n## SpeakerKit\n\nSpeakerKit 是一个基于 Core ML 的设备端说话人分离框架。它在 Apple 芯片上运行 [Pyannote v4 (community-1)](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fspeakerkit-coreml) 分割和嵌入模型，用于识别和标记音频中的说话人。有关架构细节和基准测试，请参阅[博客文章](https:\u002F\u002Fwww.argmaxinc.com\u002Fblog\u002Fspeakerkit)。\n\n- macOS 13.0 或更高版本。\n- iOS 16.0 或更高版本。\n\n### 快速示例\n\n此示例演示如何对音频文件进行说话人分离：\n\n```swift\nimport SpeakerKit\n\nTask {\n    let speakerKit = try await SpeakerKit()\n\n    let audioArray = try AudioProcessor.loadAudioAsFloatArray(fromPath: \"audio.wav\")\n    let result = try await speakerKit.diarize(audioArray: audioArray)\n\n    print(\"检测到 \\(result.speakerCount) 位说话人\")\n    for segment in result.segments {\n        print(segment)\n    }\n}\n```\n\n`SpeakerKit()` 使用 `PyannoteConfig()` 的默认设置，在首次运行时会自动从 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fspeakerkit-coreml) 下载模型。分段器和嵌入器的 CoreML 模型会在首次调用 `diarize()` 时延迟加载（除非在配置中设置了 `load`）。\n\n### 说话人分离选项\n\n您可以通过 `PyannoteDiarizationOptions` 控制说话人检测：\n\n```swift\nlet audioArray = try AudioProcessor.loadAudioAsFloatArray(fromPath: \"audio.wav\")\nlet options = PyannoteDiarizationOptions(\n    numberOfSpeakers: 2,               \u002F\u002F nil 表示自动检测\n    clusterDistanceThreshold: 0.6,     \u002F\u002F 聚类阈值\n    useExclusiveReconciliation: false   \u002F\u002F 每帧只分配一位说话人\n)\nlet result = try await speakerKit.diarize(audioArray: audioArray, options: options)\n```\n\n如果使用本地模型，则可跳过下载步骤：\n\n```swift\nlet config = PyannoteConfig(modelFolder: \"\u002Fpath\u002Fto\u002Fmodels\")\nlet speakerKit = try await SpeakerKit(config)\n```\n\n### 与转录结合\n\nSpeakerKit 可以将说话人分离结果与 WhisperKit 转录合并，生成带有说话人信息的片段：\n\n```swift\nimport WhisperKit\nimport SpeakerKit\n\nlet whisperKit = try await WhisperKit()\nlet speakerKit = try await SpeakerKit()\n\nlet audioArray = try AudioProcessor.loadAudioAsFloatArray(fromPath: \"audio.wav\")\nlet transcription = try await whisperKit.transcribe(audioArray: audioArray)\nlet diarization = try await speakerKit.diarize(audioArray: audioArray)\n\nlet speakerSegments = 
diarization.addSpeakerInfo(to: transcription)\n\nfor group in speakerSegments {\n    for segment in group {\n        print(\"\\(segment.speaker): \\(segment.text)\")\n    }\n}\n```\n\n有两种策略可用于将说话人与转录匹配：\n- `.subsegment`（默认）—— 在词间断处分割片段，然后分配说话人\n- `.segment`—— 将一位说话人分配给整个转录片段\n\n### RTTM 输出\n\n生成 RTTM 输出：\n\n```swift\nlet speakerKit = try await SpeakerKit()\n\nlet audioArray = try AudioProcessor.loadAudioAsFloatArray(fromPath: \"meeting.wav\")\nlet diarization = try await speakerKit.diarize(audioArray: audioArray)\n\nlet rttmLines = SpeakerKit.generateRTTM(from: diarization, fileName: \"meeting\")\nfor line in rttmLines {\n    print(line)\n}\n```\n\n### Swift 命令行工具\n\n说话人分离命令可通过 `whisperkit-cli` 工具使用：\n\n```bash\n# 独立的说话人分离\nswift run whisperkit-cli diarize --audio-path audio.wav --verbose\n\n# 保存 RTTM 输出\nswift run whisperkit-cli diarize --audio-path audio.wav --rttm-path output.rttm\n\n# 指定说话人数\nswift run whisperkit-cli diarize --audio-path audio.wav --num-speakers 3\n\n# 带有说话人分离的转录\nswift run whisperkit-cli transcribe --audio-path audio.wav --diarization\n\n# 查看所有选项\nswift run whisperkit-cli diarize --help\n```\n\n## 贡献与路线图\n\n我们的目标是不断改进 WhisperKit，非常欢迎您的帮助！只需在代码中搜索“TODO”，即可找到许多尚未实现的功能。请参阅我们的[贡献指南](CONTRIBUTING.md)，了解如何提交问题、拉取请求以及编码规范；我们也在其中公开了未来计划实现的功能路线图。\n\n## 许可证\n\nWhisperKit 采用 MIT 许可证发布。更多详情请参阅 [LICENSE](LICENSE) 文件。\n\n## 引用\n\n如果您使用 WhisperKit 完成了很棒的工作，或者只是觉得它很有用，请发送邮件至 [info@argmaxinc.com](mailto:info@argmaxinc.com) 告诉我们！\n\n如果您在学术研究中使用 WhisperKit，可以使用以下 BibTeX 格式：\n\n```bibtex\n@misc{whisperkit-argmax,\n   title = {WhisperKit},\n   author = {Argmax, Inc.},\n   year = {2024},\n   URL = {https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit}\n}\n```","# WhisperKit 快速上手指南\n\nWhisperKit 是由 Argmax 开发的框架，专为在 Apple 设备端（On-Device）部署先进的语音转文本（Speech-to-Text）系统而设计。它支持实时流式转录、单词级时间戳、语音活动检测（VAD）及说话人分离等功能。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**：macOS 14.0 (Sonoma) 或更高版本。\n*   **开发工具**：Xcode 16.0 或更高版本。\n*   **依赖工具**（仅 CLI 模式需要）：\n    *   [Git LFS](https:\u002F\u002Fgit-lfs.com)：用于下载大型模型文件。\n    *   Homebrew（可选）：用于快速安装命令行工具。\n\n## 安装步骤\n\n您可以选择将 WhisperKit 作为 Swift 包集成到项目中，或直接安装命令行工具。\n\n### 方式一：Swift Package Manager (推荐用于 App 开发)\n\n1.  在 Xcode 中打开您的 Swift 项目。\n2.  点击菜单栏 `File` > `Add Package Dependencies...`。\n3.  输入仓库地址：`https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkit`。\n4.  选择版本范围。\n5.  在产品选择界面，勾选您需要的模块：\n    *   `WhisperKit` (语音转文本)\n    *   `TTSKit` (文本转语音)\n    *   `SpeakerKit` (说话人分离)\n\n**或在 `Package.swift` 中手动配置：**\n\n```swift\ndependencies: [\n    .package(url: \"https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit.git\", from: \"0.9.0\"),\n],\n.target(\n    name: \"YourApp\",\n    dependencies: [\n        \"WhisperKit\",   \u002F\u002F 语音转文本\n        \"TTSKit\",       \u002F\u002F 文本转语音\n        \"SpeakerKit\",   \u002F\u002F 说话人分离\n    ]\n),\n```\n\n### 方式二：Homebrew (推荐用于命令行测试)\n\n如果您只想快速体验或进行脚本测试，可以使用 Homebrew 安装 CLI 工具：\n\n```bash\nbrew install whisperkit-cli\n```\n\n### 方式三：源码安装 CLI (用于调试或自定义构建)\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002Fwhisperkit.git\ncd whisperkit\n\n# 初始化环境并下载指定模型 (例如 large-v3)\nmake setup\nmake download-model MODEL=large-v3\n\n# 或者下载所有可用模型\n# make download-models\n```\n\n## 基本使用\n\n### 1. Swift 代码集成 (App 开发)\n\n以下是最简单的转录本地音频文件的示例：\n\n```swift\nimport WhisperKit\n\n\u002F\u002F 初始化 WhisperKit (默认会自动下载适合当前设备的模型)\nTask {\n   let pipe = try? await WhisperKit()\n   \n   \u002F\u002F 转录音频文件 (支持 wav, mp3, m4a, flac)\n   let transcription = try? 
await pipe!.transcribe(audioPath: \"path\u002Fto\u002Fyour\u002Faudio.wav\")?.text\n   \n   print(transcription)\n}\n```\n\n**指定模型：**\n您也可以手动指定模型名称（支持通配符）：\n\n```swift\nlet pipe = try? await WhisperKit(WhisperKitConfig(model: \"large-v3\"))\n\u002F\u002F 或使用通配符匹配\n\u002F\u002F let pipe = try? await WhisperKit(WhisperKitConfig(model: \"distil*large-v3\"))\n```\n\n### 2. 命令行使用 (CLI)\n\n如果您通过源码安装了 CLI，可以使用以下命令进行转录：\n\n**转录本地文件：**\n```bash\nswift run whisperkit-cli transcribe --model-path \"Models\u002Fwhisperkit-coreml\u002Fopenai_whisper-large-v3\" --audio-path \"path\u002Fto\u002Fyour\u002Faudio.wav\"\n```\n\n**麦克风实时流式转录：**\n```bash\nswift run whisperkit-cli transcribe --model-path \"Models\u002Fwhisperkit-coreml\u002Fopenai_whisper-large-v3\" --stream\n```\n\n### 3. 启动本地服务器 (Local Server)\n\nWhisperKit 包含一个兼容 OpenAI Audio API 的本地服务器，允许您使用现有的 OpenAI SDK 客户端调用本地模型。\n\n**启动服务器：**\n```bash\n# 使用默认设置启动\nBUILD_ALL=1 swift run whisperkit-cli serve\n\n# 自定义端口和模型\nBUILD_ALL=1 swift run whisperkit-cli serve --host 0.0.0.0 --port 8080 --model tiny\n```\n\n**Python 客户端调用示例：**\n```python\nfrom openai import OpenAI\n\n# 指向本地服务器\nclient = OpenAI(base_url=\"http:\u002F\u002Flocalhost:50060\u002Fv1\")\n\nresult = client.audio.transcriptions.create(\n    file=open(\"audio.wav\", \"rb\"),\n    model=\"tiny\"  # 必须指定模型参数\n)\nprint(result.text)\n```\n\n> **提示**：对于生产环境或需要更高性能（如 Nvidia Parakeet 模型）的场景，可以参考 Argmax Pro SDK 获取更高级的本地服务器解决方案。","一位独立开发者正在为视障用户打造一款运行在 iPhone 上的实时会议记录助手，需要在无网络环境下将语音瞬间转化为带时间戳的文字。\n\n### 没有 WhisperKit 时\n- **依赖云端服务**：必须将录音上传至服务器处理，不仅产生高昂的 API 调用费用，且在地铁或地下室等弱网环境下完全无法使用。\n- **隐私泄露风险**：敏感的会议对话内容需传输至第三方服务器，难以向注重隐私的用户承诺数据本地化存储。\n- **延迟体验糟糕**：受限于网络往返时间和服务器排队，用户说完话后往往需要等待数秒才能看到文字，打断沟通流畅度。\n- **功能集成困难**：若要实现说话人区分（Diarization）和单词级时间戳，需自行拼接多个开源模型，导致 App 体积臃肿且维护成本极高。\n\n### 使用 WhisperKit 后\n- **纯本地离线运行**：利用 Apple Silicon 芯片算力，直接在 iPhone 端完成高精度语音识别，彻底摆脱对网络连接的依赖。\n- **数据绝对安全**：所有音频数据仅在设备内存中处理，从不离开用户手机，完美满足医疗、法律等敏感场景的合规要求。\n- **毫秒级实时响应**：支持流式转录，话音未落文字即现，配合语音活动检测（VAD）自动启停，提供如字幕般丝滑的体验。\n- **开箱即用的高级特性**：通过简单的 Swift 包集成，即可原生获得说话人区分和精确到词的时间戳功能，大幅缩短开发周期。\n\nWhisperKit 让开发者能够以极低的成本，在苹果设备上构建出既保护隐私又具备专业级实时性能的语音交互应用。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fargmaxinc_WhisperKit_64a53f29.png","argmaxinc","Argmax","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fargmaxinc_6a0a939c.png","",null,"info@argmaxinc.com","argmaxinc.com","https:\u002F\u002Fgithub.com\u002Fargmaxinc",[81,85,89,93],{"name":82,"color":83,"percentage":84},"Swift","#F05138",97.6,{"name":86,"color":87,"percentage":88},"Ruby","#701516",1,{"name":90,"color":91,"percentage":92},"Python","#3572A5",0.9,{"name":94,"color":95,"percentage":96},"Makefile","#427819",0.5,5953,541,"2026-04-08T10:47:28","MIT",4,"macOS","未说明 (基于 Apple Silicon Neural Engine 或 GPU 加速，无需 NVIDIA CUDA)","未说明",{"notes":106,"python":107,"dependencies":108},"该工具主要面向 Apple 生态系统，核心运行环境为 macOS 14.0+ 和 Xcode 16.0+。它利用 CoreML 格式在设备端（On-device）运行，依赖 Apple Silicon 的神经网络引擎进行加速，而非传统的 NVIDIA GPU。虽然提供了 Python 辅助工具库 (whisperkittools) 用于模型转换，但核心推理引擎是 Swift 编写的。命令行工具可通过 Homebrew 安装或通过源码编译运行。","未说明 (主要基于 Swift，Python 仅用于辅助工具 whisperkittools)",[109,110,111,112],"Xcode 16.0+","Swift 5.9+","git-lfs","Homebrew (可选，用于安装 
CLI)",[35,14,114],"音频",[116,117,118,119,120,121,122,123,124],"inference","ios","speech-recognition","swift","whisper","transformers","macos","visionos","watchos","2026-03-27T02:49:30.150509","2026-04-08T23:34:31.137741",[128,133,138,143,148,153],{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},25230,"如何处理非英语语言（如西班牙语、葡萄牙语）的自动翻译问题？","默认情况下，某些模型可能会自动将非英语音频翻译为英语。若要强制检测语言而不进行翻译，可以使用新增的 `detectLanguage` 解码选项。该选项会忽略 `usePrefillPrompt` 设置并强制执行语言检测。代码示例如下：\n\n```swift\nlet whisperKit = try await WhisperKit(\n    modelFolder: tinyModelPath(),\n    verbose: true,\n    logLevel: .debug\n)\n\n\u002F\u002F 仅检测语言时，将 sampleLength 设为 1 且不使用预填充提示\nlet optionsDetectOnly = DecodingOptions(task: .transcribe, temperatureFallbackCount: 0, sampleLength: 1, detectLanguage: true)\n\nlet result = try await whisperKit.transcribe(audioPath: audioFilePath, decodeOptions: optionsDetectOnly)\n```\n此功能在 `whisper-large-v3-turbo` 模型上表现不同，其他 v3 large 模型可能需要显式启用此选项。","https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fissues\u002F98",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},25231,"在 M1 设备上使用 Turbo 模型和 `.cpuAndGPU` 选项时出现内存泄漏怎么办？","在基础 M1 设备（如 Macmini9）上运行 Turbo 模型时，若强制使用 `.cpuAndGPU` 配置，多次销毁和重新实例化 WhisperKit 可能会导致内存泄漏。虽然这是一个已知限制（Turbo 模型在旧设备上不受官方支持），但可以通过以下方式缓解：\n1. 尽量避免频繁地销毁和重建实例，尝试复用现有的 WhisperKit 实例。\n2. 如果必须重新加载，请监控内存使用情况并在必要时重启应用。\n3. 社区建议记录具体的失败模式并提供复现步骤给维护者，以便未来优化。目前尚无完美的代码级修复方案，主要受限于硬件对特定模型架构的支持程度。","https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fissues\u002F265",{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},25232,"为什么模型无法加载或卡在“specialization”阶段？","在 M1 Macbook Pro 等设备上，模型加载失败或卡在进度条（通常是 specialization 阶段）可能由以下原因引起：\n1. **模型文件损坏或不完整**：错误信息 `A valid manifest does not exist` 表明模型目录下的 `Manifest.json` 缺失或损坏。请尝试删除本地模型缓存并重新下载。\n2. **计算单元配置问题**：尝试调整 `ModelComputeOptions`。例如，将编码器和解码器的计算单元设置为 `.cpuAndNeuralEngine` 或仅 `.cpu` 进行测试。\n3. **等待时间不足**：首次加载大型模型（尤其是进行专用化编译时）可能需要较长时间（有时超过 30 分钟），请耐心等待。\n\n配置示例：\n```swift\nprivate var encoderComputeUnits: MLComputeUnits = .cpuAndNeuralEngine\nprivate var decoderComputeUnits: MLComputeUnits = .cpuAndNeuralEngine\nlet config = WhisperKitConfig(\n  model: transcriptionModel,\n  computeOptions: getComputeOptions(),\n  verbose: true,\n  logLevel: .info,\n  prewarm: true,\n  load: true\n)\nwhisperKit = try await WhisperKit(config)\n```","https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fissues\u002F7",{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},25233,"使用 VAD（语音活动检测）分块策略时，单词时间戳不正确或出现偏移怎么办？","在使用 `chunkingStrategy: .vad` 和 `wordTimestamps: true` 时，曾出现过单词时间戳偏移（如出现错误的 `[ Silence ]` 标记或时间戳跳跃）的问题。该问题已在 **v0.8.0** 版本中修复。\n如果您遇到此类问题，请执行以下操作：\n1. 确保将 WhisperKit 升级到最新版本（v0.8.0 或更高）。\n2. 如果升级后问题仍然存在，请检查是否使用了特定的音频文件触发边缘情况，并向项目方提供复现样本。\n\n修复前的错误输出示例：\n```\nscoop, 39.96, 40.36\n[ Silence ], 70.36, 70.36  \u003C-- 错误的时间戳\nNow,, 46.44, 46.5         \u003C-- 时间倒流\n```","https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fissues\u002F160",{"id":149,"question_zh":150,"answer_zh":151,"source_url":152},25234,"如何在 iPhone 11 Pro 等较旧的 iOS 设备上运行示例应用？","在 iPhone 11 Pro 等较旧设备上运行示例应用失败，通常是因为模型过大导致内存不足或计算超时。虽然同样的模型在 M1 Max Mac 上运行良好，但移动设备资源受限。\n建议解决方案：\n1. **使用更小的模型**：避免在旧设备上使用 `large` 或 `turbo` 模型，改用 `tiny`、`base` 或 `small` 模型。\n2. **检查调试日志**：观察日志中的 `DECODER INPUTS DEBUG` 部分，如果看到数值异常或进程直接终止，通常是 OOM（内存溢出）导致的崩溃。\n3. 
**优化配置**：确保在初始化时正确设置了针对移动设备的计算选项，尽量利用 Neural Engine (ANE) 来分担 GPU 压力，但在内存极度紧张时，可能需要回退到纯 CPU 模式以防崩溃。","https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fissues\u002F10",{"id":154,"question_zh":155,"answer_zh":156,"source_url":137},25235,"如何平衡 GPU 和 ANE（神经引擎）的使用以获得最佳性能？","选择 GPU 还是 ANE 取决于您的具体应用场景和设备类型：\n- **追求极致速度（延迟最低）**：在插电且散热良好的设备（如配备风扇的 M 系列 Mac）上，最大化使用 **GPU** 通常能提供更高的推理速度，但功耗和发热量较大。\n- **追求能效比（省电\u002F低温）**：对于移动设备（iPhone\u002FiPad）或需要后台运行的场景，使用 **ANE** 是更好的选择，它在功耗和发热控制上表现优异。\n- **混合负载场景**：如果您的应用同时运行视频处理等其他模型，建议将 WhisperKit 调度到 **ANE**，从而将宝贵的 **GPU** 资源留给视频编码\u002F解码或其他图形任务，实现并发最大化。\n\n您可以通过自定义 `ModelComputeOptions` 来指定编码器（encoder）和解码器（decoder）分别使用的计算单元（`.cpuOnly`, `.gpuOnly`, `.aneOnly`, 或组合模式）。",[158,163,168,173,178,183,188,193,198,203,208,213,218,223,228,233,238,243,248,253],{"id":159,"version":160,"summary_zh":161,"released_at":162},154627,"v0.14.1","此补丁版本是升级到 swift-transformers >1.0.0 和 Swift 6 并发机制的初步尝试。它还包含了一些针对测试的体验优化修复，并为 `TranscribeTask` 增加了可继承性，使其更具灵活性。\n\n## 变更内容\n* 在 CI 脚本中为单元测试使用默认检出设置，由 @naykutguven 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F363 中实现。\n* 为公共结构体添加 `Sendable` 协议一致性，由 @naykutguven 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F362 中实现。\n* 将 watchOS 和 visionOS 的最低部署目标移至 Package 清单文件中进行配置，由 @naykutguven 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F360 中完成。\n* 为新的 swift-transformers 做准备，并添加 `TranscribeTask` 钩子方法，由 @ZachNagengast 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F367 中实现。\n\n## 新贡献者\n* @naykutguven 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F363 中完成了首次贡献。\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.14.0...v0.14.1","2025-10-17T00:54:04",{"id":164,"version":165,"summary_zh":166,"released_at":167},154633,"v0.10.2","Small patch to support Xcode 15 (#292) and minor improvements to the regression test pipeline.\r\n\r\n## What's Changed\r\n* Use canImport for MLTensor checks by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F288\r\n* Add repo option to regression test matrix by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F293\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.10.1...v0.10.2","2025-01-16T06:28:30",{"id":169,"version":170,"summary_zh":171,"released_at":172},154634,"v0.10.1","Small patch for building on older macOS versions. Also includes a fix for early stopping callback logic that had regressed from 0.9.4.\r\n\r\n## What's Changed\r\n* Patch for \u003CmacOS 15 build systems by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F283\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.10.0...v0.10.1","2024-12-21T05:48:53",{"id":174,"version":175,"summary_zh":176,"released_at":177},154635,"v0.10.0","## Highlights\r\n\r\nThis release provides support for protocol-defined model inputs and output types, supporting full MLX or MLTensor pipelines without the need to convert to MLMultiArrays between encoder\u002Fdecoder stages. 
For example, instead of\r\n\r\n```swift\r\nfunc encodeFeatures(_ features: MLMultiArray) async throws -> MLMultiArray?\r\n```\r\nyou can now define the types by protocol:\r\n\r\n```swift\r\nfunc encodeFeatures(_ features: any FeatureExtractorOutputType) async throws -> (any AudioEncoderOutputType)?\r\n```\r\n\r\nwhere the types are defined as so:\r\n```swift\r\npublic protocol FeatureExtractorOutputType {}\r\nextension MLMultiArray: FeatureExtractorOutputType {}\r\npublic protocol AudioEncoderOutputType {}\r\nextension MLMultiArray: AudioEncoderOutputType {}\r\n```\r\nor for a type that is a struct:\r\n```swift\r\npublic struct TextDecoderMLMultiArrayOutputType: TextDecoderOutputType {\r\n    public var logits: MLMultiArray?\r\n    public var cache: DecodingCache?\r\n}\r\n```\r\nso the entire structure can be handled by any model that conforms to the protocol, adding more flexibility for passing different data types between models, and thus reducing the amount of conversion steps vs. previous where it was assumed to be all MLMultiArrays.\r\n\r\nWe've made a start in using different inference types by using the new [`MLTensor`](https:\u002F\u002Fdeveloper.apple.com\u002Fdocumentation\u002Fcoreml\u002Fmltensor) for token sampling on devices that have the latest OS support, which resulted in a 2x speedup for that operation. Future work will shift the entire pipeline to using these.\r\n\r\nThere are also some important fixes included:\r\n- Timestamp rules are now enabled when the `withoutTimestamps` decoding option is set to false, increasing parity with OpenAI's python implementation. This will significantly increase the amount of timestamps returned during decoding and shorten the average length of individual segments overall. \r\n\t- Previous: `\u003C|0.00|> So in college, I was a government major,\u003C|4.92|>\u003C|4.94|> which means I had to write a lot of papers.\u003C|7.38|>`\r\n\t- Now: `\u003C|0.00|> So in college,\u003C|2.00|>\u003C|3.36|> I was a government major,\u003C|4.88|>\u003C|4.90|> which means I had to write a lot of papers.\u003C|7.36|>`\r\n- Early stopping via callback (a way to stop the decoding loop early if repetition is detected) has been converted to use an actor to fix some concurrency issues noted by the community.\r\n- CI script now uploads failure results to github for better visibility.\r\n\r\n### ⚠️ Breaking changes\r\n- Changing the protocol may result in some unexpected behavior if you are using a custom implementation, please raise an [issue](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fissues\u002Fnew) if you notice anything.\r\n- `WhisperKit.sampleRate` has been moved to `Constants.defaultWindowSamples`\r\n\r\nFinally, there were some great open-source contributions listed below, with a broad range of improvements to the library. 
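To make the protocol-based I/O above concrete, here is a hedged sketch of a custom feature container conforming to the marker protocols shown in this release; the `CustomAudioFeatures` type and its pass-through stub are illustrative and not part of WhisperKit.

```swift
import WhisperKit

// Illustrative container for encoder features; not a WhisperKit type.
struct CustomAudioFeatures {
    var values: [Float]
}

// Conforming to the marker protocols lets this type flow between pipeline
// stages without converting to MLMultiArray first.
extension CustomAudioFeatures: FeatureExtractorOutputType {}
extension CustomAudioFeatures: AudioEncoderOutputType {}

// Sketch of an encoder stage written against the protocol-typed signature above.
func encodeFeatures(_ features: any FeatureExtractorOutputType) async throws -> (any AudioEncoderOutputType)? {
    guard let custom = features as? CustomAudioFeatures else { return nil }
    // A real encoder would run inference here; this stub passes the features through.
    return custom
}
```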
Huge thanks to all the contributors 🙏 \r\n\r\n## What's Changed\r\n* Fix audio processing edge case by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F237\r\n* Add public callbacks to help expose internal state a little more by @iandundas in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F240\r\n* Freeze loglevel enum by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F255\r\n* Update WhisperAX app icon for macOS to align with Apple HIG standards by @Stv-X in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F257\r\n* Add ability to prevent config.json being written to `~\u002FDocuments\u002Fhuggingface\u002F...` by @iandundas in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F262\r\n* Typo in Model Descriptions by @rk-helper in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F269\r\n* Audio: Fix taking a suffix of negative length from a collection by @mattisssa in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F278\r\n\r\n## New Contributors\r\n* @Stv-X made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F257\r\n* @rk-helper made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F269\r\n* @mattisssa made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F278\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.9.4...v0.10.0","2024-12-20T00:17:17",{"id":179,"version":180,"summary_zh":181,"released_at":182},154636,"v0.9.4","Minor patch to open up access to the logging callback and freeze the enum for `LogLevel`\r\n\r\n### Usage:\r\n```swift\r\nLogging.shared.loggingCallback = { message in\r\n    print(\"WhisperKit logs: \", message)\r\n}\r\n```\r\n\r\n## What's Changed\r\n* Freeze loglevel enum by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F255\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.9.3...v0.9.4","2024-11-07T01:51:16",{"id":184,"version":185,"summary_zh":186,"released_at":187},154637,"v0.9.3","This release adds a number of useful callbacks that you can receive updates from while the transcription is processing:\r\n\r\n```swift\r\n\u002F\u002F\u002F A callback that provides transcription segments as they are discovered.\r\n\u002F\u002F\u002F - Parameters:\r\n\u002F\u002F\u002F   - segments: An array of `TranscriptionSegment` objects representing the transcribed segments\r\npublic typealias SegmentDiscoveryCallback = (_ segments: [TranscriptionSegment]) -> Void\r\n\r\n\u002F\u002F\u002F A callback that reports changes in the model's state.\r\n\u002F\u002F\u002F - Parameters:\r\n\u002F\u002F\u002F   - oldState: The previous state of the model, if any\r\n\u002F\u002F\u002F   - newState: The current state of the model\r\npublic typealias ModelStateCallback = (_ oldState: ModelState?, _ newState: ModelState) -> Void\r\n\r\n\u002F\u002F\u002F A callback that reports changes in the transcription process.\r\n\u002F\u002F\u002F - Parameter state: The current `TranscriptionState` of the transcription process\r\npublic typealias TranscriptionStateCallback = (_ state: TranscriptionState) -> Void\r\n```\r\n\r\nThanks you @iandundas for the excellent 
contribution! ✨\r\n\r\n## What's Changed\r\n* Add public callbacks to help expose internal state a little more by @iandundas in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F240\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.9.2...v0.9.3","2024-11-05T18:01:25",{"id":189,"version":190,"summary_zh":191,"released_at":192},154638,"v0.9.2","## Highlights\r\n\r\nWith this release we are launching a comprehensive suite of benchmarks that you can run yourself on your own devices - or view the results that we've run on a wide variety of devices via our [WhisperKit Benchmarks](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fargmaxinc\u002Fwhisperkit-benchmarks) HuggingFace space! This was a huge effort kicked off by @Abhinay1997 so we're very excited to bring it to main. Read more in the [discussion here](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fdiscussions\u002F243) and let us know what you think!\r\n\r\nAlong with this, there are also several bug fixes and improvements included in this release based on recent reported issues, see below for the relevant PRs.\r\n\r\n## What's Changed\r\n* Fix expo release script by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F220\r\n* Fix progress for vad by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F223\r\n* Regression Test Pipeline by @Abhinay1997 in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F120\r\n* Update xcconfig tracking and provisioning by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F234\r\n* Fix audio processing edge case by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F237\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.9.0...v0.9.2","2024-11-02T20:20:56",{"id":194,"version":195,"summary_zh":196,"released_at":197},154639,"v0.9.0","\r\n\r\n## Highlights\r\n\r\n### Package Updates\r\n\r\nWith https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F216 the default for checking whether a model is supported on the device uses the [model repo](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fwhisperkit-coreml) [config.json](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fwhisperkit-coreml\u002Fblob\u002Fmain\u002Fconfig.json) as a source of truth. The need for this came about with the release of the new [large-v3 turbo model](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fwhisper-large-v3-turbo), which is listed in the model repo as [openai_whisper-large-v3-v20240930](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fwhisperkit-coreml\u002Ftree\u002Fmain\u002Fopenai_whisper-large-v3-v20240930), which was recommended for devices that would crash if attempting to load. 
This situation can now be mitigated by updating this config.json without the need for a new release, and the recommended model can be fetched directly with the new static method `recommendedRemoteModels`:\r\n\r\n```swift\r\nlet recommendedModel = await WhisperKit.recommendedRemoteModels().default\r\nlet pipe = try? await WhisperKit(model: recommendedModel)\r\n```\r\n\r\nThe existing interface for `WhisperKit.recommendedModels()` remains the same, but now returns a `ModelSupport` object with a list of supported models for the current device.\r\n\r\n```swift\r\npublic struct ModelSupport: Codable, Equatable {\r\n    public let `default`: String\r\n    public let supported: [String]\r\n    public var disabled: [String] = []\r\n}\r\n```\r\n\r\nAlso, in an ongoing effort to improve modularity, extensibility, and code structure, there is a new way to initialize WhisperKit: using the new `WhisperKitConfig` class. The parameters are exactly the same and the previous init method is still in place, but this can assist in defining WhisperKit settings and protocol objects ahead of time and initializing WhisperKit more cleanly:\r\n\r\nPrevious:\r\n```swift\r\nlet pipe = try? await WhisperKit(model: \"your-custom-model\", modelRepo: \"username\u002Fyour-model-repo\")\r\n```\r\n\r\nNew:\r\n```swift\r\nlet config = WhisperKitConfig(model: \"your-custom-model\", modelRepo: \"username\u002Fyour-model-repo\") \u002F\u002F Initialize config\r\nconfig.model = \"your-custom-model\" \u002F\u002F Alternatively set parameters directly\r\nlet pipe = try? await WhisperKit(config) \u002F\u002F Pass into WhisperKit initializer\r\n```\r\n\r\n\r\n### WhisperAX example app and CLI\r\n\r\nThanks to some memory and audio processing optimizations in #195, #216, and #217 (shout out to @keleftheriou for finding a big improvement there), we've updated the example implementations to use VAD by default with a `concurrentWorkerCount` of 4. This will significantly improve default inference speed on long files for devices that support async prediction, as well as real-time streaming for device\u002Fmodel combinations that are greater than 1 real-time factor.\r\n\r\n### ⚠️ Deprecations and changed interfaces\r\n\r\n- The extension on `Process.processor` is now `ProcessInfo.processor` and includes a new property `ProcessInfo.hwModel` which will return a string similar to `uname(&utsname)` for non-Macs.\r\n- `public func modelSupport(for deviceName: String) -> (default: String, disabled: [String])` is now a disfavored overload in preference of `public func modelSupport(for deviceName: String, from config: ModelSupportConfig? 
= nil) -> ModelSupport`\r\n\r\n\r\n\r\n\r\n## What's Changed\r\n* Make additional initializers, functions, members public for extensibility by @bpkeene in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F192\r\n* Fix start time logic for file loading  by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F195\r\n* Change `static var` stored properties to `static let` by @fumoboy007 in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F190\r\n* Add VoiceActivityDetector base class by @a2they in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F199\r\n* Set default concurrentWorkerCount  by @atiorh in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F205\r\n* Improving modularity and code structure by @a2they in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F212\r\n* Add model support config fetching from model repo by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F216\r\n* Example app VAD default + memory reduction by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F217\r\n\r\n## New Contributors\r\n* @bpkeene made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F192\r\n* @fumoboy007 made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F190\r\n* @a2they made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F199\r\n* @atiorh made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F205\r\n* @1amageek made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F216\r\n* @keleftheriou made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F217\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.8.0...v0.9.0","2024-10-09T02:10:17",{"id":199,"version":200,"summary_zh":201,"released_at":202},154640,"v0.8.0","With this release, we had a huge focus on reliability in terms of memory usage (especially for large files), common crashes, and various correctness errors that the community has reported in [issues.](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fissues) \r\n\r\n## Highlights\r\n\r\n- **Memory-efficient Handling of Large Files:** WhisperKit is much more memory-efficient for large files with some improvements to #158 by @finnvoor. This change speeds up the audio resampling significantly and removes a few other unnecessary data copies. It also fixes a buffer misalignment issue that caused #183 . For more aggressive memory savings, the default audio file chunking size can be configured through [maxReadFrameSize](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fblob\u002Fv0.8.0\u002FSources\u002FWhisperKit\u002FCore\u002FAudioProcessor.swift#L275). Here is the memory chart for a ~200 MB compressed audio file from #174, showing up to **3x faster resampling** with **50% less memory**. 
Note that WhisperKit requires uncompressed Float values for the MLModel input, so the compressed file becomes roughly ~1 GB minimum after read and resample to 16khz 1 channel.\r\n\r\n| Before | After |\r\n|--------|--------|\r\n| ![before](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fad74c37a-0599-4fb3-a24a-47b395dcff36) | ![after](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F44822337-992c-4c89-b8ff-6280e23a17b0) | \r\n  \r\n- **Progress Bar:** @finnvoor also contributed a fix to the progress when in VAD chunking mode. WhisperAX now shows an indicator while the file is being resampled and the overall progress of the decoding. Note that this is not an exactly linear progress bar because it is based on how many windows have completed decoding, so it will speed up toward the end of the process as more windows complete.\r\n![v0 8 0_progress](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F59141e4f-6fa2-4ff7-82d4-b7f7dcb7f34c)\r\n\r\n- **Various other improvements:** We also did a pass on our current issues and resolved many of them, if you have one pending please test out this version to verify they are fixed. Thanks again to everyone that contributes to these issues, it helps immensely to make WhisperKit better for everyone 🚀.\r\n\r\n## What's Changed\r\n* Remove purported OGG support from CLI by @iandundas in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F153\r\n* Resample audio files in 10mb chunks by @finnvoor in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F158\r\n* feat: add version output by @chenrui333 in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F148\r\n* Fix TEST_HOST name mismatch by @CongLeSolutionX in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F177\r\n* feat: copy text with eager decoding, add keyboard shortcut by @iGerman00 in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F178\r\n* Fix progress when using VAD chunking by @finnvoor in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F179\r\n* Fix indeterminate tests by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F180\r\n* Fix resampling large files by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F183\r\n\r\n## New Contributors\r\n* @iandundas made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F153\r\n* @chenrui333 made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F148\r\n* @CongLeSolutionX made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F177\r\n* @iGerman00 made their first contribution in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F178\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.7.2...v0.8.0","2024-07-12T18:50:16",{"id":204,"version":205,"summary_zh":206,"released_at":207},154641,"v0.7.2","Early stopping now keeps track of the chunked window internally when running async transcription via the VAD chunking method. 
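As a hedged illustration of such a custom stopping criterion, the sketch below assumes the `transcribe` callback receives a `TranscriptionProgress` and that returning `false` stops decoding for the current window while `nil` lets it continue; the model name, file path, and threshold are placeholders.

```swift
import WhisperKit

// Sketch: stop decoding a VAD-chunked window early based on a custom criterion.
let pipe = try await WhisperKit(WhisperKitConfig(model: "base")) // placeholder model

let results = try await pipe.transcribe(
    audioPath: "audio.wav", // placeholder path
    decodeOptions: DecodingOptions(chunkingStrategy: .vad),
    callback: { (progress: TranscriptionProgress) -> Bool? in
        // Illustrative criterion: cut the current window off once it has produced a lot of text.
        if progress.text.count > 500 {
            return false // stop this window early
        }
        return nil // keep decoding
    }
)
```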
This will give further control for stopping specific windows based on your custom criteria in the `TranscriptionCallback`.\r\n\r\n## What's Changed\r\n* Fix early stopping for VAD by @ZachNagengast in https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F155\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.7.1...v0.7.2","2024-05-30T13:11:16",{"id":209,"version":210,"summary_zh":211,"released_at":212},154642,"v0.7.1","Hotfix for `shouldEarlyStop` logic\r\n\r\n## What's Changed\r\n- Ensures the early stopping flag on TextDecoder is always reset at the beginning of a new loop\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.7.0...v0.7.1","2024-05-25T22:55:04",{"id":214,"version":215,"summary_zh":216,"released_at":217},154623,"v0.18.0","## 亮点\n\n本次发布围绕一个新的共享 `ModelManager` 基类重构了 SpeakerKit，统一了各 Kit 的下载 → 加载 → 卸载生命周期，并简化了 SpeakerKit 的公共 API。\n\n现在，顶级入口点只需：\n\n```swift\nlet speakerKit = try await SpeakerKit()\n```\n\n在默认情况下无需配置对象。\n\n## 架构变更\n\n### `ModelManager`（新增，`ArgmaxCore`）\n\n一个可重用的基类，用于管理完整的模型生命周期，所有 Kit 现在都可以继承它。它通过内部的 `LoadModelsCoordinator` 处理状态转换、错误恢复以及并发加载的合并——多个并发调用 `ensureModelsLoaded()` 会合并为单个正在进行的任务，而不是相互竞争。\n\n后端特定的 I\u002FO 操作被委托给一个新的 `ModelLoader` 协议：\n\n```swift\npublic protocol ModelLoader: AnyObject, Sendable {\n    var modelFolder: String? { get }\n    func resolveModels(downloader: ModelDownloader, progressCallback: ((Progress) -> Void)?) async throws -> String\n    func load(from modelPath: String, prewarm: Bool) async throws\n    func unload() async\n}\n```\n\n### `SpeakerKitDiarizer`（取代 `SpeakerKitModelManager`）\n\n`SpeakerKitModelManager` 已被 `SpeakerKitDiarizer` 取代，后者继承自 `ModelManager` 并遵循新的 `Diarizer` 协议。可通过静态工厂方法创建：\n\n```swift\nlet diarizer = SpeakerKitDiarizer.pyannote(config: config)\n```\n\n`SpeakerKitModelManager` 仍然作为一个已弃用的类型别名指向 `SpeakerKitDiarizer`，因此现有代码在编译时会显示警告，而不会直接报错。\n\n### `Diarizer` 协议（新增）\n\n一个用于接入说话人分离后端的简洁协议：\n\n```swift\npublic protocol Diarizer: Sendable {\n    var modelState: ModelState { get }\n    func downloadModels() async throws\n    func loadModels() async throws\n    func unloadModels() async\n    func diarize(audioArray: [Float], options: (any DiarizationOptions)?, progressCallback: ...) 
async throws -> DiarizationResult\n}\n```\n\n### `ModelDownloadConfig`（新增，`ArgmaxCore`）\n\n所有下载参数（端点、仓库、令牌、修订版本、后台会话标志）现在都封装在一个 `ModelDownloadConfig` 结构体中，而不是单独传递给 `ModelDownloader`。现有的便捷初始化方法仍保持不变。\n\n### `SpeakerKitConfig`（新基类）\n\n`PyannoteConfig` 现在继承自 `SpeakerKitConfig`，后者增加了一个 `load: Bool` 标志。当其值为 `false`（默认值）时，模型会在首次调用 `diarize()` 时延迟加载。若设置为 `true`，则会在 `SpeakerKit` 初始化时立即加载模型。\n\n## API 变更\n\n### `SpeakerKit` 初始化简化\n\n```swift\n\u002F\u002F 之前\nlet config = PyannoteConfig()\nlet speakerKit = try await SpeakerKit(config)\n\n\u002F\u002F 之后\nlet speakerKit = try await SpeakerKit()\n```\n\n本地模型路径现在使用 `String` 而不是 `URL`：\n\n```swift\n\u002F\u002F 之前\nlet config = PyannoteConfig(modelFolder: URL(filePath: \"\u002Fpath\u002Fto\u002Fmodels\"))\n","2026-04-01T03:25:18",{"id":219,"version":220,"summary_zh":221,"released_at":222},154624,"v0.17.0","## 亮点\n\n我们很高兴将我们的商用级说话人日志框架 [SpeakerKit](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F440) 开源！\n\n随着 [NVIDIA Sortformer](https:\u002F\u002Fwww.argmaxinc.com\u002Fblog\u002Fargmax-sdk-2) 现在为 Argmax Pro SDK 中的实时说话人日志提供支持，我们也将 [Pyannote 4 (community-1)](https:\u002F\u002Fhuggingface.co\u002Fargmaxinc\u002Fspeakerkit-coreml) 的实现开源。Pyannote 因其在解决“谁在何时说话”问题上的出色表现而广为人知，并在 AMI、DIHARD 和 VoxConverse 等数据集上取得了优异的成绩。有关架构细节和基准测试，请参阅 [博客文章](https:\u002F\u002Fwww.argmaxinc.com\u002Fblog\u002Fspeakerkit)。\n\n## 快速入门\n\n只需几行代码即可下载、加载、进行说话人日志分析，并生成 RTTM（富转录时间标记）输出：\n```swift\nimport SpeakerKit\n\nlet speakerKit = try await SpeakerKit()\nlet result = try await speakerKit.diarize(audioPath: \"audio.wav\")\nlet rttm = speakerKit.generateRTTM(result: result)\n```\n\n## 核心功能\n\n- 端到端的 Pyannote 风格说话人日志处理流程\n- 自动估计说话人数量或手动设置\n- 可将说话人信息添加到 WhisperKit 输出中的实用工具\n- 标准 RTTM 导出\n\n请浏览新的 [SpeakerKit README 部分](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit#speakerkit)，获取 API 文档、配置详情以及优化建议。\n\n## 命令行界面\n\n`whisperkit-cli` 现在包含一个专门的 `diarize` 子命令：\n```bash\nswift run -c release whisperkit-cli diarize --audio-path audio.wav --rttm-path output.rttm\n```\n\n通过 Homebrew 安装：\n```bash\nbrew install whisperkit-cli\nwhisperkit-cli diarize --audio-path audio.wav --rttm-path output.rttm\n```\n\n您还可以使用新的 `--diarization` 标志同时运行转录和说话人日志分析：\n```bash\nwhisperkit-cli transcribe --audio-path audio.wav --diarization\n```\n\n示例输出：\n```\n---- 说话人日志结果 ----\nSPEAKER audio 1 0.220 7.360 什么是 RLHF？即基于人类反馈的强化学习。那道菜中究竟是什么神奇的配料，让它变得如此美味呢？ \u003CNA> A \u003CNA> \u003CNA>\nSPEAKER audio 1 7.610 14.850 - 所以我们在大量文本数据上训练这些模型。在这个过程中，它们会学习到关于其中内容底层表示的一些知识。 \u003CNA> B \u003CNA> \u003CNA>\n```\n\n此外，还提供了用于调整说话人数目、模型变体、聚类算法等的附加标志。\n\n## WhisperAX 示例更新\n\n[WhisperAX 示例应用](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Ftree\u002Fmain\u002FExamples\u002FWhisperAX)现已更新，支持 SpeakerKit。它现在包含说话人日志开关、灵活的管道选择器以及用于浏览已标注片段的“说话人”选项卡。\n\n## 变更内容\n* 添加了由 @a2they 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F440 中实现的、支持 Pyannote 说话人日志的 SpeakerKit\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.16.0...v0.17.0","2026-03-13T15:38:51",{"id":224,"version":225,"summary_zh":226,"released_at":227},154625,"v0.16.0","## 亮点\n\n本次发布引入了 [TTSKit](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F425)——一个全新的可选库，利用最新的 Core ML 功能（如 `MLState` 和 `MLTensor`），结合 Apple Neural Engine 实现最优推理，从而在设备端提供高质量的文本转语音功能。\n\n在首次发布中，我们推出了 [Qwen3-TTS CustomVoice](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-TTS-12Hz-0.6B-CustomVoice) 模型中的 
[0.6b](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-TTS-12Hz-0.6B-CustomVoice) 版本，以及支持指令控制的 [1.7b](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-TTS-12Hz-1.7B-CustomVoice) 版本。未来版本还将带来更多功能（包括语音克隆）。\n\n只需三行代码即可完成下载、加载、生成并播放音频流：\n\n```swift\nimport TTSKit\n\nlet ttsKit = try await TTSKit()\ntry await ttsKit.play(text: \"Hello from TTSKit!\")\n```\n\n### 核心特性\n- 实时自适应流式传输  \n  - 在生成音频的同时即开始播放，实现从输入文本到首个音频缓冲区输出的最短延迟。  \n  - `.auto` 模式会根据设备的推理速度自动调整，确保流畅稳定的播放体验。\n- 内置9种预设音色\n- 支持10种语言\n- 仅限1.7B模型支持风格指令\n- 针对长文本输入的自动分块处理\n- 支持以 wav\u002Fm4a 格式导出音频文件，并可选择添加元数据。\n- 模块化协议驱动架构（包含6个可替换的 Core ML 组件），便于定制和未来模型的接入。\n\n完整 API 文档、模型选择及高级用法，请参阅 [README.md](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit#ttskit) 中的新版 TTSKit 章节。\n\n#### 命令行工具\n\n您可以通过以下命令快速体验：\n```bash\nswift run -c release whisperkit-cli tts --text \"Hello from TTSKit\" --play\n```\n\n此外，发布后也将通过 Homebrew 提供安装：\n```bash\nbrew install whisperkit-cli\nwhisperkit-cli tts --text \"Hello from TTSKit\" --play\n```\n\n该工具提供了对音色、语言、模型变体、风格、温度、分块策略、计算单元、可复现性种子等参数的全面控制。\n\n#### 示例应用\n\n除了命令行工具外，我们还发布了一个新的示例应用 [TTSKitExample](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Ftree\u002Fmain\u002FExamples\u002FTTS\u002FTTSKitExample)，供开发者在将 TTSKit 集成到自己的应用时参考。该应用具备实时波形可视化、模型管理、带元数据的持久化音频文件历史记录，以及跨平台支持等功能。以下是截图：\n\u003Cimg width=\"1212\" height=\"812\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb8b9785d-ffb0-498a-ae1e-1dcecd18fd18\" \u002F>\n\n更多关于如何运行此应用的信息，请参阅示例应用的 [README.md](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fblob\u002Fmain\u002FExamples\u002FTTS\u002FTTSKitExample\u002FREADME.md)。\n\n### 架构变更\n- 新增共享的 `ArgmaxCore` 目标，用于存放通用工具库。\n- TTSKit 作为可选组件与现有 Swift 包一同发布（未对现有内容造成破坏性更改）。","2026-03-03T02:49:22",{"id":229,"version":230,"summary_zh":231,"released_at":232},154626,"v0.15.0","本次小版本更新将 `swift-transformers` 依赖升级至 `1.1.2`，并将 `TranscriptionResult` 从结构体提升为开放类，以便高级客户端能够重写其行为。\n\n### `TranscriptionResult` API 变更\n由于它已从 `struct` 改为 `class`，如果您之前依赖于旧有的值语义，现在复制操作只会传递相同的引用，因此对实例的修改将会被共享。请检查所有假设存在独立副本的代码（例如数组、闭包捕获值等），并在需要隔离时初始化一个新的 `TranscriptionResult` 实例。如果您对该类进行子类化，请使用相同的锁机制（`TranscriptionPropertyLock`）保护任何新的存储属性，以确保线程安全。\n\n## 变更内容\n* 由 @ZachNagengast 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F375 中完成的 swift-transformers 依赖及 CI 镜像升级\n* 由 @chen-argmax 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F376 中完成的 `TranscriptionResult` API 更新\n\n## 新贡献者\n* @chen-argmax 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F376 中完成了首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.14.1...v0.15.0","2025-11-07T21:42:26",{"id":234,"version":235,"summary_zh":236,"released_at":237},154628,"v0.14.0","## 亮点\n\n本次发布引入了 [WhisperKit 本地服务器](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F348)！这是一个兼容 OpenAI 的、基于 [Vapor](https:\u002F\u002Fgithub.com\u002Fvapor\u002Fvapor) 的 HTTP 服务器，既可以通过命令行运行，也可以作为子进程启动。\n\n您可以通过以下命令试用：\n```swift\nBUILD_ALL=1 swift run whisperkit-cli serve\n```\n\n### 核心特性\n\n- **本地服务器**：兼容 OpenAI 的 `\u002Fv1\u002Faudio\u002Ftranscriptions` 和 `\u002Fv1\u002Faudio\u002Ftranslations` 端点\n- **转录流式传输**：在文件转录过程中通过服务器发送事件（SSE）实时推送进度\n- **响应格式**：支持 `json` 和 `verbose_json`\n- **时间戳粒度**：提供词级和片段级时间戳\n- **客户端示例**：包括 Python、Swift 和使用 curl 的 Bash 示例\n\n此外，您还可以使用 Makefile 命令 `make build-local-server` 来生成一个可执行文件，该文件可以打包到您的 Electron 或 Tauri 应用中，而无需任何原生集成。\n\n本次更新还修复了多个与分词器加载和 VAD 
访问相关的问题，并提升了整体代码质量。\n\n## 变更内容\n* 添加 WhisperKit 本地服务器，支持音频转录和翻译 API，由 @a2they 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F348 中实现\n* 修复分词器和标点符号相关问题，并改进远程配置的处理，由 @ZachNagengast 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F350 中完成\n* 更新 README.md，加入 Argmax SDK 相关内容，由 @atiorh 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F346 中完成\n* 将 EnergyVAD 公开，由 @finnvoor 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F347 中实现\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.13.1...v0.14.0","2025-09-20T21:14:08",{"id":239,"version":240,"summary_zh":241,"released_at":242},154629,"v0.13.1","补丁版本，修复了与分词器加载及基于配置的 logits 过滤器相关的一些问题。\n\n- 分词器下载现在会遵循已指定的 downloadBase 路径（若有设置）。感谢 @Kavi-Gupta 的建议，详见：https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fissues\u002F339\n- 如果模型文件夹路径中存在分词器文件，CLI 现在将能够在离线模式下加载分词器。感谢 @cedricporter 的报告，详见：https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fissues\u002F340\n- 曾观察到，若在配置中定义了 logits 过滤器，它们实际上并不会传递给文本解码器。此补丁通过在 WhisperKit 初始化时将这些过滤器传递给文本解码器，确保其生效。\n\n此外，还包含了由 @JimLiu 贡献的改进日志记录功能。感谢各位让 WhisperKit 不断完善！🚀\n\n## 变更内容\n* 功能：增强 WhisperKit CLI 的详细日志记录，由 @JimLiu 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F335 中实现\n* 将 logits 过滤器传递至文本解码器，并改进分词器加载机制，由 @ZachNagengast 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F343 中实现\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.13.0...v0.13.1","2025-07-31T18:14:38",{"id":244,"version":245,"summary_zh":246,"released_at":247},154630,"v0.13.0","### 新增 API\n- **异步 VAD 支持**：`VoiceActivityDetector` 中的 `voiceActivityAsync(in:)` 方法\n- **片段发现回调**：`transcribe()` 方法现可接受 `SegmentDiscoveryCallback`，以便在转录过程中接收可排序的片段，并提供精确的定位值\n\n### ⚠️ 已弃用函数 → 工具类\n现有代码仍可正常运行，但会显示弃用警告。\n\n```swift\n\u002F\u002F 旧 → 新\ncompressionRatio(of:) → TextUtilities.compressionRatio(of:)\nformatSegments(_:withTimestamps:) → TranscriptionUtilities.formatSegments(_:withTimestamps:)\nloadTokenizer(for:tokenizerFolder:useBackgroundSession:) → ModelUtilities.loadTokenizer(for:tokenizerFolder:useBackgroundSession:)\nmodelSupport(for:from:) → ModelUtilities.modelSupport(for:from:)\ndetectModelURL(inFolder:named:) → ModelUtilities.detectModelURL(inFolder:named:)\nfindLongestCommonPrefix(_:_:) → TranscriptionUtilities.findLongestCommonPrefix(_:_:)\nmergeTranscriptionResults(_:confirmedWords:) → TranscriptionUtilities.mergeTranscriptionResults(_:confirmedWords:)\nresolveAbsolutePath(_:) → FileManager.resolveAbsolutePath(_:)\n```\n\n### 基于协议的解码器输入\n```swift\n\u002F\u002F 旧\nfunc decodeText(using decoderInputs: DecodingInputs) -> DecodingResult\n\n\u002F\u002F 新\nfunc decodeText(using decoderInputs: any DecodingInputsType) -> DecodingResult\n```\n\n## 变更内容\n* 修复了 `modelSupport` 中前缀与不同硬件芯片重叠的问题，由 @a2they 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F326 中完成。\n* 修复了生成报告路径时无法正确处理包含点号的文件名的问题，由 @JimLiu 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F333 中完成。\n* 在 VAD 分块过程中实现了可排序的片段发现功能，由 @ZachNagengast 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F334 中实现。\n* 重构并清理了工具类，为解码器输入添加了协议 `DecodingInputsType`，由 @a2they 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F338 中完成。\n\n## 新贡献者\n* @JimLiu 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F333 
中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.12.0...v0.13.0","2025-06-13T04:01:44",{"id":249,"version":250,"summary_zh":251,"released_at":252},154631,"v0.12.0","本次小版本更新引入了多声道音频合并功能，这是用户长期以来的强烈需求。现在，默认的音频处理代码路径会在检测到多个声道时始终[合并所有声道](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fblob\u002F11a1fabd8e6844d4879db3fe1a0dcc7d8a44533c\u002FSources\u002FWhisperKit\u002FCore\u002FAudio\u002FAudioProcessor.swift#L46)，而此前仅使用第 0 个声道。此外，您还可以在加载音频时通过配置选择特定的声道：\n\n```swift\nlet config = WhisperKitConfig(\n    ...\n    audioInputConfig: AudioInputConfig(channelMode: .sumChannels([1, 3, 5]))\n)\n```\n\n如果您的音频文件中不同声道分别对应不同的说话人，那么这种方法可以作为一种简化的说话人分离手段。\n\n摘自 #320：\n> 音频合并算法的工作原理如下：我们首先计算所有声道中的峰值，然后检查单声道（合并后）版本的峰值是否高于任何一个声道的峰值。如果高于，则对整个音频轨道进行缩放，使单声道的峰值与最响亮的那个声道的峰值一致。\n>\n> 例如：上方为单声道（合并后的）波形，下方为合并前的各个独立声道波形。\n![image](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F0c8b0eca-58cf-417e-8934-292a3d246da7)\n> 在这里可以看到，合并后的音频与原始的多声道音频文件保持了相同的响度，并且展示了所有声道合并后的总波形。\n\n此外，本次发布还针对自上次更新以来新推出的最新设备更新了 `recommendedModels` 函数，并改进了一些测试方法。\n\n## 变更内容\n* 多声道音频合并功能，由 @flashno 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F320 中实现\n* 根据平台设置 `concurrentWorkerCount` 的默认值，由 @ZachNagengast 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F321 中完成\n* 更新了 `fallbackModelSupportConfig`，增加了更多设备标识符，由 @iandundas 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F323 中完成\n* 使用远程模型进行测试，由 @ZachNagengast 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F324 中实现\n\n## 新贡献者\n* @flashno 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F320 中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.11.0...v0.12.0","2025-04-15T21:03:31",{"id":254,"version":255,"summary_zh":256,"released_at":257},154632,"v0.11.0","在本次发布中，您现在可以使用 [swift-transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fswift-transformers) 的较新版本，而无需通过 fork 来更改版本，即使该库被其他依赖项引入（例如 [mlx-swift-examples](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx-swift-examples\u002Fblob\u002Fmain\u002FPackage.swift)）。在我们能够将 Hub 和 Tokenizers 作为单独的依赖目标（而非整个库）之前，我们将继续默认使用 0.1.8 版本（目前仍在开发中）。\n\n此次发布还对词级时间戳功能进行了一些改进。此前，该功能并非总是能过滤掉特殊标记，并且与原始 OpenAI 实现相比略有偏差。此外，我们还添加了一些简单的启发式规则：如果单词的起始时间可以向后移动而不发生重叠，则会延长持续时间为 0 的词的时间戳。同时，“强制”预填充标记的方案也进行了调整，允许模型预测第一个时间戳，而不是始终从 0.00 开始（适用于音频在开头停顿后才开始的情况）。\n\n除此之外，整个项目还修复了若干 bug，并进行了多项提升开发体验的改进。欢迎您试用并反馈使用情况，可通过 [此处](https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fissues?q=sort%3Aupdated-desc+is%3Aissue+is%3Aopen) 或 [Discord](https:\u002F\u002Fdiscord.gg\u002FG5F5GZGecC) 告诉我们您的感受！🚀\n\n## 变更内容\n* 确保 resampleBuffer 中四舍五入后的容量不为零，由 @drewmccormack 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F295 中实现。\n* 改进 SegmentSeeker 的词对齐功能，由 @ZachNagengast 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F305 中实现。\n\n## 新贡献者\n* @drewmccormack 在 https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fpull\u002F295 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fargmaxinc\u002FWhisperKit\u002Fcompare\u002Fv0.10.2...v0.11.0","2025-02-22T02:45:54"]
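下面是一段带有假设成分的示意代码，将上文 v0.12.0 的 `AudioInputConfig(channelMode:)` 与 v0.11.0 改进的词级时间戳结合使用；其中模型名、音频路径与声道索引均为占位值，仅作演示。

```swift
import WhisperKit

// 示意：合并第 0、1 两个声道，并请求词级时间戳。
let config = WhisperKitConfig(
    model: "base", // 占位模型名
    audioInputConfig: AudioInputConfig(channelMode: .sumChannels([0, 1]))
)
let pipe = try await WhisperKit(config)

let results = try await pipe.transcribe(
    audioPath: "audio.wav", // 占位路径
    decodeOptions: DecodingOptions(wordTimestamps: true)
)
```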