[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-modelscope--FunASR":3,"tool-modelscope--FunASR":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159636,2,"2026-04-17T23:33:34",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
[//]: # '<div align="left"><img src="https://oss.gittoolsai.com/images/modelscope_FunASR_readme_a21ae322e484.jpg" width="400"/></div>'

([简体中文](./README_zh.md)|English)

[//]: # "# FunASR: A Fundamental End-to-End Speech Recognition Toolkit"

[![SVG Banners](https://oss.gittoolsai.com/images/modelscope_FunASR_readme_d998aae27036.png)](https://github.com/Akshay090/svg-banners)

[![PyPI](https://img.shields.io/pypi/v/funasr)](https://pypi.org/project/funasr/)

<p align="center">
<a href="https://trendshift.io/repositories/3839" target="_blank"><img src="https://oss.gittoolsai.com/images/modelscope_FunASR_readme_1145cd82417e.png" alt="modelscope%2FFunASR | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</p>

<strong>FunASR</strong> aims to build a bridge between academic research and industrial applications of speech recognition. By supporting the training and fine-tuning of industrial-grade speech recognition models, it lets researchers and developers conduct research and production of speech recognition models more conveniently, promoting the growth of the speech recognition ecosystem. ASR for Fun!
[**Highlights**](#highlights)
| [**News**](https://github.com/alibaba-damo-academy/FunASR#whats-new)
| [**Installation**](#installation)
| [**Quick Start**](#quick-start)
| [**Tutorial**](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/tutorial/README.md)
| [**Runtime**](./runtime/readme.md)
| [**Model Zoo**](#model-zoo)
| [**Contact**](#contact)

<a name="highlights"></a>

## Highlights

- FunASR is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), voice activity detection (VAD), punctuation restoration, language models, speaker verification, speaker diarization, and multi-talker ASR. FunASR provides convenient scripts and tutorials, supporting inference and fine-tuning of pre-trained models.
- We have released a vast collection of academic and industrial pretrained models on [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition) and [Hugging Face](https://huggingface.co/FunASR), accessible through our [Model Zoo](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/model_zoo/modelscope_models.md). The representative [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), a non-autoregressive end-to-end speech recognition model, offers high accuracy, high efficiency, and convenient deployment, supporting the rapid construction of speech recognition services. For more details on service deployment, please refer to the [service deployment document](runtime/readme_cn.md).

<a name="whats-new"></a>

## What's New

- 2025/12/15: [Fun-ASR-Nano-2512](https://github.com/FunAudioLLM/Fun-ASR) is an end-to-end speech recognition large model trained on tens of millions of hours of real speech data. It supports low-latency real-time transcription and covers 31 languages.
- 2024/10/29: Real-time Transcription Service 1.12 released; the 2pass-offline mode now supports the SenseVoiceSmall model ([docs](runtime/readme.md)).
- 2024/10/10: Added support for the Whisper-large-v3-turbo model, a multitask model that performs multilingual speech recognition, speech translation, and language identification. It can be downloaded via the [ModelScope](examples/industrial_data_pretraining/whisper/demo.py) and [OpenAI](examples/industrial_data_pretraining/whisper/demo_from_openai.py) demos.
- 2024/09/26: Offline File Transcription Service 4.6, Offline File Transcription Service of English 1.7, and Real-time Transcription Service 1.11 released; fixed a memory leak and added support for the SenseVoiceSmall ONNX model. File Transcription Service 2.0 GPU released; fixed a GPU memory leak ([docs](runtime/readme.md)).
- 2024/09/25: Keyword spotting models are newly supported, with fine-tuning and inference for four models: [fsmn_kws](https://modelscope.cn/models/iic/speech_sanm_kws_phone-xiaoyun-commands-online), [fsmn_kws_mt](https://modelscope.cn/models/iic/speech_sanm_kws_phone-xiaoyun-commands-online), [sanm_kws](https://modelscope.cn/models/iic/speech_sanm_kws_phone-xiaoyun-commands-offline), and [sanm_kws_streaming](https://modelscope.cn/models/iic/speech_sanm_kws_phone-xiaoyun-commands-online).
- 2024/07/04: [SenseVoice](https://github.com/FunAudioLLM/SenseVoice) is a speech foundation model with multiple speech understanding capabilities, including ASR, LID, SER, and AED.

<details><summary>Full Changelog</summary>

- 2024/07/01: Offline File Transcription Service GPU 1.1 released; fixed BladeDISC model compatibility issues ([docs](runtime/readme.md)).
- 2024/06/27: Offline File Transcription Service GPU 1.0 released, supporting dynamic batching and multi-threaded concurrency. On the long-audio test set, the single-thread RTF is 0.0076 and the multi-thread speedup is 1200+ (compared to 330+ on CPU) ([docs](runtime/readme.md)).
- 2024/05/15: Emotion recognition models are newly supported: [emotion2vec+large](https://modelscope.cn/models/iic/emotion2vec_plus_large/summary), [emotion2vec+base](https://modelscope.cn/models/iic/emotion2vec_plus_base/summary), and [emotion2vec+seed](https://modelscope.cn/models/iic/emotion2vec_plus_seed/summary). Currently supported categories: 0: angry, 1: happy, 2: neutral, 3: sad, 4: unknown.
- 2024/05/15: Offline File Transcription Service 4.5, Offline File Transcription Service of English 1.6, and Real-time Transcription Service 1.10 released, adapting to the FunASR 1.0 model structure ([docs](runtime/readme.md)).
- 2024/03/05: Added the Qwen-Audio and Qwen-Audio-Chat large-scale audio-text multimodal models, which have topped multiple audio-domain leaderboards. These models support speech dialogue ([usage](examples/industrial_data_pretraining/qwen_audio)).
- 2024/03/05: Added support for the Whisper-large-v3 model, a multitask model that performs multilingual speech recognition, speech translation, and language identification. It can be downloaded via the [ModelScope](examples/industrial_data_pretraining/whisper/demo.py) and [OpenAI](examples/industrial_data_pretraining/whisper/demo_from_openai.py) demos.
- 2024/03/05: Offline File Transcription Service 4.4, Offline File Transcription Service of English 1.5, and Real-time Transcription Service 1.9 released; the Docker image now supports the ARM64 platform, and ModelScope has been updated ([docs](runtime/readme.md)).
- 2024/01/30: funasr-1.0 has been released ([docs](https://github.com/alibaba-damo-academy/FunASR/discussions/1319)).
- 2024/01/30: Emotion recognition models are newly supported ([model link](https://www.modelscope.cn/models/iic/emotion2vec_base_finetuned/summary)), modified from this [repo](https://github.com/ddlBoJack/emotion2vec).
- 2024/01/25: Offline File Transcription Service 4.2 and Offline File Transcription Service of English 1.3 released; optimized the VAD (voice activity detection) data processing, significantly reducing peak memory usage, and fixed a memory leak. Real-time Transcription Service 1.7 released, with client-side optimizations ([docs](runtime/readme.md)).
- 2024/01/09: The FunASR SDK for Windows version 2.0 has been released, featuring the offline file transcription service (CPU) of Mandarin 4.1, the offline file transcription service (CPU) of English 1.2, and the real-time transcription service (CPU) of Mandarin 1.6. For more details, please refer to the official documentation or release notes ([FunASR-Runtime-Windows](https://www.modelscope.cn/models/damo/funasr-runtime-win-cpu-x64/summary)).
- 2024/01/03: File Transcription Service 4.0 released: added support for 8k models, fixed timestamp mismatch issues and added sentence-level timestamps, improved the effectiveness of English-word FST hotwords, supported automated configuration of thread parameters, and fixed known crashes and memory leaks ([docs](runtime/readme.md#file-transcription-service-mandarin-cpu)).
- 2024/01/03: Real-time Transcription Service 1.6 released; the 2pass-offline mode supports n-gram language model decoding and WFST hotwords, and known crashes and memory leaks were fixed ([docs](runtime/readme.md#the-real-time-transcription-service-mandarin-cpu)).
- 2024/01/03: Fixed known crashes and memory leaks ([docs](runtime/readme.md#file-transcription-service-english-cpu)).
- 2023/12/04: The FunASR SDK for Windows version 1.0 has been released, featuring the offline file transcription service (CPU) of Mandarin, the offline file transcription service (CPU) of English, and the real-time transcription service (CPU) of Mandarin. For more details, please refer to the official documentation or release notes ([FunASR-Runtime-Windows](https://www.modelscope.cn/models/damo/funasr-runtime-win-cpu-x64/summary)).
- 2023/11/08: The offline file transcription service 3.0 (CPU) of Mandarin has been released, adding a punctuation large model, an n-gram language model, and WFST hotwords ([docs](runtime#file-transcription-service-mandarin-cpu)).
- 2023/10/17: The offline file transcription service (CPU) of English has been released ([docs](runtime#file-transcription-service-english-cpu)).
- 2023/10/13: [SlideSpeech](https://slidespeech.github.io/): a large-scale multimodal audio-visual corpus with a significant amount of real-time synchronized slides.
- 2023/10/10: The combined ASR and speaker diarization pipeline [Paraformer-VAD-SPK](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr_vad_spk/speech_paraformer-large-vad-punc-spk_asr_nat-zh-cn/demo.py) is now released. Try the model to get recognition results with speaker information.
- 2023/10/07: [FunCodec](https://github.com/alibaba-damo-academy/FunCodec): a fundamental, reproducible, and integrable open-source toolkit for neural speech codecs.
- 2023/09/01: The offline file transcription service 2.0 (CPU) of Mandarin has been released, with added support for ffmpeg, timestamps, and hotword models ([docs](runtime#file-transcription-service-mandarin-cpu)).
- 2023/08/07: The real-time transcription service (CPU) of Mandarin has been released ([docs](runtime#the-real-time-transcription-service-mandarin-cpu)).
- 2023/07/17: BAT released, a low-latency, low-memory-consumption RNN-T model ([BAT](egs/aishell/bat)).
- 2023/06/26: The ASRU2023 Multi-Channel Multi-Party Meeting Transcription Challenge 2.0 concluded and announced its results ([M2MeT2.0](https://alibaba-damo-academy.github.io/FunASR/m2met2/index.html)).

</details>

<a name="Installation"></a>

## Installation

- Requirements

```text
python>=3.8
torch>=1.13
torchaudio
```

- Install from PyPI

```shell
pip3 install -U funasr
```

- Or install from source

```sh
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./
```

- Install modelscope or huggingface_hub for the pretrained models (optional)

```shell
pip3 install -U modelscope huggingface_hub
```
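Before downloading any models, it can be worth confirming the package imports cleanly; a minimal sanity check, assuming the package exposes a `__version__` attribute:

```python
# Minimal install sanity check (assumes funasr exposes __version__).
import funasr

print(funasr.__version__)       # e.g. "1.x.y"
from funasr import AutoModel    # the high-level API used throughout this README
```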
## Model Zoo

FunASR has open-sourced a large number of models pre-trained on industrial data. You are free to use, copy, modify, and share FunASR models under the [Model License Agreement](./MODEL_LICENSE). Below are some representative models; for more, please refer to the [Model Zoo](./model_zoo).

(Note: ⭐ represents the ModelScope model zoo, 🤗 represents the Hugging Face model zoo, 🍀 represents the OpenAI model zoo.)

| Model Name | Task Details | Training Data | Parameters |
|:---:|:---:|:---:|:---:|
| Fun-ASR-Nano <br> ([⭐](https://www.modelscope.cn/models/FunAudioLLM/Fun-ASR-Nano-2512) [🤗](https://huggingface.co/FunAudioLLM/Fun-ASR-Nano-2512)) | speech recognition for Chinese, English, and Japanese; Chinese covers 7 dialects and 26 regional accents, English and Japanese cover multiple regional accents; also supports lyric and rap speech recognition | tens of millions of hours | 800M |
| SenseVoiceSmall <br> ([⭐](https://www.modelscope.cn/models/iic/SenseVoiceSmall) [🤗](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)) | multiple speech understanding capabilities, including ASR, ITN, LID, SER, and AED; supports languages such as zh, yue, en, ja, ko | 300,000 hours | 234M |
| paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) [🤗](https://huggingface.co/funasr/paraformer-zh)) | speech recognition, with timestamps, non-streaming | 60,000 hours, Mandarin | 220M |
| <nobr>paraformer-zh-streaming</nobr> <br> ([⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming)) | speech recognition, streaming | 60,000 hours, Mandarin | 220M |
| paraformer-en <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en)) | speech recognition, without timestamps, non-streaming | 50,000 hours, English | 220M |
| conformer-en <br> ([⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en)) | speech recognition, non-streaming | 50,000 hours, English | 220M |
| ct-punc <br> ([⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc)) | punctuation restoration | 100M entries, Mandarin and English | 290M |
| fsmn-vad <br> ([⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad)) | voice activity detection | 5,000 hours, Mandarin and English | 0.4M |
| fsmn-kws <br> ([⭐](https://modelscope.cn/models/iic/speech_charctc_kws_phone-xiaoyun/summary)) | keyword spotting, streaming | 5,000 hours, Mandarin | 0.7M |
| fa-zh <br> ([⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh)) | timestamp prediction | 5,000 hours, Mandarin | 38M |
| cam++ <br> ([⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus)) | speaker verification/diarization | 5,000 hours | 7.2M |
| Whisper-large-v3 <br> ([⭐](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary) [🍀](https://github.com/openai/whisper)) | speech recognition, with timestamps, non-streaming | multilingual | 1550M |
| Whisper-large-v3-turbo <br> ([⭐](https://www.modelscope.cn/models/iic/Whisper-large-v3-turbo/summary) [🍀](https://github.com/openai/whisper)) | speech recognition, with timestamps, non-streaming | multilingual | 809M |
| Qwen-Audio <br> ([⭐](examples/industrial_data_pretraining/qwen_audio/demo.py) [🤗](https://huggingface.co/Qwen/Qwen-Audio)) | audio-text multimodal model (pretraining) | multilingual | 8B |
| Qwen-Audio-Chat <br> ([⭐](examples/industrial_data_pretraining/qwen_audio/demo_chat.py) [🤗](https://huggingface.co/Qwen/Qwen-Audio-Chat)) | audio-text multimodal model (chat) | multilingual | 8B |
| emotion2vec+large <br> ([⭐](https://modelscope.cn/models/iic/emotion2vec_plus_large/summary) [🤗](https://huggingface.co/emotion2vec/emotion2vec_plus_large)) | speech emotion recognition | 40,000 hours | 300M |

<a name="quick-start"></a>

## Quick Start

Below is a quick start tutorial. Test audio files: [Mandarin](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav), [English](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav).

### Command-line usage

```shell
funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=asr_example_zh.wav
```

Note: single audio files are supported, as well as file lists in Kaldi-style wav.scp format (`wav_id wav_path`); a hypothetical list is sketched below.
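For illustration, a two-entry list might look like this (file names are placeholders), assuming the list file is passed through the same `++input` flag:

```text
# wav.scp: one utterance per line, "wav_id wav_path"
utt_001 /data/audio/meeting_part1.wav
utt_002 /data/audio/meeting_part2.wav
```

```shell
funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=wav.scp
```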
### Speech Recognition (Non-streaming)

#### Fun-ASR-Nano

```python
from funasr import AutoModel

model_dir = "FunAudioLLM/Fun-ASR-Nano-2512"
wav_path = "asr_example_zh.wav"  # path to an input audio file

model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)
res = model.generate(input=[wav_path], cache={}, batch_size_s=0)
text = res[0]["text"]
print(text)
```

Parameter description:

- `model_dir`: the name of the model, or the path to the model on local disk.
- `vad_model`: enables VAD (voice activity detection). VAD splits long audio into shorter clips. In this case, the reported inference time covers both VAD and ASR, i.e. the end-to-end latency. To measure the ASR model's inference time on its own, disable the VAD model.
- `vad_kwargs`: configuration for the VAD model. `max_single_segment_time` is the maximum duration of a segment produced by the `vad_model`, in milliseconds (ms).
- `batch_size_s`: enables dynamic batching, where the total duration of audio in a batch is measured in seconds (s); see the multi-file sketch after this list.
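Continuing from the example above, dynamic batching pays off when transcribing several files at once. A minimal sketch, with placeholder file names and `batch_size_s=300` borrowed from the Paraformer example below:

```python
# Hypothetical batch run over several files; each batch packs up to ~300 s of audio.
wav_list = ["meeting_a.wav", "meeting_b.wav", "meeting_c.wav"]  # placeholder paths
res = model.generate(input=wav_list, cache={}, batch_size_s=300)
for r in res:
    print(r["text"])
```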
#### SenseVoice

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

# en
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
```

Parameter description:

- `model_dir`, `vad_model`, `vad_kwargs`, `batch_size_s`: as in the Fun-ASR-Nano example above.
- `use_itn`: whether the output includes punctuation and inverse text normalization.
- `merge_vad`: whether to merge short audio fragments produced by the VAD model; the merged length is `merge_length_s`, in seconds (s).
- `ban_emo_unk`: whether to ban output of the `emo_unk` token.

#### Paraformer

```python
from funasr import AutoModel

# paraformer-zh is a multi-functional ASR model;
# enable the vad, punc, and spk models as needed
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    # spk_model="cam++",
)
res = model.generate(
    input=f"{model.model_path}/example/asr_example.wav",
    batch_size_s=300,
    hotword="魔搭",
)
print(res)
```

Note: the `hub` argument selects the model repository: `ms` downloads from ModelScope, `hf` downloads from Hugging Face; see the sketch below.
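A minimal sketch of the `hub` argument from the note above, assuming `AutoModel` accepts it alongside the model name (`ms` being the default):

```python
# Choose the download source for model weights (assumption based on the note above).
model_ms = AutoModel(model="paraformer-zh", hub="ms")  # download from ModelScope
model_hf = AutoModel(model="paraformer-zh", hub="hf")  # download from Hugging Face
```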
### Speech Recognition (Streaming)

```python
from funasr import AutoModel
import soundfile
import os

chunk_size = [0, 10, 5]  # [0, 10, 5] = 600 ms, [0, 8, 4] = 480 ms
encoder_chunk_look_back = 4  # number of chunks to look back at for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back at for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming")

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 10 x 60 ms = 600 ms, i.e. 9600 samples at 16 kHz

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(
        input=speech_chunk,
        cache=cache,
        is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res)
```

Note: `chunk_size` configures the streaming latency. `[0, 10, 5]` means the real-time display granularity is `10 * 60 = 600 ms` and the lookahead is `5 * 60 = 300 ms`. Each inference call takes `600 ms` of input (`16000 * 0.6 = 9600` sample points) and outputs the corresponding text. For the final speech segment, set `is_final=True` to force output of the last word.
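To stitch the partial results into one transcript, a small sketch reusing the variables above, assuming each streaming result carries its text under `res[0]["text"]` as in the non-streaming examples:

```python
# Accumulate streaming partials into a single running transcript.
cache = {}
transcript = ""
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    res = model.generate(
        input=speech_chunk,
        cache=cache,
        is_final=(i == total_chunk_num - 1),
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    if res and res[0].get("text"):
        transcript += res[0]["text"]
print(transcript)
```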
<details><summary>More Examples</summary>

### Voice Activity Detection (Non-streaming)

```python
from funasr import AutoModel

model = AutoModel(model="fsmn-vad")
wav_file = f"{model.model_path}/example/vad_example.wav"
res = model.generate(input=wav_file)
print(res)
```

Note: the output format of the VAD model is `[[beg1, end1], [beg2, end2], ..., [begN, endN]]`, where `begN`/`endN` are the start/end points of the N-th valid audio segment, measured in milliseconds.
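Those millisecond offsets map straight onto the waveform; a minimal sketch, assuming the segment list sits under `res[0]["value"]` (the key used by the streaming VAD example below):

```python
# Cut each detected speech segment out of the waveform.
import soundfile

speech, sample_rate = soundfile.read(wav_file)
for beg_ms, end_ms in res[0]["value"]:
    beg = int(beg_ms * sample_rate / 1000)  # ms -> sample index
    end = int(end_ms * sample_rate / 1000)
    segment = speech[beg:end]
    print(f"segment {beg_ms}-{end_ms} ms: {len(segment)} samples")
```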
### Voice Activity Detection (Streaming)

```python
from funasr import AutoModel
import soundfile

chunk_size = 200  # ms
model = AutoModel(model="fsmn-vad")

wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = int(chunk_size * sample_rate / 1000)

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)
```

Note: the streaming VAD model outputs one of four result shapes:

- `[[beg1, end1], [beg2, end2], ..., [begN, endN]]`: the same as the offline VAD output above.
- `[[beg, -1]]`: only a starting point has been detected so far.
- `[[-1, end]]`: only an ending point has been detected.
- `[]`: neither a starting point nor an ending point has been detected.

The output is measured in milliseconds and represents the absolute time from the start of the audio.
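These four shapes fold naturally into speech start/end events; a small helper, written purely against the documented output format above:

```python
def handle_vad_result(segments):
    """Interpret one streaming VAD result (res[0]["value"]) as events."""
    for beg, end in segments:
        if beg != -1 and end != -1:
            print(f"complete segment: {beg} ms - {end} ms")
        elif end == -1:
            print(f"speech started at {beg} ms")
        else:  # beg == -1: only an ending point was detected
            print(f"speech ended at {end} ms")
```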
\"damo\u002Fspeech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch\"\nmodel = Paraformer(model_dir, batch_size=1, quantize=True)\n\nwav_path = [f\"{home_dir}\u002F.cache\u002Fmodelscope\u002Fhub\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch\u002Fexample\u002Fasr_example.wav\"]\n\nresult = model(wav_path)\nprint(result)\n```\n\nMore examples ref to [demo](runtime\u002Fpython\u002Fonnxruntime)\n\n## Deployment Service\n\nFunASR supports deploying pre-trained or further fine-tuned models for service. Currently, it supports the following types of service deployment:\n\n- File transcription service, Mandarin, CPU version, done\n- The real-time transcription service, Mandarin (CPU), done\n- File transcription service, English, CPU version, done\n- File transcription service, Mandarin, GPU version, in progress\n- and more.\n\nFor more detailed information, please refer to the [service deployment documentation](runtime\u002Freadme.md).\n\n\u003Ca name=\"contact\">\u003C\u002Fa>\n\n## Community Communication\n\nIf you encounter problems in use, you can directly raise Issues on the github page.\n\nYou can also scan the following DingTalk group to join the community group for communication and discussion.\n\n|                           DingTalk group                            |\n| :-----------------------------------------------------------------: |\n| \u003Cdiv align=\"left\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_1d0d07053db0.png\" width=\"250\"\u002F> |\n\n## Contributors\n\n| \u003Cdiv align=\"left\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_e72d2b6eb5c6.png\" width=\"260\"\u002F> | \u003Cdiv align=\"left\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_646f2bd40b23.png\" width=\"260\"\u002F> | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_646067faa268.png\" width=\"200\"\u002F> \u003C\u002Fdiv> | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_9d21d947e4dc.png\" width=\"200\"\u002F> \u003C\u002Fdiv> | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_69cf8103f3ee.png\" width=\"200\"\u002F> \u003C\u002Fdiv> | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_0ce9353ff267.png\" width=\"250\"\u002F> \u003C\u002Fdiv> |\n| :----------------------------------------------------------------: | :-------------------------------------------------------------: | :-----------------------------------------------------------: | :-----------------------------------------------------: | :-------------------------------------------------------: | :----------------------------------------------------: |\n\nThe contributors can be found in [contributors list](.\u002FAcknowledge.md)\n\n## License\n\nThis project is licensed under [The MIT License](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT). 
FunASR also contains various third-party components and some code modified from other repos under other open-source licenses. Use of the pretrained models is subject to the [model license](./MODEL_LICENSE).

## Citations

```bibtex
@inproceedings{gao2023funasr,
  author={Zhifu Gao and Zerui Li and Jiaming Wang and Haoneng Luo and Xian Shi and Mengzhe Chen and Yabin Li and Lingyun Zuo and Zhihao Du and Zhangyu Xiao and Shiliang Zhang},
  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
  year={2023},
  booktitle={INTERSPEECH},
}
@inproceedings{An2023bat,
  author={Keyu An and Xian Shi and Shiliang Zhang},
  title={BAT: Boundary aware transducer for memory-efficient and low-latency ASR},
  year={2023},
  booktitle={INTERSPEECH},
}
@inproceedings{gao22b_interspeech,
  author={Zhifu Gao and ShiLiang Zhang and Ian McLoughlin and Zhijie Yan},
  title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
  year={2022},
  booktitle={Proc. Interspeech 2022},
  pages={2063--2067},
  doi={10.21437/Interspeech.2022-9996}
}
@inproceedings{shi2023seaco,
  author={Xian Shi and Yexin Yang and Zerui Li and Yanni Chen and Zhifu Gao and Shiliang Zhang},
  title={SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability},
  year={2023},
  booktitle={ICASSP2024}
}
```
[Paraformer-large](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch\u002Fsummary) 是一种非自回归的端到端语音识别模型，具有高精度、高效率和部署便捷等优势，可支持快速构建语音识别服务。有关服务部署的更多详情，请参阅 [服务部署文档](runtime\u002Freadme_cn.md)。\n\n\u003Ca name=\"whats-new\">\u003C\u002Fa>\n\n## 最新动态：\n\n- 2025年12月15日：[Fun-ASR-Nano-2512](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FFun-ASR) 是一款基于数千万小时真实语音数据训练的端到端语音识别大模型。它支持低延迟实时转写，并覆盖31种语言。\n- 2024年10月29日：实时转录服务1.12版本发布，双遍离线模式支持 SensevoiceSmall 模型；（[文档](runtime\u002Freadme.md)）；\n- 2024年10月10日：新增对 Whisper-large-v3-turbo 模型的支持，该模型是一款多任务模型，可实现多语言语音识别、语音翻译和语言识别等功能。用户可以从 [ModelScope](examples\u002Findustrial_data_pretraining\u002Fwhisper\u002Fdemo.py) 和 [OpenAI](examples\u002Findustrial_data_pretraining\u002Fwhisper\u002Fdemo_from_openai.py) 下载该模型。\n- 2024年9月26日：离线文件转录服务4.6版本、英语离线文件转录服务1.7版本以及实时转录服务1.11版本发布，修复了内存泄漏问题，并支持 SensevoiceSmall ONNX 模型；同时发布了文件转录服务2.0 GPU版本，解决了 GPU 内存泄漏问题；（[文档](runtime\u002Freadme.md)）；\n- 2024年9月25日：新增关键词检测模型支持。现支持四种模型的微调与推理：[fsmn_kws](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Fspeech_sanm_kws_phone-xiaoyun-commands-online)、[fsmn_kws_mt](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Fspeech_sanm_kws_phone-xiaoyun-commands-online)、[sanm_kws](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Fspeech_sanm_kws_phone-xiaoyun-commands-offline) 和 [sanm_kws_streaming](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Fspeech_sanm_kws_phone-xiaoyun-commands-online)。\n- 2024年7月4日：[SenseVoice](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FSenseVoice) 是一款具备多种语音理解能力的基础模型，包括 ASR、LID、SER 和 AED 等。\n\n\u003Cdetails>\u003Csummary>完整更新日志\u003C\u002Fsummary>\n    \n- 2024年7月1日：离线文件转写服务 GPU 1.1 版本发布，优化了 BladeDISC 模型兼容性问题；参考文档（[docs](runtime\u002Freadme.md)）\n- 2024年6月27日：离线文件转写服务 GPU 1.0 版本发布，支持动态批处理和多线程并发。在长音频测试集中，单线程 RTF 为 0.0076，多线程加速比超过 1200 倍（相比 CPU 的 330 倍左右）；参考文档（[docs](runtime\u002Freadme.md)）\n- 2024年5月15日：新增情感识别模型支持。包括 [emotion2vec+large](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Femotion2vec_plus_large\u002Fsummary)、[emotion2vec+base](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Femotion2vec_plus_base\u002Fsummary) 和 [emotion2vec+seed](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Femotion2vec_plus_seed\u002Fsummary)。目前支持以下类别：0：愤怒，1：快乐，2： neutral，3：悲伤，4：未知。\n- 2024年5月15日：离线文件转写服务 4.5、英语离线文件转写服务 1.6、实时转写服务 1.10 发布，适配 FunASR 1.0 模型结构；([docs](runtime\u002Freadme.md))\n- 2024年3月5日：新增 Qwen-Audio 和 Qwen-Audio-Chat 大规模音视频多模态模型，这些模型在多个音频领域榜单中名列前茅。它们支持语音对话，使用方法参见 [usage](examples\u002Findustrial_data_pretraining\u002Fqwen_audio)。\n- 2024年3月5日：新增对 Whisper-large-v3 模型的支持，该模型是一个多任务模型，可进行多语言语音识别、语音翻译和语言识别。可以从 [modelscope](examples\u002Findustrial_data_pretraining\u002Fwhisper\u002Fdemo.py) 和 [openai](examples\u002Findustrial_data_pretraining\u002Fwhisper\u002Fdemo_from_openai.py) 下载。\n- 2024年3月5日：离线文件转写服务 4.4、英语离线文件转写服务 1.5、实时转写服务 1.9 发布，Docker 镜像支持 ARM64 平台，并更新 ModelScope；([docs](runtime\u002Freadme.md))\n- 2024年1月30日：funasr-1.0 已发布（[docs](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Fdiscussions\u002F1319)）\n- 2024年1月30日：新增情感识别模型支持。模型链接为 [model link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fiic\u002Femotion2vec_base_finetuned\u002Fsummary)，基于 [repo](https:\u002F\u002Fgithub.com\u002FddlBoJack\u002Femotion2vec) 修改而来。\n- 2024年1月25日：离线文件转写服务 4.2、英语离线文件转写服务 1.3 发布，优化了 VAD（语音活动检测）数据处理方法，显著降低了内存峰值占用，并优化了内存泄漏问题；实时转写服务 1.7 发布，优化了客户端部分；([docs](runtime\u002Freadme.md))\n- 2024年1月9日：Funasr SDK for Windows 
2.0 版本发布，支持普通话离线文件转写服务（CPU）4.1、英语离线文件转写服务（CPU）1.2、普通话实时转写服务（CPU）1.6。更多详情请参阅官方文档或发布说明（[FunASR-Runtime-Windows](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Ffunasr-runtime-win-cpu-x64\u002Fsummary)）\n- 2024年1月3日：文件转写服务 4.0 发布，新增对 8k 模型的支持，优化了时间戳不匹配问题并增加了句子级时间戳，提升了英文单词 FST 热词的效果，支持线程参数的自动化配置，并修复了已知的崩溃问题及内存泄漏问题，详情参见 ([docs](runtime\u002Freadme.md#file-transcription-service-mandarin-cpu))。\n- 2024年1月3日：实时转写服务 1.6 发布，双通道离线模式支持 Ngram 语言模型解码和 WFST 热词，同时解决了已知的崩溃问题和内存泄漏问题，([docs](runtime\u002Freadme.md#the-real-time-transcription-service-mandarin-cpu))。\n- 2024年1月3日：修复了已知的崩溃问题以及内存泄漏问题，([docs](runtime\u002Freadme.md#file-transcription-service-english-cpu))。\n- 2023年12月4日：Funasr SDK for Windows 1.0 版本发布，支持普通话离线文件转写服务（CPU）、英语离线文件转写服务（CPU）以及普通话实时转写服务（CPU）。更多详情请参阅官方文档或发布说明（[FunASR-Runtime-Windows](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Ffunasr-runtime-win-cpu-x64\u002Fsummary)）。\n- 2023年11月8日：普通话离线文件转写服务 3.0（CPU）发布，新增标点符号大模型、Ngram 语言模型和 WFST 热词。详细信息请参阅 [docs](runtime#file-transcription-service-mandarin-cpu)。\n- 2023年10月17日：英语离线文件转写服务（CPU）发布。更多详情请参阅 ([docs](runtime#file-transcription-service-english-cpu))。\n- 2023年10月13日：[SlideSpeech](https:\u002F\u002Fslidespeech.github.io\u002F)：一个大规模多模态视听语料库，包含大量实时同步的幻灯片。\n- 2023年10月10日：ASR-SpeakersDiarization 联合流水线 [Paraformer-VAD-SPK](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Fblob\u002Fmain\u002Fegs_modelscope\u002Fasr_vad_spk\u002Fspeech_paraformer-large-vad-punc-spk_asr_nat-zh-cn\u002Fdemo.py) 现已发布。体验该模型即可获得带有说话人信息的识别结果。\n- 2023年10月7日：[FunCodec](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunCodec)：一套基础、可复现且可集成的开源神经语音编解码工具包。\n- 2023年9月1日：普通话离线文件转写服务 2.0（CPU）发布，新增对 ffmpeg、时间戳和热词模型的支持。更多详情请参阅 ([docs](runtime#file-transcription-service-mandarin-cpu))。\n- 2023年8月7日：普通话实时转写服务（CPU）发布。更多详情请参阅 ([docs](runtime#the-real-time-transcription-service-mandarin-cpu))。\n- 2023年7月17日：BAT 模型发布，这是一种低延迟、低内存消耗的 RNN-T 模型。更多详情请参阅 ([BAT](egs\u002Faishell\u002Fbat))。\n- 2023年6月26日：ASRU2023 多通道多方会议转录挑战赛 2.0 完成比赛并公布结果。更多详情请参阅 ([M2MeT2.0](https:\u002F\u002Falibaba-damo-academy.github.io\u002FFunASR\u002Fm2met2\u002Findex.html))。\n\n\u003C\u002Fdetails>\n\n\u003Ca name=\"Installation\">\u003C\u002Fa>\n\n\n\n## 安装\n\n- 需求\n\n```text\npython>=3.8\ntorch>=1.13\ntorchaudio\n```\n\n- 通过 pypi 安装\n\n```shell\npip3 install -U funasr\n```\n\n- 或者从源代码安装\n\n```sh\ngit clone https:\u002F\u002Fgithub.com\u002Falibaba\u002FFunASR.git && cd FunASR\npip3 install -e .\u002F\n```\n\n- 安装 modelscope 或 huggingface_hub 以获取预训练模型（可选）\n\n```shell\npip3 install -U modelscope huggingface_hub\n```\n\n## 模型库\n\nFunASR 已在工业级数据上开源了大量预训练模型。您可以在[模型许可协议](.\u002FMODEL_LICENSE)的许可下自由使用、复制、修改和分享 FunASR 模型。以下是一些代表性模型，更多模型请参阅[模型库](.\u002Fmodel_zoo)。\n\n（注：⭐ 表示 ModelScope 模型库，🤗 表示 Huggingface 模型库，🍀 表示 OpenAI 模型库）\n\n|                                                                                                         模型名称                                                                                                         |                                                                                                                        任务详情                                                                                                                         |          训练数据           | 参数 
|\n|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------:| :--------: |\n|                   Fun-ASR-Nano \u003Cbr> ([⭐](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FFunAudioLLM\u002FFun-ASR-Nano-2512) [🤗](https:\u002F\u002Fhuggingface.co\u002FFunAudioLLM\u002FFun-ASR-Nano-2512) )                                                      |语音识别支持中文、英语和日语。中文包括7种方言和26种地方口音的支持。英语和日语覆盖多种地区口音。附加功能包括歌词识别和说唱语音识别。 |    数千万小时     |  800M  |\n|                                         SenseVoiceSmall \u003Cbr> ([⭐](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fiic\u002FSenseVoiceSmall) [🤗](https:\u002F\u002Fhuggingface.co\u002FFunAudioLLM\u002FSenseVoiceSmall) )                                         |                                                              多种语音理解能力，包括ASR、ITN、LID、SER和AED，支持zh、yue、en、ja、ko等语言                                                               |           30万小时           |    234M    |\n|           paraformer-zh \u003Cbr> ([⭐](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch\u002Fsummary) [🤗](https:\u002F\u002Fhuggingface.co\u002Ffunasr\u002Fparaformer-zh) )           |                                                                                                     语音识别，带时间戳，非流式                                                                                                      |      6万小时，普通话       |    220M    |\n| \u003Cnobr>paraformer-zh-streaming \u003Cbr> ( [⭐](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online\u002Fsummary) [🤗](https:\u002F\u002Fhuggingface.co\u002Ffunasr\u002Fparaformer-zh-streaming) )\u003C\u002Fnobr> |                                                                                                                语音识别，流式                                                                                                                |      6万小时，普通话       |    220M    |\n|               paraformer-en \u003Cbr> ( [⭐](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020\u002Fsummary) [🤗](https:\u002F\u002Fhuggingface.co\u002Ffunasr\u002Fparaformer-en) )                |                                                                                                    语音识别，无时间戳，非流式                                                                                                    |       5万小时，英语       |    220M    |\n|                            conformer-en \u003Cbr> ( [⭐](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_conformer_asr-en-16k-vocab4199-pytorch\u002Fsummary) [🤗](https:\u002F\u002Fhuggingface.co\u002Ffunasr\u002Fconformer-en) )                             |                                                                                                              语音识别，非流式                                                                                                              |       5万小时，英语       |    220M    |\n|         
                      ct-punc \u003Cbr> ( [⭐](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fpunc_ct-transformer_cn-en-common-vocab471067-large\u002Fsummary) [🤗](https:\u002F\u002Fhuggingface.co\u002Ffunasr\u002Fct-punc) )                               |                                                                                                                   标点符号恢复                                                                                                                   |    1亿条，中文和英语    |    290M    |\n|                                   fsmn-vad \u003Cbr> ( [⭐](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_fsmn_vad_zh-cn-16k-common-pytorch\u002Fsummary) [🤗](https:\u002F\u002Fhuggingface.co\u002Ffunasr\u002Ffsmn-vad) )                                   |                                                                                                                  语音活动检测                                                                                                                   | 5000小时，中文和英语 |    0.4M    |\n|                                                              fsmn-kws \u003Cbr> ( [⭐](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Fspeech_charctc_kws_phone-xiaoyun\u002Fsummary) )                                                              |                                                                                                                 关键词检测，流式                                                                                                                  |       5000小时，普通话       |    0.7M    |\n|                                     fa-zh \u003Cbr> ( [⭐](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_timestamp_prediction-v1-16k-offline\u002Fsummary) [🤗](https:\u002F\u002Fhuggingface.co\u002Ffunasr\u002Ffa-zh) )                                     |                                                                                                                    时间戳预测                                                                                                                     |       5000小时，普通话       |    38M     |\n|                                       cam++ \u003Cbr> ( [⭐](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Fspeech_campplus_sv_zh-cn_16k-common\u002Fsummary) [🤗](https:\u002F\u002Fhuggingface.co\u002Ffunasr\u002Fcampplus) )                                        |                                                                                                              发言人验证\u002F区分                                                                                                               |            5000小时            |    7.2M    |\n|                                            Whisper-large-v3 \u003Cbr> ([⭐](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fiic\u002FWhisper-large-v3\u002Fsummary) [🍀](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper) )                                             |                                                                                                     语音识别，带时间戳，非流式                                                                                                      |           多语言           |   1550 M   |\n|                                      Whisper-large-v3-turbo \u003Cbr> ([⭐](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fiic\u002FWhisper-large-v3-turbo\u002Fsummary) [🍀](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper) )                                       |                      
语音识别，带时间戳，非流式                                                                                                      |           多语言           |   809 M    |\n|                                                Qwen-Audio \u003Cbr> ([⭐](examples\u002Findustrial_data_pretraining\u002Fqwen_audio\u002Fdemo.py) [🤗](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen-Audio) )                                                |                                                                                                         音频-文本多模态模型（预训练）                                                                                                          |           多语言           |     8B     |\n|                                        Qwen-Audio-Chat \u003Cbr> ([⭐](examples\u002Findustrial_data_pretraining\u002Fqwen_audio\u002Fdemo_chat.py) [🤗](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen-Audio-Chat) )                                         |                                                                                                             音频-文本多模态模型（聊天）                                                                                                             |           多语言           |     8B     |\n|                               emotion2vec+large \u003Cbr> ([⭐](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Femotion2vec_plus_large\u002Fsummary) [🤗](https:\u002F\u002Fhuggingface.co\u002Femotion2vec\u002Femotion2vec_plus_large) )                               |                                                                                                                 语音情感识别                                                                                                                 |           4万小时            |    300M    |\n\n[\u002F\u002F]: #\n[\u002F\u002F]: # \"FunASR支持预训练或进一步微调后的模型以服务形式部署。中文离线文件转换服务的CPU版本已发布，详情请参阅[文档](funasr\u002Fruntime\u002Fdocs\u002FSDK_tutorial.md)。关于服务部署的更详细信息，请参阅[部署路线图](funasr\u002Fruntime\u002Freadme_cn.md)。\"\n\n\u003Ca name=\"quick-start\">\u003C\u002Fa>\n\n## 快速入门\n\n以下是快速入门教程。测试音频文件（[普通话](https:\u002F\u002Fisv-data.oss-cn-hangzhou.aliyuncs.com\u002Fics\u002FMaaS\u002FASR\u002Ftest_audio\u002Fvad_example.wav)，[英语](https:\u002F\u002Fisv-data.oss-cn-hangzhou.aliyuncs.com\u002Fics\u002FMaaS\u002FASR\u002Ftest_audio\u002Fasr_example_en.wav)）。\n\n### 命令行使用\n\n```shell\nfunasr ++model=paraformer-zh ++vad_model=\"fsmn-vad\" ++punc_model=\"ct-punc\" ++input=asr_example_zh.wav\n```\n\n注意：支持单个音频文件的识别，也支持Kaldi风格的wav.scp格式的文件列表：`wav_id wav_path`。
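\n\n例如，一个最小的 wav.scp 文件如下（其中的音频 ID 与路径均为假设的示例值）：\n\n```text\nid_001 \u002Fdata\u002Faudio\u002Fid_001.wav\nid_002 \u002Fdata\u002Faudio\u002Fid_002.wav\n```\n\n将该列表文件作为输入即可批量识别（示意用法，假设 wav.scp 位于当前目录）：\n\n```shell\nfunasr ++model=paraformer-zh ++input=wav.scp\n```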
vad_model=\"fsmn-vad\",\n    vad_kwargs={\"max_single_segment_time\": 30000},\n    device=\"cuda:0\",\n)\n\n# 英语\nres = model.generate(\n    input=f\"{model.model_path}\u002Fexample\u002Fen.mp3\",\n    cache={},\n    language=\"auto\",  # \"zn\", \"en\", \"yue\", \"ja\", \"ko\", \"nospeech\"\n    use_itn=True,\n    batch_size_s=60,\n    merge_vad=True,  #\n    merge_length_s=15,\n)\ntext = rich_transcription_postprocess(res[0][\"text\"])\nprint(text)\n```\n\n参数说明：\n\n- `model_dir`：模型名称，或本地磁盘上模型的路径。\n- `vad_model`：表示启用VAD（语音活动检测）。VAD的作用是将长音频分割成较短的片段。在这种情况下，推理时间包括VAD和SenseVoice的总耗时，代表端到端延迟。如果希望单独测试SenseVoice模型的推理时间，可以禁用VAD模型。\n- `vad_kwargs`：指定VAD模型的配置。`max_single_segment_time`：表示`vad_model`进行音频分割的最大时长，单位为毫秒（ms）。\n- `use_itn`：输出结果是否包含标点符号和逆文本规范化。\n- `batch_size_s`：表示使用动态批处理，其中批次中音频的总时长以秒（s）为单位。\n- `merge_vad`：是否合并由VAD模型分割的短音频片段，合并后的长度为`merge_length_s`，单位为秒（s）。\n- `ban_emo_unk`：是否禁止输出`emo_unk`标记。\n\n#### Paraformer\n\n```python\nfrom funasr import AutoModel\n# paraformer-zh 是一个多功能的ASR模型\n# 根据需要选择是否使用vad、punc、spk等\nmodel = AutoModel(model=\"paraformer-zh\",  vad_model=\"fsmn-vad\",  punc_model=\"ct-punc\",\n                  # spk_model=\"cam++\",\n                  )\nres = model.generate(input=f\"{model.model_path}\u002Fexample\u002Fasr_example.wav\",\n                     batch_size_s=300,\n                     hotword='魔搭')\nprint(res)\n```\n\n注：`hub`表示模型仓库，`ms`代表选择ModelScope下载，`hf`代表选择Huggingface下载。\n\n### 语音识别（流式）\n\n```python\nfrom funasr import AutoModel\n\nchunk_size = [0, 10, 5] #[0, 10, 5] 600ms, [0, 8, 4] 480ms\nencoder_chunk_look_back = 4 #编码器自注意力回看的块数\ndecoder_chunk_look_back = 1 #解码器交叉注意力回看的编码器块数\n\nmodel = AutoModel(model=\"paraformer-zh-streaming\")\n\nimport soundfile\nimport os\n\nwav_file = os.path.join(model.model_path, \"example\u002Fasr_example.wav\")\nspeech, sample_rate = soundfile.read(wav_file)\nchunk_stride = chunk_size[1] * 960 # 600ms\n\ncache = {}\ntotal_chunk_num = int(len((speech)-1)\u002Fchunk_stride+1)\nfor i in range(total_chunk_num):\n    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]\n    is_final = i == total_chunk_num - 1\n    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back, decoder_chunk_look_back=decoder_chunk_look_back)\n    print(res)\n```\n\n注：`chunk_size`是流式延迟的配置。`[0,10,5]`表示实时显示的粒度为`10*60=600ms`，前瞻信息为`5*60=300ms`。每次推理输入为`600ms`（采样点为`16000*0.6=960`），输出为对应的文本。对于最后一段语音输入，需设置`is_final=True`，以强制输出最后一个词。\n\n\u003Cdetails>\u003Csummary>更多示例\u003C\u002Fsummary>\n\n### 语音活动检测（非流式）\n\n```python\nfrom funasr import AutoModel\n\nmodel = AutoModel(model=\"fsmn-vad\")\nwav_file = f\"{model.model_path}\u002Fexample\u002Fvad_example.wav\"\nres = model.generate(input=wav_file)\nprint(res)\n```\n\n注：VAD模型的输出格式为：`[[beg1, end1], [beg2, end2], ..., [begN, endN]]`，其中`begN\u002FendN`表示第`N`个有效音频片段的起始\u002F结束时间，单位为毫秒。\n\n### 语音活动检测（流式）\n\n```python\nfrom funasr import AutoModel\n\nchunk_size = 200 # 毫秒\nmodel = AutoModel(model=\"fsmn-vad\")\n\nimport soundfile\n\nwav_file = f\"{model.model_path}\u002Fexample\u002Fvad_example.wav\"\nspeech, sample_rate = soundfile.read(wav_file)\nchunk_stride = int(chunk_size * sample_rate \u002F 1000)\n\ncache = {}\ntotal_chunk_num = int(len((speech)-1)\u002Fchunk_stride+1)\nfor i in range(total_chunk_num):\n    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]\n    is_final = i == total_chunk_num - 1\n    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)\n    if 
len(res[0][\"value\"]):\n        print(res)\n```\n\n注意：流式 VAD 模型的输出格式有四种情况：\n\n- `[[beg1, end1], [beg2, end2], .., [begN, endN]]`：与上述离线 VAD 输出结果相同。\n- `[[beg, -1]]`：表示仅检测到起始点。\n- `[[-1, end]]`：表示仅检测到结束点。\n- `[]`：表示未检测到起始点或结束点。\n\n输出以毫秒为单位，表示从起始点开始的绝对时间。\n\n### 标点符号恢复\n\n```python\nfrom funasr import AutoModel\n\nmodel = AutoModel(model=\"ct-punc\")\nres = model.generate(input=\"那今天的会就到这里吧 happy new year 明年见\")\nprint(res)\n```\n\n### 时间戳预测\n\n```python\nfrom funasr import AutoModel\n\nmodel = AutoModel(model=\"fa-zh\")\nwav_file = f\"{model.model_path}\u002Fexample\u002Fasr_example.wav\"\ntext_file = f\"{model.model_path}\u002Fexample\u002Ftext.txt\"\nres = model.generate(input=(wav_file, text_file), data_type=(\"sound\", \"text\"))\nprint(res)\n```\n\n### 语音情感识别\n\n```python\nfrom funasr import AutoModel\n\nmodel = AutoModel(model=\"emotion2vec_plus_large\")\n\nwav_file = f\"{model.model_path}\u002Fexample\u002Ftest.wav\"\n\nres = model.generate(wav_file, output_dir=\".\u002Foutputs\", granularity=\"utterance\", extract_embedding=False)\nprint(res)\n```\n\n更多用法参考 [文档](docs\u002Ftutorial\u002FREADME_zh.md)，\n更多示例参考 [demo](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Ftree\u002Fmain\u002Fexamples\u002Findustrial_data_pretraining)\n\n\u003C\u002Fdetails>\n\n## 导出 ONNX\n\n### 命令行使用\n\n```shell\nfunasr-export ++model=paraformer ++quantize=false ++device=cpu\n```\n\n### Python\n\n```python\nfrom funasr import AutoModel\n\nmodel = AutoModel(model=\"paraformer\", device=\"cpu\")\n\nres = model.export(quantize=False)\n```\n\n### 测试 ONNX\n\n```python\n# pip3 install -U funasr-onnx\nfrom pathlib import Path\nfrom runtime.python.onnxruntime.funasr_onnx.paraformer_bin import Paraformer\n\n\nhome_dir = Path.home()\n\nmodel_dir = \"damo\u002Fspeech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch\"\nmodel = Paraformer(model_dir, batch_size=1, quantize=True)\n\nwav_path = [f\"{home_dir}\u002F.cache\u002Fmodelscope\u002Fhub\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch\u002Fexample\u002Fasr_example.wav\"]\n\nresult = model(wav_path)\nprint(result)\n```\n\n更多示例参考 [demo](runtime\u002Fpython\u002Fonnxruntime)\n\n## 部署服务\n\nFunASR 支持部署预训练或进一步微调后的模型以提供服务。目前支持以下类型的服务部署：\n\n- 文件转写服务，普通话，CPU 版本，已完成\n- 实时转写服务，普通话（CPU），已完成\n- 文件转写服务，英语，CPU 版本，已完成\n- 文件转写服务，普通话，GPU 版本，正在进行中\n- 以及其他。\n\n更多详细信息，请参阅 [服务部署文档](runtime\u002Freadme.md)。\n\n\u003Ca name=\"contact\">\u003C\u002Fa>\n\n## 社区交流\n\n如果您在使用过程中遇到问题，可以直接在 GitHub 页面上提交 Issue。\n\n您也可以扫描下方的钉钉群二维码，加入社区群进行交流和讨论。\n\n|                           钉钉群                            |\n| :-----------------------------------------------------------------: |\n| \u003Cdiv align=\"left\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_1d0d07053db0.png\" width=\"250\"\u002F> |\n\n## 贡献者\n\n| \u003Cdiv align=\"left\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_e72d2b6eb5c6.png\" width=\"260\"\u002F> | \u003Cdiv align=\"left\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_646f2bd40b23.png\" width=\"260\"\u002F> | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_646067faa268.png\" width=\"200\"\u002F> \u003C\u002Fdiv> | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_9d21d947e4dc.png\" width=\"200\"\u002F> \u003C\u002Fdiv> | \u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_69cf8103f3ee.png\" width=\"200\"\u002F> \u003C\u002Fdiv> | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_readme_0ce9353ff267.png\" width=\"250\"\u002F> \u003C\u002Fdiv> |\n| :----------------------------------------------------------------: | :-------------------------------------------------------------: | :-----------------------------------------------------------: | :-----------------------------------------------------: | :-------------------------------------------------------: | :----------------------------------------------------: |\n\n贡献者名单请参见 [贡献者列表](.\u002FAcknowledge.md)\n\n## 许可证\n\n本项目采用 [MIT 许可证](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT) 开源。FunASR 还包含多种第三方组件，以及基于其他开源许可证修改的部分代码。\n预训练模型的使用受 [模型许可证](.\u002FMODEL_LICENSE) 约束。\n\n## 引用\n\n```bibtex\n@inproceedings{gao2023funasr,\n  author={Zhifu Gao and Zerui Li and Jiaming Wang and Haoneng Luo and Xian Shi and Mengzhe Chen and Yabin Li and Lingyun Zuo and Zhihao Du and Zhangyu Xiao and Shiliang Zhang},\n  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},\n  year={2023},\n  booktitle={INTERSPEECH},\n}\n@inproceedings{An2023bat,\n  author={Keyu An and Xian Shi and Shiliang Zhang},\n  title={BAT: Boundary aware transducer for memory-efficient and low-latency ASR},\n  year={2023},\n  booktitle={INTERSPEECH},\n}\n@inproceedings{gao22b_interspeech,\n  author={Zhifu Gao and ShiLiang Zhang and Ian McLoughlin and Zhijie Yan},\n  title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},\n  year=2022,\n  booktitle={Proc. Interspeech 2022},\n  pages={2063--2067},\n  doi={10.21437\u002FInterspeech.2022-9996}\n}\n@inproceedings{shi2023seaco,\n  author={Xian Shi and Yexin Yang and Zerui Li and Yanni Chen and Zhifu Gao and Shiliang Zhang},\n  title={SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability},\n  year={2023},\n  booktitle={ICASSP2024}\n}\n```","# FunASR 快速上手指南\n\nFunASR 是一款由阿里巴巴达摩院开源的基础性端到端语音识别工具包，旨在连接学术研究与工业应用。它支持语音识别（ASR）、语音活动检测（VAD）、标点恢复、说话人验证、说话人日记以及多说话人 ASR 等多种功能，并提供丰富的工业级预训练模型。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**：Linux, macOS, Windows\n*   **Python 版本**：>= 3.8\n*   **核心依赖**：\n    *   PyTorch >= 1.13\n    *   torchaudio\n\n## 安装步骤\n\n### 1. 安装 FunASR\n\n推荐使用 pip 直接安装最新稳定版：\n\n```shell\npip3 install -U funasr\n```\n\n如果您需要从源码安装以获取最新特性：\n\n```sh\ngit clone https:\u002F\u002Fgithub.com\u002Falibaba\u002FFunASR.git && cd FunASR\npip3 install -e .\u002F\n```\n\n### 2. 
\n\n## 基本使用\n\nFunASR 提供了极简的 Python API 来加载预训练模型并进行推理。以下是一个使用中文语音识别模型（Paraformer-large）进行文件转录的最简单示例。\n\n### 示例：语音文件转文字\n\n```python\nfrom funasr import AutoModel\n\n# 初始化模型\n# model: 模型名称，支持自动从 ModelScope 或 HuggingFace 下载\n# device: 运行设备，\"cuda\" 或 \"cpu\"\nmodel = AutoModel(model=\"paraformer-zh\", device=\"cuda\")\n\n# 执行推理\n# input: 音频文件路径，支持单文件或文件列表\nres = model.generate(input=\"example.wav\")\n\n# 打印结果\nprint(res)\n```\n\n**代码说明：**\n*   `AutoModel` 会自动处理模型的下载、加载和设备配置。\n*   首次运行时，如果本地没有模型，它会自动从 ModelScope（若已安装 `modelscope`）或 HuggingFace 下载。\n*   `generate` 方法返回包含识别文本、时间戳等信息的字典列表。\n\n### 进阶：批量处理与参数调整\n\n```python\nfrom funasr import AutoModel\n\nmodel = AutoModel(model=\"paraformer-zh\", device=\"cuda\")\n\n# 批量处理多个文件\nfiles = [\"audio_1.wav\", \"audio_2.wav\"]\nres = model.generate(input=files, batch_size_s=300)\n\nfor item in res:\n    # 每个结果包含 key（音频标识）与 text（识别文本）等字段\n    print(f\"File: {item.get('key', 'unknown')}, Text: {item['text']}\")\n```\n\n通过以上步骤，您即可快速搭建起基于 FunASR 的语音识别服务。更多高级功能（如 VAD、标点恢复、说话人日志等）可参考官方教程文档。","某大型电商客服团队每天需处理数万通用户投诉录音，急需将语音数据转化为可检索的结构化文本以分析服务质量。\n\n### 没有 FunASR 时\n- **转写效率低下**：依赖人工听写或昂贵的第三方 API，处理海量录音耗时数天，严重滞后于业务复盘节奏。\n- **识别准确率不足**：通用模型无法适应电商特有的商品术语和用户口音，导致关键信息（如订单号、诉求点）频繁识别错误。\n- **缺乏结构化处理**：原始转录文本无标点、不分说话人，且包含大量静音噪音，数据清洗需额外编写复杂脚本。\n- **部署成本高昂**：自研高精度模型需要深厚的算法背景和大量算力投入，中小团队难以承担训练与微调门槛。\n\n### 使用 FunASR 后\n- **实时高效转写**：利用 Paraformer-large 等预训练模型，实现工业级高并发推理，万条录音可在小时内完成高精度转写。\n- **领域自适应强**：通过简单的微调功能，快速让模型掌握电商专有词汇，显著降低专有名词和方言的识别错误率。\n- **端到端一站式处理**：内置 VAD（语音活动检测）、标点恢复及说话人分离功能，直接输出带标点、分角色的干净文本，免去繁琐后处理。\n- **开箱即用低成本**：提供丰富的 ModelScope 预训练模型库和简洁的 Python 接口，开发人员无需从零训练，即可在本地或云端快速部署服务。\n\nFunASR 将原本需要数周完成的语音数据处理流程缩短至小时级，并以极低的成本实现了工业级的识别精度与结构化输出。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmodelscope_FunASR_998073e2.png","modelscope","ModelScope","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmodelscope_66a27ef8.png","Model-as-a-Service in the making: bring accessible AI to all.",null,"contact@modelscope.cn","https:\u002F\u002Fwww.modelscope.cn\u002F","https:\u002F\u002Fgithub.com\u002Fmodelscope",[81,85,89,93,97,101,105,108,112,115],{"name":82,"color":83,"percentage":84},"Python","#3572A5",73.2,{"name":86,"color":87,"percentage":88},"C++","#f34b7d",13.5,{"name":90,"color":91,"percentage":92},"Shell","#89e051",3.6,{"name":94,"color":95,"percentage":96},"JavaScript","#f1e05a",3.5,{"name":98,"color":99,"percentage":100},"C#","#178600",2.9,{"name":102,"color":103,"percentage":104},"Vue","#41b883",0.7,{"name":106,"color":107,"percentage":104},"Java","#b07219",{"name":109,"color":110,"percentage":111},"CMake","#DA3434",0.4,{"name":113,"color":114,"percentage":111},"SCSS","#c6538c",{"name":116,"color":117,"percentage":118},"Perl","#0298c3",0.3,15717,1649,"2026-04-18T00:12:28","MIT","Linux, macOS, Windows","非必需（支持 CPU 和 GPU 模式）。GPU 模式下需 NVIDIA 显卡，具体型号和显存未说明，但提及有针对 GPU 内存泄漏的优化及动态批处理功能。","未说明（提及离线文件转录服务优化了 VAD 数据处理以显著降低峰值内存使用）",{"notes":127,"python":128,"dependencies":129},"该工具支持 Windows、Linux 和 macOS。提供专门的 Windows SDK (FunASR-Runtime-Windows) 支持 CPU 推理。Docker 镜像支持 ARM64 平台。若需使用预训练模型，建议安装 modelscope 或 huggingface_hub。支持多种任务包括语音识别、标点恢复、说话人日志及情感识别等。","3.8+",[130,131,132,133],"torch>=1.13","torchaudio","modelscope (可选)","huggingface_hub 
(可选)",[35,14,135],"音频",[137,138,139,140,141,142,143,144,145,146,147,148,149,150,151],"conformer","pytorch","speech-recognition","paraformer","punctuation","speaker-diarization","rnnt","audio-visual-speech-recognition","pretrained-model","voice-activity-detection","whisper","dfsmn","vad","speechgpt","speechllm","2026-03-27T02:49:30.150509","2026-04-18T14:15:10.871857",[155,160,165,170,175,179],{"id":156,"question_zh":157,"answer_zh":158,"source_url":159},40000,"如何在 Windows 上编译和运行 FunASR（特别是 ONNX Runtime 支持）？","FunASR 社区已适配 Windows 软件包，支持离线文件转录和实时听写。用户可以直接下载预编译的 Windows CPU 版本包，无需自行修改代码编译。\n下载地址：https:\u002F\u002Fwww.modelscope.cn\u002Fapi\u002Fv1\u002Fmodels\u002Fdamo\u002Ffunasr-runtime-win-cpu-x64\u002Frepo?Revision=master&FilePath=funasr-runtime-win-cpu-x64-v0.1.0.zip\n使用文档：https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Ffunasr-runtime-win-cpu-x64\u002Fsummary\n\n若需自行编译，主要需解决以下兼容性问题：\n1. Onnx Session 构造函数需将路径转换为宽字符 (StrToWstr)。\n2. 头文件兼容：Windows 下需包含 \u003Cio.h> 并定义 F_OK，替代 \u003Cunistd.h>。\n3. 补充 win_func.h 以实现 gettimeofday 函数。\n4. 将 \u003Ccodecvt> 的 include 移至头部以避免 namespace 错误。\n5. 修改 yaml-cpp 源码，将 `const std::string& input` 改为 `const std::string input` 以修复字符串复制构造问题。","https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FFunASR\u002Fissues\u002F726",{"id":161,"question_zh":162,"answer_zh":163,"source_url":164},40001,"WebSocket C++ 客户端连接服务器发送 WAV 文件后，为何无法返回 ASR 结果并报错 'End of File'？","该问题通常由 `io_thread_num` 参数配置不当引起。默认值可能为 8，但在某些环境下会导致读取失败。\n解决方案：\n1. 确保代码已更新到最新版本。\n2. 在启动 websocketclient 或配置时，将 `io_thread_num` 参数显式设置为 1。\n注意：websocketclient 进程数量也建议设为 1。如果需要高并发，应通过启动多个客户端进程或利用服务器端的多线程处理能力来实现，而不是单纯增加单个客户端的 IO 线程数。","https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FFunASR\u002Fissues\u002F597",{"id":166,"question_zh":167,"answer_zh":168,"source_url":169},40002,"运行 Paraformer 微调或推理时报错 'RuntimeError: The size of tensor a must match the size of tensor b' 如何解决？","此错误通常是因为未正确安装最新代码或数据集格式不匹配导致的。\n排查步骤：\n1. 确认是否使用了最新的代码库。如果是克隆的代码，请执行 `pip install -e .` 进行重新安装，确保生效的是最新代码而非旧版本的 pip 包。\n2. 检查数据集格式。如果使用的是自定义数据集，请对比项目提供的测试数据，确认输入数据的维度（如帧长、特征维度）是否与模型预期一致。错误信息中的维度不匹配（如 13 vs 14）往往暗示数据预处理环节存在问题。","https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FFunASR\u002Fissues\u002F1837",{"id":171,"question_zh":172,"answer_zh":173,"source_url":174},40003,"FunASR 是否支持 Whisper-v3-large-turbo 模型？","目前 FunASR 官方并未直接提供 Whisper-v3-large-turbo 的内置支持管道。不过，社区用户反馈 Sevlrio + Paraformer 的组合效果良好，而 FSMN 模型在某些场景下表现一般。如果需要使用该特定模型，用户可能需要自行构建处理管道或关注后续的版本更新。","https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FFunASR\u002Fissues\u002F2132",{"id":176,"question_zh":177,"answer_zh":178,"source_url":174},40004,"在进行说话人识别（Speaker Recognition）时，为什么必须是词级别（word level）的处理？","关于说话人识别必须基于词级别的具体技术原因，目前在公开讨论中尚未有明确的官方定论。有用户对此表示困惑（\"why speaker recognition must be word level?\"），这可能与当前模型架构（如 CampPlus）在分割音频片段与标签对齐时的实现逻辑有关。如果遇到相关断言错误（如 `AssertionError: len(segments) == len(labels)`），通常意味着 VAD 分割出的片段数量与标签数量不一致，建议检查 VAD 模型的灵敏度设置或后处理逻辑。",{"id":180,"question_zh":181,"answer_zh":182,"source_url":164},40005,"如何使用 PHP 或其他非 Python\u002FC++ 语言作为 WebSocket 客户端调用 FunASR 服务？","虽然官方主要提供 C++ 和 Python 的客户端示例，但 WebSocket 协议是通用的。对于 PHP 或其他语言用户：\n1. 参考官方 C++ 客户端的数据发送格式（通常是二进制 WAV 数据流）。\n2. 确保服务端启动时参数配置正确（如端口、模型路径）。\n3. 客户端需要建立 WebSocket 连接，并按帧或整体发送 WAV 文件的二进制数据。\n4. 
",[184,189,194,199],{"id":185,"version":186,"summary_zh":187,"released_at":188},323498,"v0.3.0","## 新增内容：\n\n### 2023年3月17日，funasr-0.3.0，modelscope-1.4.1\n- 新特性：\n    - 增加了对GPU运行时方案的支持，即[nv-triton](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Ftree\u002Fmain\u002Ffunasr\u002Fruntime\u002Ftriton_gpu)，该方案可便捷地从ModelScope导出Paraformer模型并部署为服务。我们在单块V100 GPU上进行了基准测试，取得了0.0032的RTF和300倍的加速效果。\n    - 增加了对CPU运行时[量化方案](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Ftree\u002Fmain\u002Ffunasr\u002Fexport)的支持，该方案支持从ModelScope导出量化的ONNX和Libtorch模型。我们在CPU-8369B上进行了[基准测试](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Ftree\u002Fmain\u002Ffunasr\u002Fruntime\u002Fpython)，结果表明RTF降低约50%（0.00438 → 0.00226），速度提升了一倍（228 → 442）。\n    - 增加了gRPC服务部署方案的C++版本。与Python运行时相比，C++版本的ONNXRuntime及量化方案效率高出一倍，[示例](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Ftree\u002Fmain\u002Ffunasr\u002Fruntime\u002Fgrpc)。\n    - 为[16k VAD模型](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_fsmn_vad_zh-cn-16k-common-pytorch\u002Fsummary)和[8k VAD模型](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_fsmn_vad_zh-cn-8k-common\u002Fsummary)新增了流式推理管道，支持音频输入流（≥10ms），[示例](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Fdiscussions\u002F236)。\n    - 改进了[标点预测模型](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fpunc_ct-transformer_zh-cn-common-vocab272727-pytorch\u002Fsummary)，准确率有所提升（F-score由55.6提高到56.5）。\n    - 增加了基于gRPC服务的实时字幕示例，采用两遍（two-pass）识别方案。使用[Paraformer流式模型](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online\u002Fsummary)进行实时文本输出，同时利用[Paraformer-large离线模型](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch\u002Fsummary)对识别结果进行校正，[示例](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Ftree\u002Fmain\u002Ffunasr\u002Fruntime\u002Fpython\u002Fgrpc)。\n- 新模型：\n    - 新增了[16k Paraformer流式模型](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online\u002Fsummary)，支持流式音频输入的实时语音识别，[示例](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Fdiscussions\u002F241)。该模型可通过gRPC服务部署，实现实时字幕功能。\n    - 新增了[流式标点模型](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fpunc_ct-transformer_zh-cn-common-vad_realtime-vocab272727\u002Fsummary)，可在流式语音识别场景中实现实时标点标注，并基于VAD检测点进行实时调用。该模型可与rea","2023-03-16T08:15:02",{"id":190,"version":191,"summary_zh":192,"released_at":193},323499,"v0.2.0","## 新增内容：\n\n### 2023.2.17，funasr-0.2.0，ModelScope-1.3.0\n- 我们新增支持一项功能：可从 ModelScope 将 ParaFormer 模型导出为 [ONNX 和 TorchScript 格式](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Ftree\u002Fmain\u002Ffunasr\u002Fexport)。本地微调的模型同样支持。\n- 我们新增支持一项功能：[ONNX Runtime](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Ftree\u002Fmain\u002Ffunasr\u002Fruntime\u002Fpython)，您无需依赖 ModelScope 或 FunASR 即可部署该运行时。以 [ParaFormer-Large](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch\u002Fsummary) 模型为例，在 CPU 上使用 ONNX Runtime 后，实时因子（RTF）提升了 3 倍（从 0.110 降至 
0.038），[详情](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Ftree\u002Fmain\u002Ffunasr\u002Fruntime\u002Fpython\u002Fonnxruntime\u002Fparaformer\u002Frapid_paraformer#speed)。\n- 我们新增支持一项功能：[gRPC](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Ftree\u002Fmain\u002Ffunasr\u002Fruntime\u002Fpython\u002Fgrpc)，您可以通过部署 ModelScope 管道或 ONNX Runtime，利用 gRPC 构建 ASR 服务。\n- 我们发布了一款新模型 [ParaFormer-Large-Contextual](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404\u002Fsummary)，该模型基于激励增强技术实现了热词定制，显著提升了热词的召回率和精确率。\n- 我们优化了 [ParaFormer-Large-Long](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch\u002Fsummary) 的时间戳对齐算法，大幅提高了时间戳预测的准确性，累计平均偏移量（AAS）达到 74.7 毫秒，[详情](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.12343)。\n- 我们发布了一款新模型，即 [8kHz VAD 模型](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_fsmn_vad_zh-cn-16k-common-pytorch\u002Fsummary)，该模型能够准确预测非静音语音的持续时间。它可与 ModelScope 中的任何 ASR 模型自由集成，[详情](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Fdiscussions\u002F134)。\n- 我们发布了一款新模型，[MFCCA](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FNPU-ASLP\u002Fspeech_mfcca_asr-zh-cn-16k-alimeeting-vocab4950\u002Fsummary)，这是一款多通道、多说话人模型，不受麦克风数量和布局的影响，适用于中文会议转写任务。\n- 我们还发布了多款新的 UniASR 模型：[闽南语方言模型](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_UniASR_asr_2pass-minnan-16k-common-vocab3825\u002Fsummary)、[法语模型](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_UniASR_asr_2pass-fr-16k-common-vocab3472-tensorflow1-online\u002Fsummary)、[德语模型](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_UniASR_asr_2pass-de-16k-common-vocab3690-tensorflow1-online\u002Fsummary)、[越南语模型](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_UniASR_asr_2pass-vi-16k-common-vocab1001-pytorch-online\u002Fsummary)以及[波斯语模型](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_UniASR_asr_2pass-fa-16k-common-vocab1257-pytorch-online\u002Fsummary)。\n- 我们发布了一款新模型，[ParaFormer-Data2Vec 模型](https","2023-02-20T02:22:02",{"id":195,"version":196,"summary_zh":197,"released_at":198},323500,"v0.1.6","## 发布说明：\r\n### 2023.1.16，funasr-0.1.6\r\n- 我们发布了一个新版本模型[Paraformer-large-long](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch\u002Fsummary)，该模型集成了[VAD](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_fsmn_vad_zh-cn-16k-common-pytorch\u002Fsummary)模型、[ASR](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch\u002Fsummary)、[标点](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fpunc_ct-transformer_zh-cn-common-vocab272727-pytorch\u002Fsummary)模型以及时间戳功能。该模型可以处理长达数小时的输入音频。\r\n- 我们发布了一种新型模型，即[VAD](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_fsmn_vad_zh-cn-16k-common-pytorch\u002Fsummary)，它可以预测非静音语音的持续时间。该模型可以自由地与[Model Zoo](docs\u002Fmodelscope_models.md)中的任何ASR模型进行集成。\r\n- 我们发布了一种新型模型，即[Punctuation](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fpunc_ct-transformer_zh-cn-common-vocab272727-pytorch\u002Fsummary)，它可以为ASR模型的识别结果添加标点符号。该模型也可以自由地与[Model Zoo](docs\u002Fmodelscope_models.md)中的任何ASR模型进行集成。\r\n- 
我们发布了一个新模型，即[Data2vec](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_data2vec_pretrain-zh-cn-aishell2-16k-pytorch\u002Fsummary)，这是一个无监督预训练模型，可用于ASR及其他下游任务的微调。\r\n- 我们发布了一个新模型，即[Paraformer-Tiny](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch\u002Fsummary)，这是一个轻量级的Paraformer模型，支持普通话指令词识别。\r\n- 我们发布了一种新型模型，即[SV](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch\u002Fsummary)，它可以提取说话人嵌入，并进一步对成对的语音片段进行说话人验证。未来版本中还将支持说话人日志（说话人分离）功能。\r\n- 我们优化了ModelScope的推理流程，通过将模型构建过程整合到流水线构建中，从而加快了推理速度。\r\n- ModelScope的推理流水线现在支持多种类型的音频输入格式，包括wav.scp、wav格式、音频字节流、波形样本等。\r\n\r\n## 最新更新\r\n- 2023年1月（1月16号发布）：[funasr-0.1.6](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR\u002Ftree\u002Fmain), modelscope-1.2.0\r\n  - 上线新模型：\r\n    - [Paraformer-large长音频模型](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch\u002Fsummary)，集成VAD、ASR、标点与时间戳功能，可直接对时长为数小时的音频进行识别，并输出带标点文字与时间戳。\r\n    - [中文无监督预训练Data2vec模型](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_data2vec_pretrain-zh-cn-aishell2-16k-pytorch\u002Fsummary)，采用Data2vec结构，基于AISHELL-2数据的中文无监督预训练模型，可用于ASR或其他下游任务的微调。\r\n    - [16k语音端点检测VAD模型](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_fsmn_vad_zh-cn-16k-common-pytorch\u002Fsummary)，可用于检测长语音片段中有效语音的起止时间点。\r\n    - [中文标点预测通用模型](https:\u002F\u002Fwww.modelscope.cn\u002F","2023-01-16T11:28:23",{"id":200,"version":201,"summary_zh":202,"released_at":203},323501,"v0.1.4","这是首个发布版本。  \n1. Paraformer 模型支持批量解码，且批次大小可大于 1。  \n2. 新增了 UniASR 模型及相应的训练配方。  \n3. 同时包含了 Transformer 和 Conformer 模型。  \n4. 在 ModelScope 平台上进行模型推理和微调更加便捷。","2022-12-10T04:54:23"]