[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-FunAudioLLM--SenseVoice":3,"tool-FunAudioLLM--SenseVoice":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",145895,2,"2026-04-08T11:32:59",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108111,"2026-04-08T11:23:26",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 
  It also provides an MCP (Model Context Protocol) server for integration with LLM apps such as Claude Desktop, and is aimed at developers, data scientists and AI researchers building RAG systems, running batch text analysis, or letting assistants read local files.
- **LLMs-from-scratch** (`rasbt/LLMs-from-scratch`, 90,106 stars): A PyTorch-based educational project, and the official code repository for the book of the same name, that walks you through building a ChatGPT-like large language model from scratch, covering model development, pretraining and finetuning. Writing every core line yourself gives a concrete understanding of the Transformer architecture and attention mechanisms rather than treating models as black boxes, and the repo also includes code for loading large pretrained weights for finetuning. It is aimed at AI developers, researchers and students who want to go beyond calling APIs, with a step-by-step design, detailed diagrams and examples that make building a small but complete model approachable.

# FunAudioLLM/SenseVoice

**Multilingual Voice Understanding Model**

SenseVoice is a powerful multilingual voice understanding foundation model built to help machines "hear" human speech more accurately. Besides high-accuracy automatic speech recognition (ASR) across more than 50 languages, including Chinese, English and Japanese, it simultaneously recognizes the speaker's emotion (such as happiness or sadness) and detects audio events in the background (applause, laughter, coughing, background music, and so on). For developers, researchers and enterprises dealing with complex speech scenarios, it addresses the limits of models that do only one thing, infer slowly, or miss emotional and non-verbal cues. Its core strengths are performance and efficiency: trained on more than 400,000 hours of data, its multilingual recognition accuracy surpasses the well-known Whisper model, and its non-autoregressive end-to-end architecture processes 10 seconds of audio in about 70 ms, 15 times faster than Whisper-Large, which makes it well suited to real-time interaction. It also ships convenient finetuning scripts and multi-language service deployment options, making it a solid base for voice assistants, meeting analytics and accessibility tools.

([简体中文](./README_zh.md)|English|[日本語](./README_ja.md))

# Introduction

SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED).
<div align="center">
<img src="https://oss.gittoolsai.com/images/FunAudioLLM_SenseVoice_readme_5314ab6db0ec.png">
</div>

<div align="center">
<h4>
<a href="https://funaudiollm.github.io/"> Homepage </a>
｜<a href="#What's News"> What's News </a>
｜<a href="#Benchmarks"> Benchmarks </a>
｜<a href="#Install"> Install </a>
｜<a href="#Usage"> Usage </a>
｜<a href="#Community"> Community </a>
</h4>

Model Zoo:
[modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall), [huggingface](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)

Online Demo:
[modelscope demo](https://www.modelscope.cn/studios/iic/SenseVoice), [huggingface space](https://huggingface.co/spaces/FunAudioLLM/SenseVoice)

</div>

<a name="Highlights"></a>
# Highlights 🎯
**SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
- **Multilingual Speech Recognition:** Trained on over 400,000 hours of data and supporting more than 50 languages, its recognition performance surpasses that of the Whisper model.
- **Rich Transcription:**
  - Excellent emotion recognition, matching and surpassing the effectiveness of the current best emotion recognition models on test data.
  - Sound event detection, covering common human-computer interaction events such as BGM, applause, laughter, crying, coughing, and sneezing.
- **Efficient Inference:** The SenseVoice-Small model uses a non-autoregressive end-to-end framework, giving exceptionally low inference latency: only 70 ms to process 10 seconds of audio, 15 times faster than Whisper-Large.
- **Convenient Finetuning:** Convenient finetuning scripts and strategies let users address long-tail sample issues according to their business scenarios.
- **Service Deployment:** A service deployment pipeline supports multi-concurrent requests, with client-side languages including Python, C++, HTML, Java, and C#, among others.

<a name="What's News"></a>
# What's New 🔥
- 2024/11: Added support for timestamps based on CTC alignment.
- 2024/7: Added export features for [ONNX](./demo_onnx.py) and [libtorch](./demo_libtorch.py), as well as Python runtimes: [funasr-onnx-0.4.0](https://pypi.org/project/funasr-onnx/), [funasr-torch-0.1.1](https://pypi.org/project/funasr-torch/).
- 2024/7: The [SenseVoice-Small](https://www.modelscope.cn/models/iic/SenseVoiceSmall) voice understanding model was open-sourced. It offers high-precision multilingual speech recognition, emotion recognition, and audio event detection for Mandarin, Cantonese, English, Japanese, and Korean, with exceptionally low inference latency.
- 2024/7: CosyVoice for natural speech generation with multi-language, timbre, and emotion control. CosyVoice excels in multilingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction-following capabilities. [CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice) and [CosyVoice space](https://www.modelscope.cn/studios/iic/CosyVoice-300M).
- 2024/7: [FunASR](https://github.com/modelscope/FunASR) is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), voice activity detection (VAD), punctuation restoration, language models, speaker verification, speaker diarization, and multi-talker ASR.

<a name="Benchmarks"></a>
# Benchmarks 📝

## Multilingual Speech Recognition
We compared the multilingual speech recognition performance of SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, WenetSpeech, LibriSpeech, and Common Voice. For Chinese and Cantonese recognition, the SenseVoice-Small model has the advantage.

<div align="center">
<img src="https://oss.gittoolsai.com/images/FunAudioLLM_SenseVoice_readme_1a8d1f8ce5fd.png" width="400" /><img src="https://oss.gittoolsai.com/images/FunAudioLLM_SenseVoice_readme_d82aef7e5122.png" width="400" />
</div>

## Speech Emotion Recognition

Because there is currently no widely used benchmark or method for speech emotion recognition, we evaluated several metrics on multiple test sets and made a comprehensive comparison with numerous results from recent benchmarks. The selected test sets cover both Chinese and English data and multiple styles, including performances, films, and natural conversations. Without finetuning on the target data, SenseVoice matched and exceeded the performance of the current best speech emotion recognition models.

<div align="center">
<img src="https://oss.gittoolsai.com/images/FunAudioLLM_SenseVoice_readme_bf6f09fc7b14.png" width="1000" />
</div>

Furthermore, we compared multiple open-source speech emotion recognition models on the test sets. The results indicate that the SenseVoice-Large model achieved the best performance on nearly all datasets, while the SenseVoice-Small model also surpassed other open-source models on the majority of them.

<div align="center">
<img src="https://oss.gittoolsai.com/images/FunAudioLLM_SenseVoice_readme_451060d837db.png" width="500" />
</div>

## Audio Event Detection

Although trained exclusively on speech data, SenseVoice can still function as a standalone event detection model. We compared its performance on the ESC-50 environmental sound classification dataset against the widely used industry models BEATs and PANN. SenseVoice achieved commendable results on these tasks; however, due to limitations in training data and methodology, its event classification performance still has some gaps compared to specialized AED models.

<div align="center">
<img src="https://oss.gittoolsai.com/images/FunAudioLLM_SenseVoice_readme_5131e980f9e2.png" width="500" />
</div>

## Computational Efficiency

The SenseVoice-Small model uses a non-autoregressive end-to-end architecture, resulting in extremely low inference latency. With a similar number of parameters to the Whisper-Small model, it infers more than 5 times faster than Whisper-Small and 15 times faster than Whisper-Large.

<div align="center">
<img src="https://oss.gittoolsai.com/images/FunAudioLLM_SenseVoice_readme_f52717bbaf44.png" width="1000" />
</div>
# Requirements

```shell
pip install -r requirements.txt
```

<a name="Usage"></a>
# Usage

## Inference

Supports audio input in any format and of any duration.

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = AutoModel(
    model=model_dir,
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

# en
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
```

<details><summary>Parameter Description (Click to Expand)</summary>

- `model_dir`: The name of the model, or the path to the model on the local disk.
- `trust_remote_code`:
  - When `True`, the model's code implementation is loaded from `remote_code`, which specifies the exact location of the `model` code (for example, `model.py` in the current directory). Absolute paths, relative paths, and network URLs are supported.
  - When `False`, the model's code implementation is the integrated version within [FunASR](https://github.com/modelscope/FunASR). Modifications made to `model.py` in the current directory will then not take effect, since the version loaded is the internal one from FunASR. For the model code, [click here to view](https://github.com/modelscope/FunASR/tree/main/funasr/models/sense_voice).
- `vad_model`: Activates VAD (voice activity detection), which splits long audio into shorter clips. In this case the reported inference time covers both VAD and SenseVoice, i.e. the end-to-end latency. To measure the SenseVoice model's inference time on its own, disable the VAD model.
- `vad_kwargs`: Configuration for the VAD model. `max_single_segment_time` is the maximum duration of a segment produced by the `vad_model`, in milliseconds (ms).
- `use_itn`: Whether the output includes punctuation and inverse text normalization.
- `batch_size_s`: Enables dynamic batching; the total duration of audio in a batch, measured in seconds (s).
- `merge_vad`: Whether to merge short audio fragments produced by the VAD model, up to a merged length of `merge_length_s`, in seconds (s).
- `ban_emo_unk`: Whether to ban output of the `emo_unk` token.
</details>
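The `rich_transcription_postprocess` call above folds SenseVoice's special tokens into readable text. If you instead want the language, emotion, and event labels as structured fields, one option is to pull them out of the raw `text` before postprocessing. A minimal sketch; the tag format is taken from the label lists in the finetuning section below, and the `split_tags` helper is a hypothetical addition, not part of FunASR:

```python
import re

# Raw SenseVoice output prefixes the transcript with tags such as
# <|en|><|NEUTRAL|><|Speech|><|withitn|> (tag names per the label lists below).
TAG_PATTERN = re.compile(r"<\|([^|]+)\|>")

def split_tags(raw_text):
    """Return (tags, plain_text) for one raw SenseVoice result string."""
    tags = TAG_PATTERN.findall(raw_text)
    plain = TAG_PATTERN.sub("", raw_text).strip()
    return tags, plain

# Example with a made-up raw string in the documented format:
tags, plain = split_tags("<|en|><|NEUTRAL|><|Speech|><|withitn|>Hello world.")
print(tags)   # ['en', 'NEUTRAL', 'Speech', 'withitn']
print(plain)  # Hello world.
```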
If all inputs are short audios (under 30 s) and batch inference is needed to speed things up, the VAD model can be removed and `batch_size` set accordingly.
```python
model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")

res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="zh",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=False,
    batch_size=64,
)
```

For more usage, please refer to the [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md).

### Inference directly

Supports audio input in any format, with an input duration limit of 30 seconds or less.

```python
from model import SenseVoiceSmall
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"
m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0")
m.eval()

res = m.inference(
    data_in=f"{kwargs['model_path']}/example/en.mp3",
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=False,
    ban_emo_unk=False,
    **kwargs,
)

text = rich_transcription_postprocess(res[0][0]["text"])
print(text)
```

### Export and Test
<details><summary>ONNX and Libtorch Export</summary>

#### ONNX
```python
# pip3 install -U funasr funasr-onnx
from pathlib import Path
from funasr_onnx import SenseVoiceSmall
from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True)

# inference
wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]

res = model(wav_or_scp, language="auto", use_itn=True)
print([rich_transcription_postprocess(i) for i in res])
```
Note: the ONNX model is exported to the original model directory.

#### Libtorch
```python
from pathlib import Path
from funasr_torch import SenseVoiceSmall
from funasr_torch.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = SenseVoiceSmall(model_dir, batch_size=10, device="cuda:0")

wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]

res = model(wav_or_scp, language="auto", use_itn=True)
print([rich_transcription_postprocess(i) for i in res])
```
Note: the Libtorch model is exported to the original model directory.
</details>

## Service

### Deployment with FastAPI
```shell
export SENSEVOICE_DEVICE=cuda:0
fastapi run --port 50000
```
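Once the service is running, any HTTP client can call it. A rough client sketch using `requests`; the route `/api/v1/asr` and the `files`/`keys`/`lang` field names are assumptions about the bundled FastAPI app rather than documented here, so check the repository's `api.py` for the actual interface:

```python
import requests

# Assumed endpoint of the FastAPI service started above; verify the real route in api.py.
url = "http://127.0.0.1:50000/api/v1/asr"

with open("example.wav", "rb") as f:
    resp = requests.post(
        url,
        files={"files": ("example.wav", f, "audio/wav")},  # assumed field name
        data={"keys": "example", "lang": "auto"},           # assumed form fields
        timeout=60,
    )

resp.raise_for_status()
print(resp.json())
```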
## 🐳 Docker Support

SenseVoice can be built and run using Docker to simplify setup, ensure reproducibility, and support both CPU and GPU inference.

### Build with Docker
```bash
docker build -t sensevoice .
```

### Run (GPU, default)
```bash
docker run --gpus all -p 50000:50000 sensevoice
```

### Run (CPU-only)
```bash
docker run -e SENSEVOICE_DEVICE=cpu -p 50000:50000 sensevoice
```

### Docker Compose
Docker Compose provides an easier way to run SenseVoice with persistent model caching, networking, and so on.

### Start Stack
```bash
docker compose up --build
```

## Finetune

### Requirements

```shell
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./
```

### Data prepare

Data examples

```text
{"key": "YOU0000008470_S0000238_punc_itn", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Including legal due diligence, subscription agreement, negotiation.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/YOU0000008470_S0000238.wav", "target_len": 7, "source_len": 140}
{"key": "AUD0000001556_S0007580", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "there is a tendency to identify the self or take interest in what one has got used to", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/AUD0000001556_S0007580.wav", "target_len": 18, "source_len": 360}
```

See `data/train_example.jsonl` for the full reference.

<details><summary>Data Prepare Details</summary>

Description:
- `key`: unique ID of the audio file
- `source`: path to the audio file
- `source_len`: number of fbank frames of the audio file
- `target`: transcription
- `target_len`: length of the target
- `text_language`: language ID of the audio file
- `emo_target`: emotion label of the audio file
- `event_target`: event label of the audio file
- `with_or_wo_itn`: whether the target includes punctuation and inverse text normalization

`train_text.txt`

```bash
BAC009S0764W0121 甚至出现交易几乎停滞的情况
BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万
asr_example_cn_en 所有只要处理 data 不管你是做 machine learning 做 deep learning 做 data analytics 做 data science 也好 scientist 也好通通都要都做的基本功啊那 again 先先对有一些>也许对
ID0012W0014 he tried to think how it could be
```

`train_wav.scp`

```bash
BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav
BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav
asr_example_cn_en https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_cn_en.wav
ID0012W0014 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav
```
`train_text_language.txt`

The language IDs include `<|zh|>`, `<|en|>`, `<|yue|>`, `<|ja|>`, and `<|ko|>`.

```bash
BAC009S0764W0121 <|zh|>
BAC009S0916W0489 <|zh|>
asr_example_cn_en <|zh|>
ID0012W0014 <|en|>
```

`train_emo.txt`

The emotion labels include `<|HAPPY|>`, `<|SAD|>`, `<|ANGRY|>`, `<|NEUTRAL|>`, `<|FEARFUL|>`, `<|DISGUSTED|>`, and `<|SURPRISED|>`.

```bash
BAC009S0764W0121 <|NEUTRAL|>
BAC009S0916W0489 <|NEUTRAL|>
asr_example_cn_en <|NEUTRAL|>
ID0012W0014 <|NEUTRAL|>
```

`train_event.txt`

The event labels include `<|BGM|>`, `<|Speech|>`, `<|Applause|>`, `<|Laughter|>`, `<|Cry|>`, `<|Sneeze|>`, `<|Breath|>`, and `<|Cough|>`.

```bash
BAC009S0764W0121 <|Speech|>
BAC009S0916W0489 <|Speech|>
asr_example_cn_en <|Speech|>
ID0012W0014 <|Speech|>
```

`Command`
```shell
# generate train.jsonl and val.jsonl from wav.scp, text.txt, text_language.txt, emo_target.txt, event_target.txt
sensevoice2jsonl \
++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt", "../../../data/list/train_text_language.txt", "../../../data/list/train_emo.txt", "../../../data/list/train_event.txt"]' \
++data_type_list='["source", "target", "text_language", "emo_target", "event_target"]' \
++jsonl_file_out="../../../data/list/train.jsonl"
```

If there is no `train_text_language.txt`, `train_emo_target.txt`, or `train_event_target.txt`, the language, emotion, and event labels will be predicted automatically by the `SenseVoice` model.
```shell
# generate train.jsonl and val.jsonl from wav.scp and text.txt
sensevoice2jsonl \
++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt"]' \
++data_type_list='["source", "target"]' \
++jsonl_file_out="../../../data/list/train.jsonl" \
++model_dir='iic/SenseVoiceSmall'
```
</details>
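For a small custom dataset you can also assemble `train.jsonl` records directly rather than going through `sensevoice2jsonl`. A minimal sketch of one record using the fields documented above; the use of `soundfile` and the assumption of roughly 100 fbank frames per second (a 10 ms frame shift) for `source_len` are mine, not from the README, so double-check them against FunASR's feature configuration:

```python
import json

import soundfile as sf  # assumed extra dependency, used only to read the audio duration

def make_record(key, wav_path, text, lang="<|zh|>", emo="<|NEUTRAL|>",
                event="<|Speech|>", itn="<|woitn|>"):
    """Build one train.jsonl record with the fields documented above."""
    duration = sf.info(wav_path).duration
    return {
        "key": key,
        "text_language": lang,
        "emo_target": emo,
        "event_target": event,
        "with_or_wo_itn": itn,
        "target": text,
        "source": wav_path,
        "target_len": len(text.split()),    # word count, as in the English examples above
        "source_len": int(duration * 100),  # assumption: ~100 fbank frames per second
    }

with open("train.jsonl", "w", encoding="utf-8") as f:
    rec = make_record("utt_0001", "/data/audio/utt_0001.wav",
                      "he tried to think how it could be", lang="<|en|>")
    f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```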
### Finetune

Make sure to set `train_tool` in `finetune.sh` to the absolute path of `funasr/bin/train_ds.py` in the FunASR installation directory you set up earlier.

```shell
bash finetune.sh
```

## WebUI

```shell
python webui.py
```

<div align="center"><img src="https://oss.gittoolsai.com/images/FunAudioLLM_SenseVoice_readme_d389e37a0a68.png" width="700"/></div>

## Remarkable Third-Party Work
- Triton (GPU) deployment best practices: using Triton + TensorRT, tested with FP32, achieving an acceleration ratio of 526 on a V100 GPU. FP16 support is in progress. [Repository](https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/README.md)
- Sherpa-onnx deployment best practices: supports using SenseVoice from 10 programming languages (C++, C, Python, C#, Go, Swift, Kotlin, Java, JavaScript, and Dart) and deploying it on platforms such as iOS, Android, and Raspberry Pi. [Repository](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html)
- [SenseVoice.cpp](https://github.com/lovemefan/SenseVoice.cpp): inference of SenseVoice in pure C/C++ based on GGML, supporting 3-bit, 4-bit, 5-bit, and 8-bit quantization, with no third-party dependencies.
- [streaming-sensevoice](https://github.com/pengzhendong/streaming-sensevoice) processes inference in chunks. To achieve pseudo-streaming, it employs a truncated attention mechanism, sacrificing some accuracy. It also supports CTC prefix beam search and hot-word boosting.
- [SenseVoice Hotword](https://www.modelscope.cn/models/dengcunqin/SenseVoiceSmall_hotword): neural-network hotword enhancement, [Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network](https://mp.weixin.qq.com/s/1QkIvh8j7rrUjRyWOgAvdA).

<a name="Community"></a>
# Community
If you run into problems, you can raise an issue directly on the GitHub page.

You can also scan the DingTalk group QR code below to join the community group for discussion.

|                          FunASR                          |
|:--------------------------------------------------------:|
| <img src="https://oss.gittoolsai.com/images/FunAudioLLM_SenseVoice_readme_a812379040e9.png" width="250"/> |
# SenseVoice Quick Start Guide

SenseVoice is a multi-purpose speech foundation model supporting automatic speech recognition (ASR), language identification (LID), speech emotion recognition (SER), and audio event detection (AED). It covers 50+ languages, is extremely fast at inference (15 times faster than Whisper-Large), and ships with finetuning and service-deployment options.

## Environment

- **Operating system**: Linux / macOS / Windows
- **Python version**: 3.8 - 3.10
- **Hardware**:
  - GPU inference (recommended): NVIDIA GPU + CUDA 11.x+
  - CPU inference: supported, but slower
- **Prerequisites**:
  - `torch` (2.0+ recommended)
  - the `funasr` core library
  - optional: `onnxruntime`, `libtorch` (for export and acceleration)
## Installation

A domestic (Chinese) PyPI mirror is recommended to speed up installation.

### 1. Install the base dependencies

```bash
pip install -r requirements.txt
```

To install the core libraries manually (the Aliyun or Tsinghua mirrors are recommended):

```bash
pip install funasr -i https://mirrors.aliyun.com/pypi/simple/
pip install modelscope -i https://mirrors.aliyun.com/pypi/simple/
```

### 2. Verify the installation

Make sure the modules import without errors:

```python
from funasr import AutoModel
print("Installation successful!")
```

## Basic Usage

The examples below show the simplest speech-to-text plus emotion analysis flow through the `AutoModel` interface. It handles long-audio slicing (VAD) automatically and accepts audio of any duration and format.

### Quick inference example

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# Model directory; downloaded automatically from ModelScope
model_dir = "iic/SenseVoiceSmall"

# Initialize the model
model = AutoModel(
    model=model_dir,
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",                           # enable VAD for long audio
    vad_kwargs={"max_single_segment_time": 30000},  # maximum segment length: 30 s
    device="cuda:0",                                # change to "cpu" if no GPU is available
)

# Run inference
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",  # replace with your audio path
    cache={},
    language="auto",   # supported: "zh", "en", "yue", "ja", "ko", "nospeech", "auto"
    use_itn=True,      # inverse text normalization (punctuation, number formatting)
    batch_size_s=60,   # dynamic batching, in seconds
    merge_vad=True,    # merge VAD segments
    merge_length_s=15, # merge threshold
)

# Post-process and print the result
text = rich_transcription_postprocess(res[0]["text"])
print(text)
```

### Batch inference for short audio (high-throughput mode)

If every input clip is shorter than 30 seconds and you need high-throughput batch processing, disable VAD for better efficiency:

```python
from funasr import AutoModel

model_dir = "iic/SenseVoiceSmall"

# No VAD model loaded
model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")

res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="zh",     # fix the language
    use_itn=False,
    batch_size=64,     # static batch size
)

print(res)
```

### Parameter notes
- `language`: the language to recognize; set to `"auto"` for automatic detection.
- `use_itn`: set to `True` to get punctuated, number-formatted text (e.g. "one hundred" becomes "100").
- `device`: choose `"cuda:0"` or `"cpu"` depending on your hardware.
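To run the same pipeline over a folder of short clips, one straightforward pattern is to collect the paths and pass them in as a list. A minimal sketch, assuming `generate()` accepts a list of file paths (FunASR's `AutoModel` interface generally does) and returns results in input order; if your version does not, fall back to looping over the files one by one:

```python
from pathlib import Path

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True, device="cuda:0")

# Collect all .wav/.mp3 clips from a local folder (the path is just an example).
audio_dir = Path("./clips")
files = sorted(str(p) for p in audio_dir.iterdir() if p.suffix in {".wav", ".mp3"})

# Assumption: a list of paths is accepted as `input`; otherwise loop file by file.
res = model.generate(input=files, cache={}, language="auto", use_itn=True, batch_size=16)

for path, item in zip(files, res):  # assumes results come back in input order
    print(path, "->", rich_transcription_postprocess(item["text"]))
```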
# Example Use Case

The customer-service team of a multinational e-commerce platform processes a large volume of voice complaint recordings from more than 50 countries every day, and needs to extract customer emotion, identify the language, and filter background noise to improve its service workflows.

### Without SenseVoice
- **Costly model stacking**: ASR transcription, LID language identification, and SER emotion analysis each require a separately deployed model, making the architecture complex and hard to maintain.
- **High real-time latency**: traditional autoregressive models such as Whisper-Large take over a second to process 10 seconds of audio, too slow to alert agents to a customer's mood during a live call.
- **Missed cues**: background laughter, applause, or coughing is hard to separate from speech, so non-speech interference gets treated as valid content and ticket classification accuracy suffers.
- **Weak long-tail language support**: recognition accuracy for Japanese, Korean, or Cantonese is insufficient, leading to transcription errors and cross-cultural misunderstandings.

### With SenseVoice
- **One model for everything**: a single SenseVoice model handles multilingual transcription, emotion classification, and audio event detection, greatly simplifying the stack and cutting compute costs.
- **Ultra-low-latency interaction**: thanks to its non-autoregressive architecture, SenseVoice analyzes 10 seconds of audio in about 70 ms, roughly 15 times faster than Whisper, enabling genuine real-time emotion alerts.
- **Fine-grained scene awareness**: it recognizes background music, crying, sneezing, and other events and filters out irrelevant noise, so the service system focuses on the actual request.
- **Accurate multilingual coverage**: trained on 400,000 hours of data, SenseVoice performs strongly across 50+ languages, including Chinese dialects, noticeably improving customer satisfaction worldwide.

With one model covering many capabilities at very high speed, SenseVoice turns a fragmented, laggy speech-analysis pipeline into a real-time, accurate decision hub.

# Project Metadata

- **Owner**: [FunAudioLLM](https://github.com/FunAudioLLM)
- **Languages**: Python 96.2%, Shell 2.7%, Dockerfile 1.1%
- **Stars / forks**: 7,932 / 718
- **License**: NOASSERTION
- **Operating systems**: Linux, macOS, Windows
- **GPU**: not required (CPU is supported); an NVIDIA card is recommended for GPU use, and the sample code specifies `device='cuda:0'`; exact VRAM and CUDA requirements are not stated in the docs
- **RAM**: not specified
- **Python**: not specified (install the dependencies in `requirements.txt`)
- **Key dependencies**: funasr, funasr-onnx (optional), funasr-torch (optional), fastapi (for service deployment)
- **Environment notes**:
  1. The model can be built and run with Docker, supporting both CPU and GPU inference.
  2. Core functionality depends on the FunASR toolkit; finetuning requires cloning the FunASR repo and installing it in editable mode.
  3. Export to ONNX and LibTorch is supported for optimized inference.
  4. The default examples use the 'iic/SenseVoiceSmall' model, which supports Chinese, English, Cantonese, Japanese, Korean, and other languages.
- **GitHub topics**: ai, asr, gpt-4o, speech-recognition, speech-to-text, aigc, audio-event-classification, cross-lingual, llm, python, pytorch, speech-emotion-recognition, multilingual

# FAQ

**Q: Running demo_funasr.py fails with `ImportError: cannot import name 'rich_transcription_postprocess'`. What should I do?**
A: This is caused by an incompatible funasr version. Upgrade or install the required version: `pip install funasr==1.1.1` (or make sure the version is >= 1.1.1). ([issue #39](https://github.com/FunAudioLLM/SenseVoice/issues/39))

**Q: Recognition is inaccurate, or the model fails to load, when running inference with FunASR. What might be the cause?**
A: This is usually a compatibility problem caused by a PyTorch version that is too new. Make sure torch <= 2.3 (for example 2.3.1). If the error persists after switching the torch version, reinstall torchaudio and funasr so the environment is consistent. ([issue #77](https://github.com/FunAudioLLM/SenseVoice/issues/77))

**Q: How do I resolve `AssertionError: choose a window size 400 that is [2, 0]`?**
A: The error is related to the merge_vad parameter. The simplest fix is not to set merge_vad in model.generate at all. If you must change the behavior, you can set the default min_length in the source code to 400, or simply drop the merge_vad setting, since VAD segmentation is usually already thorough. ([issue #30](https://github.com/FunAudioLLM/SenseVoice/issues/30))

**Q: When finetuning SenseVoice on a dialect, what should the text_language field be? Are the other fields mandatory?**
A: For dialect training, text_language usually stays a generic language tag such as <|zh|>, unless you configure a custom language token. Not every field in the example data (such as emo_target or with_or_wo_itn) is strictly required; the minimal set is key, text_language, target, and source, but missing fields may degrade the output or require default handling. Refer to the source code for the details, or to define custom language tokens. ([issue #58](https://github.com/FunAudioLLM/SenseVoice/issues/58))

**Q: AED (audio event detection) performs poorly, or training reports errors. What are the common causes?**
A: Common causes include: 1) insufficient training data; extending the dataset (for example with ESC-50) to cover more event types improves accuracy; 2) mis-written tokens in the configuration, such as a misspelled event-target token; 3) data format problems; for non-speech data, set the language to <|nospeech|>, the emotion to <|EMO_UNKNOWN|>, the event to the corresponding token, and leave the target text empty. ([issue #91](https://github.com/FunAudioLLM/SenseVoice/issues/91))

**Q: How do I fix device-mismatch errors or slow inference when exporting or running with LibTorch?**
A: 1) Device mismatch: the device used when exporting the LibTorch model (CPU/GPU) must match the device used at inference time, or you need to handle device mapping in code. 2) Slow inference: LibTorch inference being slower than the Python script is usually due to missing VAD preprocessing or batch sorting. Follow auto_model.py in the FunASR source code (use the vad_model and sort and batch the vad_slices), or try the int8-quantized ONNX model with C++ inference for higher performance. ([issue #68](https://github.com/FunAudioLLM/SenseVoice/issues/68))
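The first two FAQ entries come down to pinning versions. One possible pin set based on the versions named in those answers (torch 2.3.1 with the matching torchaudio, and funasr at or above 1.1.1); adjust to your CUDA setup as needed:

```shell
pip install "torch==2.3.1" "torchaudio==2.3.1" "funasr>=1.1.1"
```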