[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-huggingface--distil-whisper":3,"tool-huggingface--distil-whisper":64},[4,23,32,40,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,2,"2026-04-10T11:13:16",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},4128,"GPT-SoVITS","RVC-Boss\u002FGPT-SoVITS","GPT-SoVITS 是一款强大的开源语音合成与声音克隆工具，旨在让用户仅需极少量的音频数据即可训练出高质量的个性化语音模型。它核心解决了传统语音合成技术依赖海量录音数据、门槛高且成本大的痛点，实现了“零样本”和“少样本”的快速建模：用户只需提供 5 秒参考音频即可即时生成语音，或使用 1 分钟数据进行微调，从而获得高度逼真且相似度极佳的声音效果。\n\n该工具特别适合内容创作者、独立开发者、研究人员以及希望为角色配音的普通用户使用。其内置的友好 WebUI 界面集成了人声伴奏分离、自动数据集切片、中文语音识别及文本标注等辅助功能，极大地降低了数据准备和模型训练的技术门槛，让非专业人士也能轻松上手。\n\n在技术亮点方面，GPT-SoVITS 不仅支持中、英、日、韩、粤语等多语言跨语种合成，还具备卓越的推理速度，在主流显卡上可实现实时甚至超实时的生成效率。无论是需要快速制作视频配音，还是进行多语言语音交互研究，GPT-SoVITS 都能以极低的数据成本提供专业级的语音合成体验。",56375,3,"2026-04-05T22:15:46",[21],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},2863,"TTS","coqui-ai\u002FTTS","🐸TTS 是一款功能强大的深度学习文本转语音（Text-to-Speech）开源库，旨在将文字自然流畅地转化为逼真的人声。它解决了传统语音合成技术中声音机械生硬、多语言支持不足以及定制门槛高等痛点，让高质量的语音生成变得触手可及。\n\n无论是希望快速集成语音功能的开发者，还是致力于探索前沿算法的研究人员，亦或是需要定制专属声音的数据科学家，🐸TTS 都能提供得力支持。它不仅预置了覆盖全球 1100 多种语言的训练模型，让用户能够即刻上手，还提供了完善的工具链，支持用户利用自有数据训练新模型或对现有模型进行微调，轻松实现特定风格的声音克隆。\n\n在技术亮点方面，🐸TTS 表现卓越。其最新的 ⓍTTSv2 模型支持 16 种语言，并在整体性能上大幅提升，实现了低于 200 毫秒的超低延迟流式输出，极大提升了实时交互体验。此外，它还无缝集成了 🐶Bark、🐢Tortoise 等社区热门模型，并支持调用上千个 Fairseq 模型，展现了极强的兼容性与扩展性。配合丰富的数据集分析与整理工具，🐸TTS 已成为科研与生产环境中备受信赖的语音合成解决方案。",44971,"2026-04-03T14:47:02",[21,20,13],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":29,"last_commit_at":46,"category_tags":47,"status":22},2375,"LocalAI","mudler\u002FLocalAI","LocalAI 是一款开源的本地人工智能引擎，旨在让用户在任意硬件上轻松运行各类 AI 模型，包括大语言模型、图像生成、语音识别及视频处理等。它的核心优势在于彻底打破了高性能计算的门槛，无需昂贵的专用 GPU，仅凭普通 CPU 或常见的消费级显卡（如 NVIDIA、AMD、Intel 及 Apple Silicon）即可部署和运行复杂的 AI 任务。\n\n对于担心数据隐私的用户而言，LocalAI 提供了“隐私优先”的解决方案，确保所有数据处理均在本地基础设施内完成，无需上传至云端。同时，它完美兼容 OpenAI、Anthropic 等主流 API 接口，这意味着开发者可以无缝迁移现有应用，直接利用本地资源替代云服务，既降低了成本又提升了可控性。\n\nLocalAI 内置了超过 35 种后端支持（如 llama.cpp、vLLM、Whisper 等），并集成了自主 AI 代理、工具调用及检索增强生成（RAG）等高级功能，且具备多用户管理与权限控制能力。无论是希望保护敏感数据的企业开发者、进行算法实验的研究人员，还是想要在个人电脑上体验最新 AI 技术的极客玩家，都能通过 LocalAI 获",44782,"2026-04-02T22:14:26",[13,21,19,17,20,14,16],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":29,"last_commit_at":54,"category_tags":55,"status":22},3108,"bark","suno-ai\u002Fbark","Bark 是由 Suno 推出的开源生成式音频模型，能够根据文本提示创造出高度逼真的多语言语音、音乐、背景噪音及简单音效。与传统仅能朗读文字的语音合成工具不同，Bark 基于 Transformer 架构，不仅能模拟说话，还能生成笑声、叹息、哭泣等非语言声音，甚至能处理带有情感色彩和语气停顿的复杂文本，极大地丰富了音频表达的可能性。\n\n它主要解决了传统语音合成声音机械、缺乏情感以及无法生成非语音类音效的痛点，让创作者能通过简单的文字描述获得生动自然的音频素材。无论是需要为视频配音的内容创作者、探索多模态生成的研究人员，还是希望快速原型设计的开发者，都能从中受益。普通用户也可通过集成的演示页面轻松体验其神奇效果。\n\n技术亮点方面，Bark 支持商业使用（MIT 许可），并在近期更新中实现了显著的推理速度提升，同时提供了适配低显存 GPU 
的版本，降低了使用门槛。此外，社区还建立了丰富的提示词库，帮助用户更好地驾驭模型生成特定风格的声音。只需几行 Python 代码，即可将创意文本转化为高质量音频，是连接文字与声音世界的强大桥梁。",39067,"2026-04-04T03:33:35",[21],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":29,"last_commit_at":62,"category_tags":63,"status":22},5908,"ChatTTS","2noise\u002FChatTTS","ChatTTS 是一款专为日常对话场景打造的生成式语音模型，特别适用于大语言模型助手等交互式应用。它主要解决了传统文本转语音（TTS）技术在对话中缺乏自然感、情感表达单一以及难以处理停顿、笑声等细微语气的问题，让机器生成的语音听起来更像真人在聊天。\n\n这款工具非常适合开发者、研究人员以及希望为应用增添自然语音交互功能的设计师使用。普通用户也可以通过社区开发的衍生产品体验其能力。ChatTTS 的核心亮点在于其对对话任务的深度优化：它不仅支持中英文双语，还能精准控制韵律细节，自动生成自然的 laughter（笑声）、pauses（停顿）和 interjections（插入语），从而实现多说话人的互动对话效果。在韵律表现上，ChatTTS 超越了大多数开源 TTS 模型。目前开源版本基于 4 万小时数据预训练而成，虽主要用于学术研究与教育目的，但已展现出强大的潜力，并支持流式音频生成与零样本推理，为后续的多情绪控制等进阶功能奠定了基础。",39042,"2026-04-09T11:54:03",[19,17,20,21],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":76,"owner_website":81,"owner_url":82,"languages":83,"stars":96,"forks":97,"last_commit_at":98,"license":99,"difficulty_score":10,"env_os":100,"env_gpu":101,"env_ram":102,"env_deps":103,"category_tags":111,"github_topics":112,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":22,"created_at":116,"updated_at":117,"faqs":118,"releases":148},7524,"huggingface\u002Fdistil-whisper","distil-whisper","Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.","distil-whisper 是专为英语语音识别打造的高效模型，它是知名开源模型 Whisper 的“蒸馏”版本。针对原始 Whisper 模型体积大、推理速度慢导致在实时应用或资源受限设备上部署困难的问题，distil-whisper 通过知识蒸馏技术，在保持极高精度的同时实现了显著的性能飞跃。\n\n该工具的核心亮点在于其卓越的效率：相比原版大型模型，它的体积缩小了约 50%，推理速度提升了 6 倍，而在各类测试集上的词错误率（WER）差距仅为 1% 以内。这意味着用户可以在几乎不牺牲识别准确度的前提下，大幅降低计算成本和等待时间。目前提供的多个版本中，distil-large-v3 综合性能最佳，适用于大多数场景；而参数量极小的 distil-small.en 则特别适合内存有限的移动端或嵌入式设备。\n\n需要注意的是，distil-whisper 目前仅支持英语识别，若需多语言支持，建议参考 OpenAI 发布的 Whisper Turbo 方案。这款工具非常适合需要构建快速、轻量级语音转文字应用的开发者，以及希望在本地设备运行高精度识别模型的研究人员。借助 Hugging Face Transformers 库，用户可以轻松集成并快速上手，让高效的语音识别触手可及。","distil-whisper 是专为英语语音识别打造的高效模型，它是知名开源模型 Whisper 的“蒸馏”版本。针对原始 Whisper 模型体积大、推理速度慢导致在实时应用或资源受限设备上部署困难的问题，distil-whisper 通过知识蒸馏技术，在保持极高精度的同时实现了显著的性能飞跃。\n\n该工具的核心亮点在于其卓越的效率：相比原版大型模型，它的体积缩小了约 50%，推理速度提升了 6 倍，而在各类测试集上的词错误率（WER）差距仅为 1% 以内。这意味着用户可以在几乎不牺牲识别准确度的前提下，大幅降低计算成本和等待时间。目前提供的多个版本中，distil-large-v3 综合性能最佳，适用于大多数场景；而参数量极小的 distil-small.en 则特别适合内存有限的移动端或嵌入式设备。\n\n需要注意的是，distil-whisper 目前仅支持英语识别，若需多语言支持，建议参考 OpenAI 发布的 Whisper Turbo 方案。这款工具非常适合需要构建快速、轻量级语音转文字应用的开发者，以及希望在本地设备运行高精度识别模型的研究人员。借助 Hugging Face Transformers 库，用户可以轻松集成并快速上手，让高效的语音识别触手可及。","# Distil-Whisper\n\n[[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00430)\n[[Models]](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fdistil-whisper\u002Fdistil-whisper-models-65411987e6727569748d2eb6)\n[[Colab]](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsanchit-gandhi\u002Fnotebooks\u002Fblob\u002Fmain\u002FDistil_Whisper_Benchmark.ipynb)\n[[Training Code]](training)\n\nDistil-Whisper is a distilled version of Whisper for English speech recognition that is **6 times faster**, 49% smaller, and performs **within 1% word \nerror rate (WER)** on out-of-distribution evaluation sets:\n\n| Model                                                                      | Params \u002F M | Rel. 
Latency ↑ | Short-Form WER ↓ | Long-Form WER ↓ |\n|----------------------------------------------------------------------------|------------|----------------|------------------|-----------------|\n| [large-v3](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fwhisper-large-v3)                 | 1550       | 1.0            | **8.4**          | 11.0            |\n|                                                                            |            |                |                  |                 |\n| [distil-large-v3](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v3)   | 756        | 6.3            | 9.7              | **10.8**        |\n| [distil-large-v2](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v2)   | 756        | 5.8            | 10.1             | 11.6            |\n| [distil-medium.en](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-medium.en) | 394        | **6.8**        | 11.1             | 12.4            |\n| [distil-small.en](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-small.en)   | **166**    | 5.6            | 12.1             | 12.8            |\n\nFor most applications, we recommend the latest [distil-large-v3](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v3) checkpoint,\nsince it is the most performant distilled checkpoint and compatible across all Whisper libraries. The only exception is \nresource-constrained applications with very little memory, such as on-device or mobile applications, where the \n[distil-small.en](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-small.en) is a great choice, since it is only 166M \nparameters and performs within 4% WER of Whisper large-v3.\n\n> [!NOTE]  \n> Distil-Whisper is only available for English speech recognition. For multilingual speech recognition, we recommend using the [Whisper Turbo](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fwhisper-large-v3-turbo) checkpoint, which was released by OpenAI and leverages the same principles as Distil-Whisper. For details, refer to the Whisper turbo [release statement](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper\u002Fdiscussions\u002F2363).\n\n## 1. Usage\n\nDistil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first \ninstall the latest version of the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy \naudio dataset from the Hugging Face Hub:\n\n```bash\npip install --upgrade pip\npip install --upgrade transformers accelerate datasets[audio]\n```\n\n### Short-Form Transcription\n\nShort-form transcription is the process of transcribing audio samples that are less than 30-seconds long, which is the \nmaximum receptive field of the Whisper models. 
This means the entire audio clip can be processed in one go without the \nneed for chunking.\n\nFirst, we load Distil-Whisper via the convenient [`AutoModelForSpeechSeq2Seq`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmodel_doc\u002Fauto#transformers.AutoModelForSpeechSeq2Seq) and [`AutoProcessor`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmodel_doc\u002Fauto#transformers.AutoProcessor) classes.\n\nWe load the model in `float16` precision and make sure that loading time takes as little time as possible by passing `low_cpu_mem_usage=True`.\nIn addition, we want to make sure that the model is loaded in [`safetensors`](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fsafetensors) format by passing `use_safetensors=True`:\n\n```python\nimport torch\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\nmodel_id = \"distil-whisper\u002Fdistil-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n```\n\nThe model and processor can then be passed to the [`pipeline`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain_classes\u002Fpipelines#transformers.AutomaticSpeechRecognitionPipeline).\nNote that if you would like to have more control over the generation process, you can directly make use of model + processor API as shown below.\n\n```python\npipe = pipeline(\n    \"automatic-speech-recognition\",\n    model=model,\n    tokenizer=processor.tokenizer,\n    feature_extractor=processor.feature_extractor,\n    max_new_tokens=128,\n    torch_dtype=torch_dtype,\n    device=device,\n)\n```\n\nNext, we load an example short-form audio from the LibriSpeech corpus:\n\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"hf-internal-testing\u002Flibrispeech_asr_dummy\", \"clean\", split=\"validation\")\nsample = dataset[0][\"audio\"]\n```\n\nFinally, we can call the pipeline to transcribe the audio:\n\n```python\nresult = pipe(sample)\nprint(result[\"text\"])\n```\n\nTo transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:\n\n```python\nresult = pipe(\"audio.mp3\")\nprint(result[\"text\"])\n```\n\nFor more information on how to customize the automatic speech recognition pipeline, please refer to the ASR pipeline [docs](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fv4.34.1\u002Fen\u002Fmain_classes\u002Fpipelines#transformers.AutomaticSpeechRecognitionPipeline).\nWe also provide an end-to-end [Google Colab](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsanchit-gandhi\u002Fnotebooks\u002Fblob\u002Fmain\u002FDistil_Whisper_Benchmark.ipynb) that benchmarks Whisper against Distil-Whisper.\n\n\u003Cdetails>\n\u003Csummary> For more control over the generation parameters, use the model + processor API directly: \u003C\u002Fsummary>\n\nAd-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam-search, `return_timestamps` \nfor segment-level timestamps, and `prompt_ids` for prompting. 
See the [docstrings](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fen\u002Fmodel_doc\u002Fwhisper#transformers.WhisperForConditionalGeneration.generate)\nfor more details.\n\n```python\nimport torch\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor\nfrom datasets import Audio, load_dataset\n\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\nmodel_id = \"distil-whisper\u002Fdistil-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n\ndataset = load_dataset(\"hf-internal-testing\u002Flibrispeech_asr_dummy\", \"clean\", split=\"validation\")\ndataset = dataset.cast_column(\"audio\", Audio(processor.feature_extractor.sampling_rate))\nsample = dataset[0][\"audio\"]\n\ninput_features = processor(\n  sample[\"array\"], sampling_rate=sample[\"sampling_rate\"], return_tensors=\"pt\"\n).input_features\n\ninput_features = input_features.to(device, dtype=torch_dtype)\n\ngen_kwargs = {\n  \"max_new_tokens\": 128,\n  \"num_beams\": 1,\n  \"return_timestamps\": False,\n}\n\npred_ids = model.generate(input_features, **gen_kwargs)\npred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs[\"return_timestamps\"])\n\nprint(pred_text)\n```\n\n\u003C\u002Fdetails>\n\n### Sequential Long-Form\n\nThe latest [distil-large-v3](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v3) checkpoint is specifically designed \nto be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for \nbuffered inference of long audio files (> 30-seconds), and returns more accurate transcriptions compared to the \n[chunked long-form algorithm](#chunked-long-form).\n\nThe sequential long-form algorithm should be used in either of the following scenarios:\n1. Transcription accuracy is the most important factor, and latency is less of a consideration\n2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate\n\nIf you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm\ndescribed [below](#chunked-long-form). 
For a detailed explanation of the different algorithms, refer to Section 5 of \nthe [Distil-Whisper paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.00430.pdf).\n\nWe start by loading the model and processor as before:\n\n```python\nimport torch\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\nmodel_id = \"distil-whisper\u002Fdistil-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n```\n\nThe model and processor can then be passed to the [`pipeline`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain_classes\u002Fpipelines#transformers.AutomaticSpeechRecognitionPipeline).\nNote that if you would like to have more control over the generation process, you can directly make use of the `model.generate(...)` API as shown below.\n\n```python\npipe = pipeline(\n    \"automatic-speech-recognition\",\n    model=model,\n    tokenizer=processor.tokenizer,\n    feature_extractor=processor.feature_extractor,\n    max_new_tokens=128,\n    torch_dtype=torch_dtype,\n    device=device,\n)\n```\n\nNext, we load a long-form audio sample. Here, we use an example of concatenated samples from the LibriSpeech corpus:\n\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"distil-whisper\u002Flibrispeech_long\", \"clean\", split=\"validation\")\nsample = dataset[0][\"audio\"]\n```\n\nFinally, we can call the pipeline to transcribe the audio:\n\n```python\nresult = pipe(sample)\nprint(result[\"text\"])\n```\n\nTo transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:\n\n```python\nresult = pipe(\"audio.mp3\")\nprint(result[\"text\"])\n```\n\n\u003Cdetails>\n\n\u003Csummary> For more control over the generation parameters, use the model + processor API directly: \u003C\u002Fsummary>\n\n```python\nimport torch\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor\nfrom datasets import Audio, load_dataset\n\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\nmodel_id = \"distil-whisper\u002Fdistil-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n\ndataset = load_dataset(\"hf-internal-testing\u002Flibrispeech_asr_dummy\", \"clean\", split=\"validation\")\ndataset = dataset.cast_column(\"audio\", Audio(processor.feature_extractor.sampling_rate))\nsample = dataset[0][\"audio\"]\n\ninputs = processor(\n    sample[\"array\"],\n    sampling_rate=sample[\"sampling_rate\"],\n    return_tensors=\"pt\",\n    truncation=False,\n    padding=\"longest\",\n    return_attention_mask=True,\n)\ninputs = inputs.to(device, dtype=torch_dtype)\n\ngen_kwargs = {\n    \"max_new_tokens\": 448,\n    \"num_beams\": 1,\n    \"condition_on_prev_tokens\": False,\n    \"compression_ratio_threshold\": 1.35,  # zlib compression ratio threshold (in token space)\n    \"temperature\": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),\n    \"logprob_threshold\": -1.0,\n    \"no_speech_threshold\": 0.6,\n    \"return_timestamps\": True,\n}\n\npred_ids = 
model.generate(**inputs, **gen_kwargs)\npred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)\n\nprint(pred_text)\n```\n\n\u003C\u002Fdetails>\n\n### Chunked Long-Form\n\ndistil-large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when \na single large audio file is being transcribed and the fastest possible inference is required. In such circumstances, \nthe chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the \n[Distil-Whisper paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.00430.pdf)).\n\nWe can load the model and processor as before:\n\n```python\nimport torch\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\nmodel_id = \"distil-whisper\u002Fdistil-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n```\n\nTo enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For distil-large-v3, a chunk length of 25-seconds\nis optimal. To activate batching, pass the argument `batch_size`:\n\n```python\npipe = pipeline(\n    \"automatic-speech-recognition\",\n    model=model,\n    tokenizer=processor.tokenizer,\n    feature_extractor=processor.feature_extractor,\n    max_new_tokens=128,\n    chunk_length_s=25,\n    batch_size=16,\n    torch_dtype=torch_dtype,\n    device=device,\n)\n```\n\nThe argument `max_new_tokens` controls the maximum number of generated tokens *per-chunk*. In the typical speech setting,\nwe have no more than 3 words spoken per-second. Therefore, for a 30-second input, we have at most 90 words (approx 128 tokens).\nWe set the maximum number of generated tokens per-chunk to 128 to truncate any possible hallucinations that occur at the \nend of the segment. These tokens get removed at the chunk borders using the long-form chunking transcription algorithm, \nso it is more efficient to truncate them directly during generation to avoid redundant generation steps in the decoder.
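\n\nAs a quick sanity check on that budget, the arithmetic can be replayed directly (an illustrative sketch only; the 3 words-per-second rate and the words-to-tokens ratio implied by \"90 words ≈ 128 tokens\" are rough assumptions from the paragraph above, not measured constants):\n\n```python\n# Rough per-chunk token budget (illustrative numbers only)\nchunk_length_s = 25            # chunk length used in the pipeline above\nwords_per_second = 3           # upper bound assumed in the text\ntokens_per_word = 128 \u002F 90     # ratio implied by \"90 words ≈ 128 tokens\"\n\nbudget = chunk_length_s * words_per_second * tokens_per_word\nprint(f\"~{budget:.0f} tokens per {chunk_length_s}s chunk\")  # ~107, under max_new_tokens=128\n```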
\nThis makes it the perfect drop-in replacement for existing Whisper pipelines, since the same outputs are guaranteed.\n\nFor speculative decoding, we need to load both the teacher: [`openai\u002Fwhisper-large-v3`](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fwhisper-large-v3).\nAs well as the assistant (*a.k.a* student) [`distil-whisper\u002Fdistil-large-v3`](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v3).\n\nLet's start by loading the teacher model and processor. We do this in much the same way we loaded the Distil-Whisper \nmodel in the previous examples:\n\n```python\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor\nimport torch\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\nmodel_id = \"openai\u002Fwhisper-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n```\n\nNow let's load the assistant. Since Distil-Whisper shares exactly same encoder as the teacher model, we only need \nto load the 2-layer decoder as a \"Decoder-only\" model:\n\n```python\nfrom transformers import AutoModelForCausalLM\nassistant_model_id = \"distil-whisper\u002Fdistil-large-v2\"\n\nassistant_model = AutoModelForCausalLM.from_pretrained(\n    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nassistant_model.to(device)\n```\n\nThe assistant model shares the same processor as the teacher, so there's no need to load a student processor.\n\nWe can now pass the assistant model to the pipeline to be used for speculative decoding. We pass it as a `generate_kwarg`\nwith the key [`\"assistant_model\"`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fmain_classes\u002Ftext_generation#transformers.GenerationMixin.generate.assistant_model) \nso that speculative decoding is enabled:\n\n```python\npipe = pipeline(\n    \"automatic-speech-recognition\",\n    model=model,\n    tokenizer=processor.tokenizer,\n    feature_extractor=processor.feature_extractor,\n    max_new_tokens=128,\n    generate_kwargs={\"assistant_model\": assistant_model},\n    torch_dtype=torch_dtype,\n    device=device,\n)\n```\n\nAs before, we can pass any sample to the pipeline to be transcribed:\n\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"hf-internal-testing\u002Flibrispeech_asr_dummy\", \"clean\", split=\"validation\")\nsample = dataset[0][\"audio\"]\n\nresult = pipe(sample)\nprint(result[\"text\"])\n```\n\n**Note:** speculative decoding should be on average 2x faster than using \"only\" Whisper large-v2 at a mere 8% increase \nin VRAM memory usage while mathematically ensuring the same results. 
\n\nFor more details on speculative decoding, refer to the following resources:\n* [Speculative decoding for 2x faster Whisper inference](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fwhisper-speculative-decoding) blog post by Sanchit Gandhi\n* [Assisted Generation: a new direction toward low-latency text generation](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fassisted-generation) blog post by Joao Gante\n* [Fast Inference from Transformers via Speculative Decoding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.17192) paper by Leviathan et al.\n\n### Additional Speed & Memory Improvements\n\nYou can apply additional speed and memory improvements to Distil-Whisper, which we cover in the following sections.\n\n#### Flash Attention\n\nWe recommend using [Flash Attention 2](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fperf_infer_gpu_one#flashattention-2) if your GPU allows for it.\nTo do so, you first need to install [Flash Attention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention):\n\n```\npip install flash-attn --no-build-isolation\n```\n\nYou can then pass `use_flash_attention_2=True` to `from_pretrained` to enable Flash Attention 2:\n\n```diff\n- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)\n+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)\n```\n\n#### Torch Scaled Dot-Product Attention (SDPA)\n\nIf your GPU does not support Flash Attention, we recommend making use of [BetterTransformer](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fperf_infer_gpu_one#bettertransformer).\nTo do so, you first need to install optimum:\n\n```\npip install --upgrade optimum\n```\n\nAnd then convert your model to a \"BetterTransformer\" model before using it:\n\n```diff\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)\n+ model = model.to_bettertransformer()\n```
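\n\nAs of 🤗 Transformers v4.36, the same optimisations can also be requested directly at load time via the `attn_implementation` argument, which supersedes both of the switches above (a sketch of the newer form):\n\n```python\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id,\n    torch_dtype=torch_dtype,\n    low_cpu_mem_usage=True,\n    use_safetensors=True,\n    attn_implementation=\"sdpa\",  # or \"flash_attention_2\" if flash-attn is installed\n)\n```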
\nClick the links in the table to see the relevant code-snippets for each:\n\n| Library         | distil-small.en                                                                                 | distil-medium.en                                                                                 | distil-large-v2                                                                                 |\n|-----------------|-------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|\n| OpenAI Whisper  | [link](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-small.en#running-whisper-in-openai-whisper) | [link](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-medium.en#running-whisper-in-openai-whisper) | [link](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v2#running-whisper-in-openai-whisper) |\n| Whisper cpp     | [link](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-small.en#whispercpp)                        | [link](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-medium.en#whispercpp)                        | [link](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v2#whispercpp)                        |\n| Transformers js | [link](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-small.en#transformersjs)                    | [link](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-medium.en#transformersjs)                    | [link](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v2#transformersjs)                    |\n| Candle (Rust)   | [link](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-small.en#candle)                            | [link](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-medium.en#candle)                            | [link](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v2#candle)                            |\n\nUpdates will be posted here with the integration of the \"chunked\" long-form transcription algorithm into the respective \nlibraries.\n\nFor the 🤗 Transformers code-examples, refer to the sections [Short-Form](#short-form-transcription) and [Long-Form](#long-form-transcription) Transcription.\n\n## 2. Why use Distil-Whisper? ⁉️\n\nDistil-Whisper is designed to be a drop-in replacement for Whisper on English speech recognition. Here are 5 reasons for making the\nswitch to Distil-Whisper:\n\n1. **Faster inference:** 6 times faster inference speed, while performing to within 1% WER of Whisper on out-of-distribution audio:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdistil-whisper\u002Ffigures\u002Fresolve\u002Fmain\u002Fmain_table.png?raw=true\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\n2. **Robustness to noise:** demonstrated by strong WER performance at low signal-to-noise ratios:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_distil-whisper_readme_98a7eb680341.png\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\n3. **Robustness to hallucinations:** quantified by 1.3 times fewer repeated 5-gram word duplicates (5-Dup.) 
and 2.1% lower insertion error rate (IER) than Whisper:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_distil-whisper_readme_85e16fc3bc1a.png\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\n4. **Designed for speculative decoding:** Distil-Whisper can be used as an assistant model to Whisper, giving 2 times faster inference speed while mathematically ensuring the same outputs as the Whisper model.\n5. **Permissive license:** Distil-Whisper is [MIT licensed](.\u002FLICENSE), meaning it can be used for commercial applications.\n\n## 3. Approach ✍️\n\nTo distill Whisper, we copy the entire encoder module and freeze it during training. We copy only two decoder layers, \nwhich are initialised from the first and last decoder layers from Whisper. All other decoder layers from Whisper\nare discarded:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_distil-whisper_readme_8fa148278c00.png\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\nDistil-Whisper is trained on a *knowledge distillation* objective. Specifically, it is trained to minimise the KL divergence\nbetween the distilled model and the Whisper model, as well as the cross-entropy loss on pseudo-labelled audio data.
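\n\nIn schematic form, the objective combines the two terms described above. The following is a minimal sketch, assuming per-token logits from both models and Whisper-generated pseudo-label ids; the weighting coefficients are illustrative placeholders rather than the values used in the paper (the actual recipe lives in the [training](training) directory):\n\n```python\nimport torch.nn.functional as F\n\ndef distillation_loss(student_logits, teacher_logits, pseudo_label_ids,\n                      kl_weight=1.0, ce_weight=1.0):\n    # KL divergence between the student and teacher next-token distributions\n    kl = F.kl_div(\n        F.log_softmax(student_logits, dim=-1),\n        F.softmax(teacher_logits, dim=-1),\n        reduction=\"batchmean\",\n    )\n    # Cross-entropy of the student against the Whisper pseudo-labels\n    ce = F.cross_entropy(\n        student_logits.view(-1, student_logits.size(-1)),\n        pseudo_label_ids.view(-1),\n    )\n    return kl_weight * kl + ce_weight * ce\n```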
\n\nWe train Distil-Whisper on a total of 22k hours of pseudo-labelled audio data, spanning 10 domains with over 18k speakers:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdistil-whisper\u002Ffigures\u002Fresolve\u002Fmain\u002Fdatasets.png?raw=true\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\nThis diverse audio dataset is paramount to ensuring robustness of Distil-Whisper to different datasets and domains. \n\nIn addition, we use a WER filter to discard pseudo-labels where Whisper mis-transcribes or hallucinates. This greatly \nimproves WER performance of the downstream distilled model.\n\nFor full details on the distillation set-up and evaluation results, refer to the [Distil-Whisper paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00430).\n\n## 4. Training Code\n\nTraining code to reproduce Distil-Whisper can be found in the directory [training](training). This code has been adapted to \nbe general enough to distill Whisper for multilingual speech recognition, enabling anyone in the community to distill \nWhisper on their choice of language.\n\n## 5. Acknowledgements\n* OpenAI for the Whisper [model](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fwhisper-large-v3) and [original codebase](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper)\n* Hugging Face 🤗 [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) for the model integration\n* Google's [TPU Research Cloud (TRC)](https:\u002F\u002Fsites.research.google\u002Ftrc\u002Fabout\u002F) program for Cloud TPU v4s\n\n## 6. Citation\n\nIf you use this model, please consider citing the Distil-Whisper paper:\n```\n@misc{gandhi2023distilwhisper,\n      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling}, \n      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},\n      year={2023},\n      eprint={2311.00430},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\nAnd also the Whisper paper:\n```\n@misc{radford2022robust,\n      title={Robust Speech Recognition via Large-Scale Weak Supervision}, \n      author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},\n      year={2022},\n      eprint={2212.04356},\n      archivePrefix={arXiv},\n      primaryClass={eess.AS}\n}\n```\n","# Distil-Whisper\n\n[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00430)\n[[模型]](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fdistil-whisper\u002Fdistil-whisper-models-65411987e6727569748d2eb6)\n[[Colab]](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsanchit-gandhi\u002Fnotebooks\u002Fblob\u002Fmain\u002FDistil_Whisper_Benchmark.ipynb)\n[[训练代码]](training)\n\nDistil-Whisper 是 Whisper 的蒸馏版英语语音识别模型，其速度提升了 **6 倍**，模型大小缩小了 49%，并且在分布外评估集上的词错误率（WER）与原模型的差距在 **1%** 以内：\n\n| 模型                                                                      | 参数 \u002F 百万 | 相对延迟 ↑ | 短文本 WER ↓ | 长文本 WER ↓ |\n|----------------------------------------------------------------------------|------------|----------------|------------------|-----------------|\n| [large-v3](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fwhisper-large-v3)                 | 1550       | 1.0            | **8.4**          | 11.0            |\n|                                                                            |            |                |                  |                 |\n| [distil-large-v3](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v3)   | 756        | 6.3            | 9.7              | **10.8**        |\n| [distil-large-v2](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v2)   | 756        | 5.8            | 10.1             | 11.6            |\n| [distil-medium.en](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-medium.en) | 394        | **6.8**        | 11.1             | 12.4            |\n| [distil-small.en](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-small.en)   | **166**    | 5.6            | 12.1             | 12.8            |\n\n对于大多数应用场景，我们推荐使用最新的 [distil-large-v3](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v3) 检查点，因为它是在所有 Whisper 库中兼容性最好、性能最强的蒸馏模型。唯一的例外是资源受限、内存极少的应用场景，例如设备端或移动应用，在这种情况下，[distil-small.en](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-small.en) 是一个很好的选择，因为它只有 1.66 亿参数，且 WER 与 Whisper large-v3 的差距在 4% 以内。\n\n> [!NOTE]  \n> Distil-Whisper 目前仅支持英语语音识别。对于多语言语音识别，我们建议使用由 OpenAI 发布的 [Whisper Turbo](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fwhisper-large-v3-turbo) 检查点，它采用了与 Distil-Whisper 相同的原理。有关详细信息，请参阅 Whisper Turbo 的 [发布说明](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper\u002Fdiscussions\u002F2363)。\n\n## 1. 
使用方法\n\nDistil-Whisper 自 Hugging Face 🤗 Transformers 4.35 版本起得到支持。要运行该模型，首先需要安装最新版本的 Transformers 库。在此示例中，我们还将安装 🤗 Datasets，以便从 Hugging Face Hub 加载一个玩具音频数据集：\n\n```bash\npip install --upgrade pip\npip install --upgrade transformers accelerate datasets[audio]\n```\n\n### 短格式转录\n\n短格式转录是指对长度小于30秒的音频片段进行转录的过程，这正是Whisper模型的最大接收范围。这意味着整个音频片段可以一次性处理完毕，而无需将其分割成多个部分。\n\n首先，我们通过便捷的[`AutoModelForSpeechSeq2Seq`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmodel_doc\u002Fauto#transformers.AutoModelForSpeechSeq2Seq)和[`AutoProcessor`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmodel_doc\u002Fauto#transformers.AutoProcessor)类加载Distil-Whisper模型。\n\n我们将模型以`float16`精度加载，并通过传递`low_cpu_mem_usage=True`来尽可能缩短加载时间。此外，我们还希望确保模型以[`safetensors`](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fsafetensors)格式加载，因此传递了`use_safetensors=True`：\n\n```python\nimport torch\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\nmodel_id = \"distil-whisper\u002Fdistil-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n```\n\n随后，可以将模型和处理器传递给[`pipeline`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain_classes\u002Fpipelines#transformers.AutomaticSpeechRecognitionPipeline)。请注意，如果您希望对生成过程有更多控制，可以直接使用模型加处理器的API，如下所示。\n\n```python\npipe = pipeline(\n    \"automatic-speech-recognition\",\n    model=model,\n    tokenizer=processor.tokenizer,\n    feature_extractor=processor.feature_extractor,\n    max_new_tokens=128,\n    torch_dtype=torch_dtype,\n    device=device,\n)\n```\n\n接下来，我们从LibriSpeech语料库中加载一个示例短格式音频：\n\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"hf-internal-testing\u002Flibrispeech_asr_dummy\", \"clean\", split=\"validation\")\nsample = dataset[0][\"audio\"]\n```\n\n最后，我们可以调用管道来转录音频：\n\n```python\nresult = pipe(sample)\nprint(result[\"text\"])\n```\n\n若要转录本地音频文件，只需在调用管道时传入音频文件的路径即可：\n\n```python\nresult = pipe(\"audio.mp3\")\nprint(result[\"text\"])\n```\n\n有关如何自定义自动语音识别管道的更多信息，请参阅ASR管道的[文档](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fv4.34.1\u002Fen\u002Fmain_classes\u002Fpipelines#transformers.AutomaticSpeechRecognitionPipeline)。我们还提供了一个端到端的[Google Colab](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsanchit-gandhi\u002Fnotebooks\u002Fblob\u002Fmain\u002FDistil_Whisper_Benchmark.ipynb)，用于对比Whisper与Distil-Whisper的性能。\n\n\u003Cdetails>\n\u003Csummary> 如需更精细地控制生成参数，请直接使用模型加处理器的API： \u003C\u002Fsummary>\n\n可以向`model.generate`传递临时生成参数，包括用于束搜索的`num_beams`、用于段落级时间戳的`return_timestamps`以及用于提示的`prompt_ids`。更多详细信息请参阅[文档字符串](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fen\u002Fmodel_doc\u002Fwhisper#transformers.WhisperForConditionalGeneration.generate)。\n\n```python\nimport torch\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor\nfrom datasets import Audio, load_dataset\n\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\nmodel_id = \"distil-whisper\u002Fdistil-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, 
use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n\ndataset = load_dataset(\"hf-internal-testing\u002Flibrispeech_asr_dummy\", \"clean\", split=\"validation\")\ndataset = dataset.cast_column(\"audio\", Audio(processor.feature_extractor.sampling_rate))\nsample = dataset[0][\"audio\"]\n\ninput_features = processor(\n  sample[\"array\"], sampling_rate=sample[\"sampling_rate\"], return_tensors=\"pt\"\n).input_features\n\ninput_features = input_features.to(device, dtype=torch_dtype)\n\ngen_kwargs = {\n  \"max_new_tokens\": 128,\n  \"num_beams\": 1,\n  \"return_timestamps\": False,\n}\n\npred_ids = model.generate(input_features, **gen_kwargs)\npred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs[\"return_timestamps\"])\n\nprint(pred_text)\n```\n\n\u003C\u002Fdetails>\n\n### 顺序长文本模式\n\n最新的 [distil-large-v3](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v3) 检查点专门设计用于兼容 OpenAI 的顺序长文本转录算法。该算法采用滑动窗口对较长的音频文件（超过 30 秒）进行缓冲推理，与 [分块长文本算法](#chunked-long-form) 相比，能够生成更准确的转录结果。\n\n顺序长文本算法应在以下任一场景中使用：\n1. 转录准确性是最重要的因素，而延迟相对不那么重要。\n2. 您正在批量转录较长的音频文件，在这种情况下，顺序模式的延迟与分块模式相当，而 WER 最多可再降低 0.5%，准确率更高。\n\n如果您只转录单个较长的音频文件，并且延迟是最关键的因素，则应使用下文所述的分块算法。有关不同算法的详细说明，请参阅 [Distil-Whisper 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.00430.pdf) 的第 5 节。\n\n首先，我们像之前一样加载模型和处理器：\n\n```python\nimport torch\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\nmodel_id = \"distil-whisper\u002Fdistil-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n```\n\n随后，可以将模型和处理器传递给 [`pipeline`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain_classes\u002Fpipelines#transformers.AutomaticSpeechRecognitionPipeline)。请注意，如果您希望对生成过程有更多控制，可以直接使用 `model.generate(...)` API，如下所示。\n\n```python\npipe = pipeline(\n    \"automatic-speech-recognition\",\n    model=model,\n    tokenizer=processor.tokenizer,\n    feature_extractor=processor.feature_extractor,\n    max_new_tokens=128,\n    torch_dtype=torch_dtype,\n    device=device,\n)\n```\n\n接下来，我们加载一个长文本音频样本。这里我们使用 LibriSpeech 语料库中的拼接示例：\n\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"distil-whisper\u002Flibrispeech_long\", \"clean\", split=\"validation\")\nsample = dataset[0][\"audio\"]\n```\n\n最后，我们可以调用管道来转录音频：\n\n```python\nresult = pipe(sample)\nprint(result[\"text\"])\n```\n\n要转录本地音频文件，只需在调用管道时传入音频文件的路径即可：\n\n```python\nresult = pipe(\"audio.mp3\")\nprint(result[\"text\"])\n```\n\n\u003Cdetails>\n\n\u003Csummary> 如需更精细地控制生成参数，请直接使用模型和处理器 API： \u003C\u002Fsummary>\n\n```python\nimport torch\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor\nfrom datasets import Audio, load_dataset\n\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\nmodel_id = \"distil-whisper\u002Fdistil-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n\ndataset = load_dataset(\"hf-internal-testing\u002Flibrispeech_asr_dummy\", \"clean\", 
split=\"validation\")\ndataset = dataset.cast_column(\"audio\", Audio(processor.feature_extractor.sampling_rate))\nsample = dataset[0][\"audio\"]\n\ninputs = processor(\n    sample[\"array\"],\n    sampling_rate=sample[\"sampling_rate\"],\n    return_tensors=\"pt\",\n    truncation=False,\n    padding=\"longest\",\n    return_attention_mask=True,\n)\ninputs = inputs.to(device, dtype=torch_dtype)\n\ngen_kwargs = {\n    \"max_new_tokens\": 448,\n    \"num_beams\": 1,\n    \"condition_on_prev_tokens\": False,\n    \"compression_ratio_threshold\": 1.35,  # zlib 压缩比阈值（以 token 空间为单位）\n    \"temperature\": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),\n    \"logprob_threshold\": -1.0,\n    \"no_speech_threshold\": 0.6,\n    \"return_timestamps\": True,\n}\n\npred_ids = model.generate(**inputs, **gen_kwargs)\npred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)\n\nprint(pred_text)\n```\n\n\u003C\u002Fdetails>\n\n### 分块长文本模式\n\ndistil-large-v3 仍然兼容 Transformers 的分块长文本算法。当需要转录单个大型音频文件且要求尽可能快的推理速度时，应使用此算法。在这种情况下，分块算法的速度可比 OpenAI 的顺序长文本实现快高达 9 倍（参见 [Distil-Whisper 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.00430.pdf) 中的表 7）。\n\n我们可以像之前一样加载模型和处理器：\n\n```python\nimport torch\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\nmodel_id = \"distil-whisper\u002Fdistil-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n```\n\n要启用分块功能，需将 `chunk_length_s` 参数传递给 `pipeline`。对于 distil-large-v3，最佳的分块长度为 25 秒。若要激活批处理，则需传入 `batch_size` 参数：\n\n```python\npipe = pipeline(\n    \"automatic-speech-recognition\",\n    model=model,\n    tokenizer=processor.tokenizer,\n    feature_extractor=processor.feature_extractor,\n    max_new_tokens=128,\n    chunk_length_s=25,\n    batch_size=16,\n    torch_dtype=torch_dtype,\n    device=device,\n)\n```\n\n`max_new_tokens` 参数控制每个分块生成的最大标记数。在典型的语音场景中，每秒通常不会超过 3 个词。因此，对于 30 秒的输入，最多约有 90 个词（约 128 个标记）。我们将每个分块的最大生成标记数设置为 128，以截断可能在片段末尾出现的幻觉内容。这些标记会在分块边界处通过长文本分块转录算法被移除，因此直接在生成过程中进行截断更为高效，可以避免解码器中多余的生成步骤。\n\n现在，让我们加载一个长文本音频样本。这里我们使用 LibriSpeech 语料库中的拼接示例：\n\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"distil-whisper\u002Flibrispeech_long\", \"clean\", split=\"validation\")\nsample = dataset[0][\"audio\"]\n```\n\n最后，我们可以调用管道来转录音频：\n\n```python\nresult = pipe(sample)\nprint(result[\"text\"])\n```\n\n有关如何自定义自动语音识别管道的更多信息，请参阅 ASR 管道的 [文档](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fv4.34.1\u002Fen\u002Fmain_classes\u002Fpipelines#transformers.AutomaticSpeechRecognitionPipeline)。\n\n### 推测解码\n\nDistil-Whisper 可以作为 Whisper 的辅助模型，用于 [推测解码](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fwhisper-speculative-decoding)。推测解码在数学上能够确保与 Whisper 完全相同的输出，同时速度提升至两倍。这使其成为现有 Whisper 管道的理想替代品，因为可以保证输出完全一致。\n\n进行推测解码时，我们需要同时加载教师模型：[`openai\u002Fwhisper-large-v3`](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fwhisper-large-v3)，以及辅助模型（即学生模型）：[`distil-whisper\u002Fdistil-large-v3`](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v3)。\n\n首先，我们加载教师模型及其处理器。加载方式与前面示例中加载 Distil-Whisper 模型的方式大致相同：\n\n```python\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor\nimport torch\n\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = 
torch.float16 if torch.cuda.is_available() else torch.float32\n\nmodel_id = \"openai\u002Fwhisper-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n```\n\n接下来，我们加载辅助模型。由于 Distil-Whisper 与教师模型共享完全相同的编码器，因此我们只需加载一个两层的解码器，将其视为“仅解码器”模型：\n\n```python\nfrom transformers import AutoModelForCausalLM\nassistant_model_id = \"distil-whisper\u002Fdistil-large-v2\"\n\nassistant_model = AutoModelForCausalLM.from_pretrained(\n    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n)\nassistant_model.to(device)\n```\n\n辅助模型与教师模型共用同一处理器，因此无需单独加载学生处理器。\n\n现在，我们可以将辅助模型传递给管道，以用于推测解码。我们将它作为 `generate_kwarg` 参数，键名为 [`\"assistant_model\"`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fmain_classes\u002Ftext_generation#transformers.GenerationMixin.generate.assistant_model)，从而启用推测解码功能：\n\n```python\npipe = pipeline(\n    \"automatic-speech-recognition\",\n    model=model,\n    tokenizer=processor.tokenizer,\n    feature_extractor=processor.feature_extractor,\n    max_new_tokens=128,\n    generate_kwargs={\"assistant_model\": assistant_model},\n    torch_dtype=torch_dtype,\n    device=device,\n)\n```\n\n与之前一样，我们可以将任意样本传递给管道进行转录：\n\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"hf-internal-testing\u002Flibrispeech_asr_dummy\", \"clean\", split=\"validation\")\nsample = dataset[0][\"audio\"]\n\nresult = pipe(sample)\nprint(result[\"text\"])\n```\n\n**注意：** 推测解码的平均速度应比单独使用教师模型快两倍，而显存占用仅增加约 8%，同时在数学上确保结果完全一致。这使得它成为在现有语音识别管道中替换原版 Whisper 的理想选择。\n\n有关推测解码的更多详细信息，请参阅以下资源：\n* Sanchit Gandhi 撰写的博客文章 [用于 2 倍更快 Whisper 推理的推测解码](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fwhisper-speculative-decoding)\n* Joao Gante 撰写的博客文章 [辅助生成：迈向低延迟文本生成的新方向](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fassisted-generation)\n* Leviathan 等人撰写的论文 [通过推测解码实现 Transformers 的快速推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.17192)\n\n### 其他速度与内存优化\n\n您可以对 Distil-Whisper 应用一些额外的速度与内存优化方法，我们将在下文逐一介绍。\n\n#### Flash Attention\n\n如果您的 GPU 支持，我们建议使用 [Flash Attention 2](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fperf_infer_gpu_one#flashattention-2)。为此，您需要先安装 [Flash Attention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention)：\n\n```\npip install flash-attn --no-build-isolation\n```\n\n然后，您可以将 `use_flash_attention_2=True` 传递给 `from_pretrained` 方法，以启用 Flash Attention 2：\n\n```diff\n- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)\n+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)\n```\n\n#### Torch Scaled Dot-Product Attention (SDPA)\n\n如果您的 GPU 不支持 Flash Attention，我们建议使用 [BetterTransformer](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fperf_infer_gpu_one#bettertransformer)。为此，您需要先安装 optimum：\n\n```\npip install --upgrade optimum\n```\n\n然后，在使用模型之前将其转换为 \"BetterTransformer\" 模型：\n\n```diff\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)\n+ model = model.to_bettertransformer()\n```
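\n\n自 Transformers v4.36 起，上述优化也可以在加载模型时通过 `attn_implementation` 参数直接指定（示意写法；新版中该参数可取代上面两种旧式开关）：\n\n```python\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id,\n    torch_dtype=torch_dtype,\n    low_cpu_mem_usage=True,\n    use_safetensors=True,\n    attn_implementation=\"sdpa\",  # 若已安装 flash-attn，可改为 \"flash_attention_2\"\n)\n```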
\n\n### 导出到其他库\n\nDistil-Whisper 在以下库中支持原始的“顺序式”长文本转录算法。点击表格中的链接即可查看各库的相关代码片段：\n\n| 库              | distil-small.en | distil-medium.en | distil-large-v2 |\n|-----------------|-----------------|------------------|-----------------|\n| OpenAI Whisper  | [链接](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-small.en#running-whisper-in-openai-whisper) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-medium.en#running-whisper-in-openai-whisper) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v2#running-whisper-in-openai-whisper) |\n| Whisper cpp     | [链接](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-small.en#whispercpp) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-medium.en#whispercpp) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v2#whispercpp) |\n| Transformers js | [链接](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-small.en#transformersjs) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-medium.en#transformersjs) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v2#transformersjs) |\n| Candle (Rust)   | [链接](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-small.en#candle) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-medium.en#candle) | [链接](https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v2#candle) |\n\n随着“分块式”长文本转录算法集成到各个库中，相关更新将在此处发布。\n\n有关 🤗 Transformers 的代码示例，请参阅“短格式转录”、“顺序长文本模式”和“分块长文本模式”各节。\n\n## 2. 为什么使用 Distil-Whisper？ ⁉️\n\nDistil-Whisper 被设计为英语语音识别领域中 Whisper 的直接替代品。以下是切换到 Distil-Whisper 的 5 个理由：\n\n1. **更快的推理速度：** 推理速度是 Whisper 的 6 倍，同时在分布外音频上的 WER 与 Whisper 的差距在 1% 以内：\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdistil-whisper\u002Ffigures\u002Fresolve\u002Fmain\u002Fmain_table.png?raw=true\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\n2. **抗噪能力强：** 在低信噪比条件下表现出色的 WER 性能：\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_distil-whisper_readme_98a7eb680341.png\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\n3. **减少幻觉现象：** 经量化分析，Distil-Whisper 的重复 5-gram 单词数量（5-Dup.）约为 Whisper 的 1\u002F1.3，插入错误率（IER）也比 Whisper 低 2.1%：\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_distil-whisper_readme_85e16fc3bc1a.png\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\n4. **专为推测解码设计：** Distil-Whisper 可用作 Whisper 的辅助模型，使推理速度提升至原来的 2 倍，同时在数学上保证输出结果与 Whisper 相同。\n5. **宽松的许可证：** Distil-Whisper 采用 [MIT 许可证](.\u002FLICENSE)，因此可用于商业用途。\n\n## 3. 
方法 ✍️\n\n为了蒸馏 Whisper，我们将整个编码器模块复制并冻结，只保留两个解码器层，并从 Whisper 的第一个和最后一个解码器层初始化它们。其余的解码器层则被舍弃：\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_distil-whisper_readme_8fa148278c00.png\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\nDistil-Whisper 是基于知识蒸馏的目标进行训练的。具体来说，它通过最小化蒸馏模型与 Whisper 模型之间的 KL 散度，以及伪标签音频数据上的交叉熵损失来进行训练。\n\n我们使用总计 22,000 小时的伪标签音频数据对 Distil-Whisper 进行训练，这些数据涵盖了 10 个领域，涉及超过 18,000 名说话者：\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdistil-whisper\u002Ffigures\u002Fresolve\u002Fmain\u002Fdatasets.png?raw=true\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\n如此多样化的音频数据集对于确保 Distil-Whisper 在不同数据集和领域中的鲁棒性至关重要。\n\n此外，我们还使用 WER 过滤器来剔除 Whisper 错误转录或产生幻觉的伪标签，这大大提升了下游蒸馏模型的 WER 性能。\n\n有关蒸馏设置和评估结果的详细信息，请参阅 [Distil-Whisper 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00430)。\n\n## 4. 训练代码\n\n用于复现 Distil-Whisper 的训练代码可在 [training](training) 目录中找到。该代码经过调整，足以支持多语言语音识别的 Whisper 蒸馏工作，方便社区中的任何人根据自己的需求蒸馏任意语言的 Whisper。\n\n## 5. 致谢\n* OpenAI 提供的 Whisper [模型](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fwhisper-large-v3)及[原始代码库](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper)\n* Hugging Face 🤗 [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) 在模型集成方面的支持\n* Google 的 [TPU 研究云 (TRC)](https:\u002F\u002Fsites.research.google\u002Ftrc\u002Fabout\u002F) 计划提供的 Cloud TPU v4 资源\n\n## 6. 引用\n\n如果您使用本模型，请考虑引用 Distil-Whisper 论文：\n```\n@misc{gandhi2023distilwhisper,\n      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling}, \n      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},\n      year={2023},\n      eprint={2311.00430},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n同时请引用 Whisper 论文：\n```\n@misc{radford2022robust,\n      title={Robust Speech Recognition via Large-Scale Weak Supervision}, \n      author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},\n      year={2022},\n      eprint={2212.04356},\n      archivePrefix={arXiv},\n      primaryClass={eess.AS}\n}\n```","# Distil-Whisper 快速上手指南\n\nDistil-Whisper 是 Whisper 模型的蒸馏版本，专为**英语语音识别**设计。相比原版，它速度快 **6 倍**，体积小 **49%**，且在分布外测试集上的词错误率（WER）差异控制在 **1%** 以内。\n\n> **注意**：本模型仅支持英语。如需多语言支持，建议使用 OpenAI 发布的 Whisper Turbo 模型。\n\n## 1. 环境准备\n\n*   **系统要求**：支持 Linux、macOS 或 Windows。\n*   **硬件建议**：推荐使用 NVIDIA GPU 以获得最佳推理速度（需安装 CUDA）。若无 GPU，可在 CPU 上运行但速度较慢。\n*   **Python 版本**：建议 Python 3.8+。\n*   **核心依赖**：\n    *   `transformers` >= 4.35\n    *   `torch` (PyTorch)\n    *   `accelerate`\n    *   `datasets` (用于加载测试音频)\n\n## 2. 安装步骤\n\n使用 pip 安装最新版本的依赖库。国内用户建议使用清华源或阿里源加速下载。\n\n```bash\n# 升级 pip\npip install --upgrade pip\n\n# 安装核心依赖 (推荐使用国内镜像源)\npip install --upgrade transformers accelerate datasets[audio] -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 3. 基本使用\n\n以下示例演示如何加载 `distil-large-v3` 模型并对短音频（\u003C30 秒）进行转录。该模型会自动检测是否有 GPU 并启用半精度 (`float16`) 加速。\n\n### 最简单的使用示例\n\n```python\nimport torch\nfrom transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline\nfrom datasets import load_dataset\n\n# 1. 设置设备和数据类型\ndevice = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\ntorch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\n# 2. 
加载模型和处理器\n# 推荐使用 distil-large-v3，它在性能和兼容性之间取得了最佳平衡\nmodel_id = \"distil-whisper\u002Fdistil-large-v3\"\n\nmodel = AutoModelForSpeechSeq2Seq.from_pretrained(\n    model_id, \n    torch_dtype=torch_dtype, \n    low_cpu_mem_usage=True, \n    use_safetensors=True\n)\nmodel.to(device)\n\nprocessor = AutoProcessor.from_pretrained(model_id)\n\n# 3. 创建语音识别管道\npipe = pipeline(\n    \"automatic-speech-recognition\",\n    model=model,\n    tokenizer=processor.tokenizer,\n    feature_extractor=processor.feature_extractor,\n    max_new_tokens=128,\n    torch_dtype=torch_dtype,\n    device=device,\n)\n\n# 4. 执行转录\n\n# 方式 A: 转录在线测试数据集中的一个样本\ndataset = load_dataset(\"hf-internal-testing\u002Flibrispeech_asr_dummy\", \"clean\", split=\"validation\")\nsample = dataset[0][\"audio\"]\nresult = pipe(sample)\nprint(\"在线样本转录结果:\", result[\"text\"])\n\n# 方式 B: 转录本地音频文件 (取消注释并替换文件名即可使用)\n# result = pipe(\"path\u002Fto\u002Fyour\u002Faudio.mp3\")\n# print(\"本地文件转录结果:\", result[\"text\"])\n```\n\n### 长音频处理提示\n如果需要转录超过 30 秒的长音频，`distil-large-v3` 支持顺序长形式（Sequential Long-Form）算法。只需将音频文件路径传入上述 `pipe` 即可，管道会自动处理滑动窗口推理，无需手动切片。","某初创团队正在开发一款实时英语会议记录 SaaS 服务，需要在云端低成本处理大量用户上传的短时段会议录音。\n\n### 没有 distil-whisper 时\n- **响应延迟高**：使用原版 Whisper large-v3 模型处理音频时，推理速度较慢，用户上传录音后往往需要等待数秒甚至更久才能看到转录结果，严重影响实时体验。\n- **服务器成本高昂**：由于模型参数量高达 15.5 亿，推理时需要占用大量 GPU 显存和计算资源，导致云服务商账单激增，初创团队难以承受。\n- **并发能力受限**：单张显卡同时能处理的请求数量有限，在会议高峰期容易出现排队拥堵，迫使团队不得不增加机器数量来维持服务稳定性。\n- **部署门槛较高**：大模型对内存要求苛刻，限制了其在边缘设备或低配服务器上的部署可能性，无法灵活扩展应用场景。\n\n### 使用 distil-whisper 后\n- **推理速度飞跃**：切换至 distil-large-v3 后，推理速度提升了 6 倍以上，会议录音几乎实现“秒级”转写，用户无感等待，交互流畅度大幅提升。\n- **运营成本骤降**：模型体积缩小了 49%，显著降低了单次推理的显存占用和计算开销，使得同等预算下可支持的并发用户数成倍增长。\n- **精度几乎无损**：在保持英文识别准确率与原版相差不到 1% 的前提下，成功实现了性能与效率的最佳平衡，确保了会议记录的可靠性。\n- **部署更加灵活**：更小的模型尺寸让团队可以考虑将服务部署在更低成本的实例上，甚至为未来拓展到端侧设备留下了技术空间。\n\ndistil-whisper 通过极致的效率优化，帮助团队在几乎不牺牲识别精度的情况下，大幅降低了延迟与成本，让高质量语音转写服务得以规模化落地。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_distil-whisper_98a7eb68.png","huggingface","Hugging Face","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhuggingface_90da21a4.png","The AI community building the future.",null,"https:\u002F\u002Fhuggingface.co\u002F","https:\u002F\u002Fgithub.com\u002Fhuggingface",[84,88,92],{"name":85,"color":86,"percentage":87},"Python","#3572A5",88.7,{"name":89,"color":90,"percentage":91},"Shell","#89e051",11.2,{"name":93,"color":94,"percentage":95},"Makefile","#427819",0,4069,351,"2026-04-14T10:05:51","MIT","未说明","非必需（支持 CPU），但推荐使用 NVIDIA GPU 以启用 float16 加速；显存需求取决于模型大小（distil-small.en 约 166M 参数，distil-large-v3 约 756M 参数）","未说明（建议使用 low_cpu_mem_usage=True 优化加载）",{"notes":104,"python":100,"dependencies":105},"该工具仅支持英语语音识别。加载模型时建议指定 torch_dtype=torch.float16（若使用 GPU）并开启 use_safetensors=True 和 low_cpu_mem_usage=True 以优化性能。长音频转录支持顺序长表单（sequential long-form）和分块长表单（chunked long-form）两种模式。",[106,107,108,109,110],"transformers>=4.35","torch","accelerate","datasets[audio]","safetensors",[21],[113,114,115],"audio","speech-recognition","whisper","2026-03-27T02:49:30.150509","2026-04-15T06:08:27.109850",[119,124,129,134,139,144],{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},33739,"目前是否支持 distil-large-v3 模型？","是的，项目已经支持 large-v3 版本。您可以在 Hugging Face Hub 上找到该模型，地址为：https:\u002F\u002Fhuggingface.co\u002Fdistil-whisper\u002Fdistil-large-v3。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdistil-whisper\u002Fissues\u002F36",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},33735,"Distil-Whisper 模型是否支持商业使用？","是的，Distil-Whisper 模型继承了 Whisper 的宽松 MIT 
许可证，允许商业使用。虽然训练数据集可能包含非商业限制（如 CC BY-NC-ND），但模型本身的许可证是独立的。维护者确认模型文件本身遵循 MIT 协议，用户可以放心将其用于商业应用程序中。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdistil-whisper\u002Fissues\u002F7",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},33736,"为什么我无法复现论文中报告的 WER（词错误率）结果？","如果您使用 PyTorch 脚本 `run_eval.py` 进行评估，得到的结果与论文中报告的 Flax\u002FTPU 结果存在细微差异（通常在 0.1% 以内）是正常的。这是因为矩阵乘法在 GPU (PyTorch) 和 TPU (Flax) 上的实现方式不同。只要差异在 0.1% 左右（例如论文报告 12.9%，实测 13.0%），即视为成功复现。确保使用最新的代码库（合并了相关 PR 后）以获得最佳一致性。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdistil-whisper\u002Fissues\u002F131",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},33737,"在使用 BetterTransformer 或 flash_attn 进行训练优化时遇到报错怎么办？","如果遇到关于 BetterTransformer 的警告，请升级至 `transformers>=4.36` 和 `torch>=2.1.1`，新版本已原生支持 `scaled_dot_product_attention`，无需再调用 `model.to_bettertransformer()`。如果设置 `--attn_type \"flash_attn_2\"` 报错，请注意参数名称已更新，应使用 `--attn_implementation`，且有效值为 `\"eager\"`, `\"sdpa\"`, 或 `\"flash_attention_2\"`（注意是 `flash_attention_2` 而非 `flash_attn_2`）。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdistil-whisper\u002Fissues\u002F93",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},33738,"在 Jetson Xavier 等边缘设备上运行模型时输出乱码或翻译错误如何解决？","这通常是因为使用了错误的分词器（Tokenizer）。Distil-Whisper 模型（如 `distil-whisper\u002Fdistil-medium.en`）使用的分词器与原始 OpenAI Whisper 模型（如 `openai\u002Fwhisper-base`）不同。解决方案是：务必加载与蒸馏模型对应的专用分词器来解码生成的 ID。请在代码中显式加载 `AutoTokenizer.from_pretrained(\"distil-whisper\u002Fdistil-medium.en\")` 而不是原始模型的分词器。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdistil-whisper\u002Fissues\u002F30",{"id":145,"question_zh":146,"answer_zh":147,"source_url":133},33740,"如何在评估脚本中正确配置参数以匹配论文结果？","建议使用以下命令配置进行评估，以确保环境设置正确：\n```bash\npython run_eval.py \\\n    --model_name_or_path \"distil-whisper\u002Fdistil-large-v2\" \\\n    --dataset_name \"distil-whisper\u002Fcommon_voice_13_0\" \\\n    --dataset_config_name \"en\" \\\n    --dataset_split_name \"test\" \\\n    --text_column_name \"text\" \\\n    --batch_size 128 \\\n    --dtype \"bfloat16\" \\\n    --generation_max_length 256 \\\n    --language \"en\" \\\n    --streaming True\n```\n注意数据类型应设为 `bfloat16`，并开启流式处理以适配大数据集。",[]]