[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Alexander-H-Liu--End-to-end-ASR-Pytorch":3,"tool-Alexander-H-Liu--End-to-end-ASR-Pytorch":64},[4,23,32,40,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,2,"2026-04-10T11:13:16",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},4128,"GPT-SoVITS","RVC-Boss\u002FGPT-SoVITS","GPT-SoVITS 是一款强大的开源语音合成与声音克隆工具，旨在让用户仅需极少量的音频数据即可训练出高质量的个性化语音模型。它核心解决了传统语音合成技术依赖海量录音数据、门槛高且成本大的痛点，实现了“零样本”和“少样本”的快速建模：用户只需提供 5 秒参考音频即可即时生成语音，或使用 1 分钟数据进行微调，从而获得高度逼真且相似度极佳的声音效果。\n\n该工具特别适合内容创作者、独立开发者、研究人员以及希望为角色配音的普通用户使用。其内置的友好 WebUI 界面集成了人声伴奏分离、自动数据集切片、中文语音识别及文本标注等辅助功能，极大地降低了数据准备和模型训练的技术门槛，让非专业人士也能轻松上手。\n\n在技术亮点方面，GPT-SoVITS 不仅支持中、英、日、韩、粤语等多语言跨语种合成，还具备卓越的推理速度，在主流显卡上可实现实时甚至超实时的生成效率。无论是需要快速制作视频配音，还是进行多语言语音交互研究，GPT-SoVITS 都能以极低的数据成本提供专业级的语音合成体验。",56375,3,"2026-04-05T22:15:46",[21],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},2863,"TTS","coqui-ai\u002FTTS","🐸TTS 是一款功能强大的深度学习文本转语音（Text-to-Speech）开源库，旨在将文字自然流畅地转化为逼真的人声。它解决了传统语音合成技术中声音机械生硬、多语言支持不足以及定制门槛高等痛点，让高质量的语音生成变得触手可及。\n\n无论是希望快速集成语音功能的开发者，还是致力于探索前沿算法的研究人员，亦或是需要定制专属声音的数据科学家，🐸TTS 都能提供得力支持。它不仅预置了覆盖全球 1100 
多种语言的训练模型，让用户能够即刻上手，还提供了完善的工具链，支持用户利用自有数据训练新模型或对现有模型进行微调，轻松实现特定风格的声音克隆。\n\n在技术亮点方面，🐸TTS 表现卓越。其最新的 ⓍTTSv2 模型支持 16 种语言，并在整体性能上大幅提升，实现了低于 200 毫秒的超低延迟流式输出，极大提升了实时交互体验。此外，它还无缝集成了 🐶Bark、🐢Tortoise 等社区热门模型，并支持调用上千个 Fairseq 模型，展现了极强的兼容性与扩展性。配合丰富的数据集分析与整理工具，🐸TTS 已成为科研与生产环境中备受信赖的语音合成解决方案。",44971,"2026-04-03T14:47:02",[21,20,13],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":29,"last_commit_at":46,"category_tags":47,"status":22},2375,"LocalAI","mudler\u002FLocalAI","LocalAI 是一款开源的本地人工智能引擎，旨在让用户在任意硬件上轻松运行各类 AI 模型，包括大语言模型、图像生成、语音识别及视频处理等。它的核心优势在于彻底打破了高性能计算的门槛，无需昂贵的专用 GPU，仅凭普通 CPU 或常见的消费级显卡（如 NVIDIA、AMD、Intel 及 Apple Silicon）即可部署和运行复杂的 AI 任务。\n\n对于担心数据隐私的用户而言，LocalAI 提供了“隐私优先”的解决方案，确保所有数据处理均在本地基础设施内完成，无需上传至云端。同时，它完美兼容 OpenAI、Anthropic 等主流 API 接口，这意味着开发者可以无缝迁移现有应用，直接利用本地资源替代云服务，既降低了成本又提升了可控性。\n\nLocalAI 内置了超过 35 种后端支持（如 llama.cpp、vLLM、Whisper 等），并集成了自主 AI 代理、工具调用及检索增强生成（RAG）等高级功能，且具备多用户管理与权限控制能力。无论是希望保护敏感数据的企业开发者、进行算法实验的研究人员，还是想要在个人电脑上体验最新 AI 技术的极客玩家，都能通过 LocalAI 获",44782,"2026-04-02T22:14:26",[13,21,19,17,20,14,16],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":29,"last_commit_at":54,"category_tags":55,"status":22},3108,"bark","suno-ai\u002Fbark","Bark 是由 Suno 推出的开源生成式音频模型，能够根据文本提示创造出高度逼真的多语言语音、音乐、背景噪音及简单音效。与传统仅能朗读文字的语音合成工具不同，Bark 基于 Transformer 架构，不仅能模拟说话，还能生成笑声、叹息、哭泣等非语言声音，甚至能处理带有情感色彩和语气停顿的复杂文本，极大地丰富了音频表达的可能性。\n\n它主要解决了传统语音合成声音机械、缺乏情感以及无法生成非语音类音效的痛点，让创作者能通过简单的文字描述获得生动自然的音频素材。无论是需要为视频配音的内容创作者、探索多模态生成的研究人员，还是希望快速原型设计的开发者，都能从中受益。普通用户也可通过集成的演示页面轻松体验其神奇效果。\n\n技术亮点方面，Bark 支持商业使用（MIT 许可），并在近期更新中实现了显著的推理速度提升，同时提供了适配低显存 GPU 的版本，降低了使用门槛。此外，社区还建立了丰富的提示词库，帮助用户更好地驾驭模型生成特定风格的声音。只需几行 Python 代码，即可将创意文本转化为高质量音频，是连接文字与声音世界的强大桥梁。",39067,"2026-04-04T03:33:35",[21],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":29,"last_commit_at":62,"category_tags":63,"status":22},5908,"ChatTTS","2noise\u002FChatTTS","ChatTTS 
是一款专为日常对话场景打造的生成式语音模型，特别适用于大语言模型助手等交互式应用。它主要解决了传统文本转语音（TTS）技术在对话中缺乏自然感、情感表达单一以及难以处理停顿、笑声等细微语气的问题，让机器生成的语音听起来更像真人在聊天。\n\n这款工具非常适合开发者、研究人员以及希望为应用增添自然语音交互功能的设计师使用。普通用户也可以通过社区开发的衍生产品体验其能力。ChatTTS 的核心亮点在于其对对话任务的深度优化：它不仅支持中英文双语，还能精准控制韵律细节，自动生成自然的 laughter（笑声）、pauses（停顿）和 interjections（插入语），从而实现多说话人的互动对话效果。在韵律表现上，ChatTTS 超越了大多数开源 TTS 模型。目前开源版本基于 4 万小时数据预训练而成，虽主要用于学术研究与教育目的，但已展现出强大的潜力，并支持流式音频生成与零样本推理，为后续的多情绪控制等进阶功能奠定了基础。",39042,"2026-04-09T11:54:03",[19,17,20,21],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":86,"forks":87,"last_commit_at":88,"license":89,"difficulty_score":29,"env_os":90,"env_gpu":91,"env_ram":92,"env_deps":93,"category_tags":102,"github_topics":78,"view_count":10,"oss_zip_url":78,"oss_zip_packed_at":78,"status":22,"created_at":103,"updated_at":104,"faqs":105,"releases":141},6546,"Alexander-H-Liu\u002FEnd-to-end-ASR-Pytorch","End-to-end-ASR-Pytorch","This is an open source project (formerly named Listen, Attend and Spell - PyTorch Implementation) for end-to-end ASR implemented with Pytorch, the well known deep learning toolkit.","End-to-end-ASR-Pytorch 是一个基于 PyTorch 构建的开源端到端自动语音识别（ASR）项目，前身是著名的“听、关注并拼写”（Listen, Attend and Spell）实现。它致力于解决将原始音频直接转换为文本的核心难题，支持从特征提取到模型训练及解码的全流程开发。\n\n该工具特别适合人工智能研究人员和深度学习开发者使用，尤其是那些希望复现经典算法或探索最新语音识别技术的团队。普通用户若无编程基础则较难直接上手。其技术亮点在于高度的灵活性与模块化：不仅实现了基于 Seq2seq 和 CTC 的多种主流架构，还支持两者混合建模；提供在线特征提取、子词编码以及束搜索解码等高级功能。此外，项目采用 YAML 配置文件管理超参数，并集成 TensorBoard 可视化训练过程与注意力对齐效果，极大提升了实验效率与可解释性。无论是学术研究还是工程原型验证，End-to-end-ASR-Pytorch 都是一个功能完备且易于扩展的理想起点。","# End-to-end Automatic Speech Recognition Systems - PyTorch Implementation\n\nThis is an open source project (formerly named **Listen, Attend and Spell - PyTorch 
Implementation**) for end-to-end ASR by [Tzu-Wei Sung](https:\u002F\u002Fgithub.com\u002FWindQAQ) and me.\nImplementation was mostly done with Pytorch, the well known deep learning toolkit.\n\nThe end-to-end ASR was based on Listen, Attend and Spell\u003Csup>[1](#Reference)\u003C\u002Fsup>. Multiple techniques proposed recently were also implemented, serving as additional plug-ins for better performance. For the list of techniques implemented, please refer to the [highlights](#Highlights), [configuration](config\u002F) and [references](#Reference).\n\nFeel free to use\u002Fmodify them, any bug report or improvement suggestion will be appreciated. If you find this project helpful for your research, please do consider to cite [our paper](#Citation), thanks!\n\n## Highlights\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlexander-H-Liu_End-to-end-ASR-Pytorch_readme_fc18f1cda282.png\" width=\"570\" height=\"300\">\n\u003C\u002Fp>\n\n- Feature Extraction\n    - On-the-fly feature extraction using torchaudio as backend\n    - Character\u002Fsubword\u003Csup>[2](#Reference)\u003C\u002Fsup>\u002Fword encoding of text\n\n- Training End-to-end ASR \n    - Seq2seq ASR with different types of encoder\u002Fattention\u003Csup>[3](#Reference)\u003C\u002Fsup>\n    - CTC-based ASR\u003Csup>[4](#Reference)\u003C\u002Fsup>, which can also be hybrid\u003Csup>[5](#Reference)\u003C\u002Fsup> with the former\n    - *yaml*-styled model construction and hyper parameters setting\n    -  Training process visualization with [TensorBoard](https:\u002F\u002Fwww.tensorflow.org\u002Fguide\u002Fsummaries_and_tensorboard), including attention alignment\n\n- Speech Recognition with End-to-end ASR (i.e. 
Decoding)\n    - Beam search decoding\n    - RNN language model training and joint decoding for ASR\u003Csup>[6](#Reference)\u003C\u002Fsup>\n    - Joint CTC-attention based decoding\u003Csup>[6](#Reference)\u003C\u002Fsup>\n    - Greedy decoding & CTC beam search contributed by [Heng-Jui (Harry) Chang](https:\u002F\u002Fgithub.com\u002Fvectominist)\n\n*You may checkout some example log files with TensorBoard by downloading them from [`coming soon`]()*\n\n## Dependencies\n\n- Python 3\n- Computing power (high-end GPU) and memory space (both RAM\u002FGPU's RAM) is **extremely important** if you'd like to train your own model.\n- Required packages and their use are listed [requirements.txt](requirements.txt).\n\n## Instructions\n\n\n\n### Step 0. Preprocessing - Generate Text Encoder\n\n*You may use the text encoders provided at [`tests\u002Fsample_data\u002F`](tests\u002Fsample_data\u002F) and skip this step.*\n\nThe subword model is trained with `sentencepiece`. As for character\u002Fword model, you have to generate the vocabulary file containing the vocabulary line by line. You may also use `util\u002Fgenerate_vocab_file.py` so that you only have to prepare a text file, which contains all texts you want to use for generating the vocabulary file or subword model. Please update `data.text.*` field in the config file if you want to change the mode or vocabulary file. For subword model, use the one ended with `.model` as `vocab_file`.\n\n```shell=zsh\npython3 util\u002Fgenerate_vocab_file.py --input_file TEXT_FILE \\\n                                    --output_file OUTPUT_FILE \\\n                                    --vocab_size VOCAB_SIZE \\\n                                    --mode MODE\n```\nFor more details, please refer to `python3 util\u002Fgenerate_vocab_file.py -h`.\n\n### Step 1. Configuring - Model Design & Hyperparameter Setup\n\n\nAll the parameters related to training\u002Fdecoding will be stored in a yaml file. 
Hyperparameter tuning and experiments can be managed easily this way. See [documentation and examples](config\u002F) for the exact format. **Note that the example configs provided were not fine-tuned**, you may want to write your own config for best performance.\n\n\n### Step 2. Training (End-to-end ASR or RNN-LM) \n\nOnce the config file is ready, run the following command to train end-to-end ASR (or language model)\n```\npython3 main.py --config \u003Cpath of config file> \n```\n\nFor example, train an ASR on LibriSpeech and watch the log with\n```shell=zsh\n# Checkout options available\npython3 main.py -h\n# Start training with specific config\npython3 main.py --config config\u002Flibri\u002Fasr_example.yaml\n# Open TensorBoard to see log\ntensorboard --logdir log\u002F\n# Train an external language model\npython3 main.py --config config\u002Flibri\u002Flm_example.yaml --lm\n```\n\nAll settings will be parsed from the config file automatically to start training, the log file can be accessed through TensorBoard. ***Please notice that the error rate reported on the TensorBoard is biased (see issue #10), you should run the testing phase in order to get the true performance of model***. \nOptions available in this phase include the followings\n\n| Options | Description                                                                                   |\n|---------|-----------------------------------------------------------------------------------------------|\n| config  | Path of config file. |\n| seed    | Random seed, **note this is an option that affects the result**                                         |\n| name    | Experiments for logging and saving model. 
\u003Cp> By default it's `\u003Cname of config file>_\u003Crandom seed>` |\n| logdir  | Path to store training logs (log files for tensorboard), default `log\u002F`.|\n| ckpdir  | The directory to store model, default `ckpt\u002F`.|\n| njobs   | Number of workers used for data loader, consider increasing this if you find data preprocessing takes most of your training time, default `6`.|\n| no-ping | Disable the pin-memory option of pytorch dataloader. |\n| cpu     | CPU-only mode, not recommended, use it for debugging.|\n| no-msg  | Hide all messages from stdout. |\n| lm      | Switch to rnnlm training mode. |\n| test    | Switch to decoding mode (do not use during training phase) |\n|cudnn-ctc| Use CuDNN as the backend of PyTorch CTC. Unstable, see [this issue](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fissues\u002F26797), not sure if solved in [latest Pytorch with cudnn version > 7.6](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fcommit\u002Ff461184505149560803855f3a40d9e0e54c64826)|\n\n\n\n### Step 3. Speech Recognition & Performance Evaluation\n\nTo test a model, run the following command\n```\npython3 main.py --config \u003Cpath of config file> --test --njobs \u003Cint>\n```\n***Please notice that the decoding is performed without batch processing, use more workers to speed up at the cost of using more RAM.***\nBy default, the recognition result is stored at `result\u002F\u003Cname>\u002F` as two csv files with auto-naming according to the decoding config file. `output.csv` will store the best hypothesis provided by ASR and `beam.csv` will record the top hypotheses during beam search. The result file may be evaluated with `eval.py`. 
For example, test the example ASR trained on LibriSpeech and check performance with\n```\npython3 main.py --config config\u002Flibri\u002Fdecode_example.yaml --test --njobs 8\n# Check WER\u002FCER\npython3 eval.py --file result\u002Fasr_example_sd0_dev_output.csv\n```\n\nMost of the options work similarly to the training phase, except the following:\n\n| Options | Description |\n|---------|-------------|\n| test    | *Must be enabled*|\n| config  | Path to the decoding config file.|\n| outdir  | Path to store the decoding result.|\n| njobs   | Number of threads used for decoding, very important in terms of efficiency. A larger value gives faster decoding at the cost of more RAM\u002FGPU RAM.    |\n\n\n## Troubleshooting \n\n- Loss becomes `nan` right after training begins\n    \n    For CTC, `len(pred)>len(label)` is necessary.\n    Also consider setting `zero_infinity=True` for `torch.nn.CTCLoss`\n\n\n\n\n\n## ToDo\n\n- Provide examples\n- Pure CTC training \u002F CTC beam decode bug (out-of-candidate)\n- Greedy decoding\n- Customized dataset\n- Util. scripts\n- Finish CLM migration and reference\n- Store preprocessed dataset on RAM\n\n## Acknowledgements \n- Parts of the implementation refer to [ESPnet](https:\u002F\u002Fgithub.com\u002Fespnet\u002Fespnet), a great end-to-end speech processing toolkit by Watanabe *et al*.\n- Special thanks to [William Chan](http:\u002F\u002Fwilliamchan.ca\u002F), the first author of LAS, for answering my questions during implementation.\n- Thanks [xiaoming](https:\u002F\u002Fgithub.com\u002Flezasantaizi), [Odie Ko](https:\u002F\u002Fgithub.com\u002Fodie2630463), [b-etienne](https:\u002F\u002Fgithub.com\u002Fb-etienne), [Jinserk Baik](https:\u002F\u002Fgithub.com\u002Fjinserk) and [Zhong-Yi Li](https:\u002F\u002Fgithub.com\u002FChung-I) for identifying several issues in our implementation. \n\n## Reference\n\n1. [Listen, Attend and Spell](https:\u002F\u002Farxiv.org\u002Fabs\u002F1508.01211v2), W Chan *et al.*\n2. 
[Neural Machine Translation of Rare Words with Subword Units](http:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP16-1162), R Sennrich *et al.*\n3. [Attention-Based Models for Speech Recognition](https:\u002F\u002Farxiv.org\u002Fabs\u002F1506.07503), J Chorowski *et al*.\n4. [Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks](https:\u002F\u002Fwww.cs.toronto.edu\u002F~graves\u002Ficml_2006.pdf), A Graves *et al*.\n5. [Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1609.06773), S Kim *et al.* \n6.  [Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.02737), T Hori *et al.* \n\n## Citation\n\n```\n@inproceedings{liu2019adversarial,\n  title={Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model},\n  author={Liu, Alexander and Lee, Hung-yi and Lee, Lin-shan},\n  booktitle={Acoustics, Speech and Signal Processing (ICASSP)},\n  year={2019},\n  organization={IEEE}\n}\n\n@misc{alex2019sequencetosequence,\n    title={Sequence-to-sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding},\n    author={Alexander H. 
Liu and Tzu-Wei Sung and Shun-Po Chuang and Hung-yi Lee and Lin-shan Lee},\n    year={2019},\n    eprint={1910.12740},\n    archivePrefix={arXiv},\n    primaryClass={cs.CL}\n}\n```\n","# 端到端自动语音识别系统 - PyTorch 实现\n\n这是一个开源项目（原名 **Listen, Attend and Spell - PyTorch Implementation**），由 [Tzu-Wei Sung](https:\u002F\u002Fgithub.com\u002FWindQAQ) 和我共同开发的端到端 ASR 项目。  \n实现主要基于广为人知的深度学习框架 PyTorch。\n\n该端到端 ASR 的基础是 Listen, Attend and Spell\u003Csup>[1](#Reference)\u003C\u002Fsup>。此外，我们还实现了近年来提出的多种技术，作为插件以提升性能。已实现的技术列表请参阅 [亮点](#Highlights)、[配置文件](config\u002F) 和 [参考文献](#Reference)。\n\n欢迎使用或修改这些代码，任何 bug 报告或改进建议都将不胜感激。如果您觉得本项目对您的研究有所帮助，请考虑引用我们的论文\u003Csup>[Citation]\u003C\u002Fsup>，谢谢！\n\n## 亮点\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlexander-H-Liu_End-to-end-ASR-Pytorch_readme_fc18f1cda282.png\" width=\"570\" height=\"300\">\n\u003C\u002Fp>\n\n- 特征提取\n    - 使用 torchaudio 作为后端进行实时特征提取\n    - 文本的字符\u002F子词\u003Csup>[2](#Reference)\u003C\u002Fsup>\u002F词编码\n\n- 训练端到端 ASR\n    - 具有不同编码器\u002F注意力机制\u003Csup>[3](#Reference)\u003C\u002Fsup>的序列到序列 ASR\n    - 基于 CTC 的 ASR\u003Csup>[4](#Reference)\u003C\u002Fsup>,也可与前者混合\u003Csup>[5](#Reference)\u003C\u002Fsup>\n    - *yaml* 风格的模型构建和超参数设置\n    - 使用 [TensorBoard](https:\u002F\u002Fwww.tensorflow.org\u002Fguide\u002Fsummaries_and_tensorboard) 可视化训练过程，包括注意力对齐\n\n- 使用端到端 ASR 进行语音识别（即解码）\n    - 波束搜索解码\n    - RNN 语言模型训练及与 ASR 的联合解码\u003Csup>[6](#Reference)\u003C\u002Fsup>\n    - 联合 CTC-注意力机制的解码\u003Csup>[6](#Reference)\u003C\u002Fsup>\n    - 贪心解码和 CTC 波束搜索由 [Heng-Jui (Harry) Chang](https:\u002F\u002Fgithub.com\u002Fvectominist) 贡献\n\n*您可以通过下载 [`coming soon`]() 中的示例日志文件，使用 TensorBoard 查看相关可视化内容*\n\n## 依赖项\n\n- Python 3\n- 如果您想训练自己的模型，计算能力（高端 GPU）和内存空间（RAM 和 GPU 显存）都至关重要。\n- 所需的软件包及其用途列在 [requirements.txt](requirements.txt) 中。\n\n## 使用说明\n\n\n\n### 步骤 0. 
预处理 - 生成文本编码器\n\n*您可以直接使用 [`tests\u002Fsample_data\u002F`](tests\u002Fsample_data\u002F) 目录下提供的文本编码器，跳过此步骤。*\n\n子词模型使用 `sentencepiece` 训练。对于字符\u002F词模型，您需要逐行生成包含词汇表的文件。也可以使用 `util\u002Fgenerate_vocab_file.py` 脚本，只需准备一个文本文件，其中包含所有用于生成词汇表或子词模型的文本即可。如果要更改模式或词汇表文件，请更新配置文件中的 `data.text.*` 字段。对于子词模型，应使用以 `.model` 结尾的文件作为 `vocab_file`。\n\n```shell=zsh\npython3 util\u002Fgenerate_vocab_file.py --input_file TEXT_FILE \\\n                                    --output_file OUTPUT_FILE \\\n                                    --vocab_size VOCAB_SIZE \\\n                                    --mode MODE\n```\n更多详细信息请参阅 `python3 util\u002Fgenerate_vocab_file.py -h`。\n\n### 步骤 1. 配置 - 模型设计与超参数设置\n\n\n所有与训练\u002F解码相关的参数都将存储在一个 yaml 文件中。通过这种方式可以轻松管理超参数调优和实验。具体格式请参阅 [文档和示例](config\u002F)。**请注意，提供的示例配置并未经过精细调优**，为了获得最佳性能，建议您编写自己的配置文件。\n\n### 步骤 2. 训练（端到端 ASR 或 RNN-LM）\n\n配置文件准备好后，运行以下命令即可开始训练端到端 ASR（或语言模型）：\n```\npython3 main.py --config \u003C配置文件路径> \n```\n\n例如，在 LibriSpeech 数据集上训练一个 ASR，并使用以下命令查看日志：\n```shell=zsh\n# 查看可用选项\npython3 main.py -h\n# 使用特定配置开始训练\npython3 main.py --config config\u002Flibri\u002Fasr_example.yaml\n# 打开 TensorBoard 查看日志\ntensorboard --logdir log\u002F\n# 训练外部语言模型\npython3 main.py --config config\u002Flibri\u002Flm_example.yaml --lm\n```\n\n所有设置将自动从配置文件中解析并开始训练，训练日志可通过 TensorBoard 查看。***请注意，TensorBoard 上显示的错误率存在偏差（见 issue #10），您应运行测试阶段以获得模型的真实性能***。  \n本阶段可用的选项包括：\n\n| 选项 | 描述                                                                                   |\n|------|----------------------------------------------------------------------------------------|\n| config  | 配置文件路径。 |\n| seed    | 随机种子，**请注意，此选项会影响结果**                                         |\n| name    | 用于记录和保存模型的实验名称。\u003Cp>默认为 `\u003C配置文件名>_\u003C随机种子>` |\n| logdir  | 存储训练日志（TensorBoard 日志文件）的路径，默认为 `log\u002F`。|\n| ckpdir  | 存放模型的目录，默认为 `ckpt\u002F`。|\n| njobs   | 数据加载器使用的工作者数量，如果发现数据预处理占用了大部分训练时间，可适当增加，默认使用 `6`。|\n| no-ping | 禁用 PyTorch 数据加载器的固定内存选项。 |\n| cpu     | 仅 CPU 模式，不推荐使用，可用于调试。|\n| 
no-msg  | 隐藏所有标准输出消息。 |\n| lm      | 切换到 RNNLM 训练模式。 |\n| test    | 切换到解码模式（请勿在训练阶段使用） |\n|cudnn-ctc| 使用 CuDNN 作为 PyTorch CTC 的后端。不稳定，详见 [此问题](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fissues\u002F26797)，不确定是否已在 [最新版本的 PyTorch 中修复，且 cuDNN 版本 > 7.6](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fcommit\u002Ff461184505149560803855f3a40d9e0e54c64826)|\n\n\n\n### 步骤 3. 语音识别与性能评估\n\n要测试模型，请运行以下命令：\n```\npython3 main.py --config \u003C配置文件路径> --test --njobs \u003C整数>\n```\n***请注意，解码过程不采用批处理方式，因此需要使用更多的工作者来加快速度，但这会消耗更多内存。***\n默认情况下，识别结果将存储在 `result\u002F\u003Cname>\u002F` 目录下，以两个 CSV 文件的形式保存，并根据解码配置文件自动命名。`output.csv` 将存储 ASR 提供的最佳假设，而 `beam.csv` 将记录波束搜索过程中的前几条假设。结果文件可以使用 `eval.py` 进行评估。例如，测试在 LibriSpeech 上训练的示例 ASR，并检查其性能：\n```\npython3 main.py --config config\u002Flibri\u002Fdecode_example.yaml --test --njobs 8\n\n# 检查 WER\u002FCER\npython3 eval.py --file result\u002Fasr_example_sd0_dev_output.csv\n```\n\n大多数选项与训练阶段类似，除了以下几点：\n\n| 选项 | 描述 |\n|------|------|\n| test | *必须启用* |\n| config | 解码配置文件的路径。 |\n| outdir | 存储解码结果的路径。 |\n| njobs | 用于解码的线程数，对效率非常重要。数值越大，解码速度越快，但会占用更多的内存\u002FGPU显存。 |\n\n\n## 故障排除\n\n- 训练刚开始时损失就变为 `nan`\n  \n    对于 CTC 损失，`len(pred)>len(label)` 是必要的条件。此外，可以考虑将 `torch.nn.CTCLoss` 的 `zero_infinity` 参数设置为 `True`。\n\n\n\n\n\n## 待办事项\n\n- 提供示例\n- 纯 CTC 训练 \u002F CTC 束搜索解码中的问题（候选列表外）\n- 贪心解码\n- 自定义数据集\n- 工具脚本\n- 完成 CLM 的迁移和参考实现\n- 将预处理后的数据集存储在内存中\n\n## 致谢\n- 部分实现参考了 [ESPnet](https:\u002F\u002Fgithub.com\u002Fespnet\u002Fespnet)，这是一款由 Watanabe 等人开发的强大端到端语音处理工具包。\n- 特别感谢 LAS 的第一作者 [William Chan](http:\u002F\u002Fwilliamchan.ca\u002F)，他在实现过程中解答了我的疑问。\n- 感谢 [xiaoming](https:\u002F\u002Fgithub.com\u002Flezasantaizi)、[Odie Ko](https:\u002F\u002Fgithub.com\u002Fodie2630463)、[b-etienne](https:\u002F\u002Fgithub.com\u002Fb-etienne)、[Jinserk Baik](https:\u002F\u002Fgithub.com\u002Fjinserk) 和 [Zhong-Yi Li](https:\u002F\u002Fgithub.com\u002FChung-I)，他们指出了我们实现中的几个问题。\n\n## 参考文献\n\n1. 
[Listen, Attend and Spell](https:\u002F\u002Farxiv.org\u002Fabs\u002F1508.01211v2)，W Chan 等人。\n2. [Neural Machine Translation of Rare Words with Subword Units](http:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP16-1162)，R Sennrich 等人。\n3. [Attention-Based Models for Speech Recognition](https:\u002F\u002Farxiv.org\u002Fabs\u002F1506.07503)，J Chorowski 等人。\n4. [Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks](https:\u002F\u002Fwww.cs.toronto.edu\u002F~graves\u002Ficml_2006.pdf)，A Graves 等人。\n5. [Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1609.06773)，S Kim 等人。\n6. [Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.02737)，T Hori 等人。\n\n## 引用\n\n```\n@inproceedings{liu2019adversarial,\n  title={Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model},\n  author={Liu, Alexander and Lee, Hung-yi and Lee, Lin-shan},\n  booktitle={Acoustics, Speech and Signal Processing (ICASSP)},\n  year={2019},\n  organization={IEEE}\n}\n\n@misc{alex2019sequencetosequence,\n    title={Sequence-to-sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding},\n    author={Alexander H. 
Liu and Tzu-Wei Sung and Shun-Po Chuang and Hung-yi Lee and Lin-shan Lee},\n    year={2019},\n    eprint={1910.12740},\n    archivePrefix={arXiv},\n    primaryClass={cs.CL}\n}\n```","# End-to-end-ASR-Pytorch 快速上手指南\n\n## 环境准备\n\n在开始之前，请确保你的开发环境满足以下要求：\n\n*   **操作系统**：Linux \u002F macOS (推荐 Linux)\n*   **Python 版本**：Python 3.x\n*   **硬件要求**：\n    *   **GPU**：强烈建议使用高性能 NVIDIA GPU 进行模型训练。\n    *   **内存**：需要充足的系统内存 (RAM) 和显存 (GPU RAM)，否则训练过程可能因内存不足而失败。\n*   **核心依赖**：\n    *   PyTorch (深度学习框架)\n    *   torchaudio (音频特征提取后端)\n    *   sentencepiece (子词模型训练)\n    *   TensorBoard (训练过程可视化)\n    *   其他依赖详见项目根目录下的 `requirements.txt`。\n\n> **提示**：国内用户安装 PyTorch 时，建议访问 [PyTorch 官网](https:\u002F\u002Fpytorch.org\u002F) 选择对应的 CUDA 版本，并使用清华或中科大镜像源加速安装。\n\n## 安装步骤\n\n1.  **克隆项目代码**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002FWindQAQ\u002FEnd-to-end-ASR-Pytorch.git\n    cd End-to-end-ASR-Pytorch\n    ```\n\n2.  **安装 Python 依赖包**\n    建议使用虚拟环境（如 conda 或 venv）以避免冲突。\n    ```bash\n    pip install -r requirements.txt\n    ```\n    *(国内加速方案)*：\n    ```bash\n    pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n3.  
**验证安装**\n    运行帮助命令确认环境正常：\n    ```bash\n    python3 main.py -h\n    ```\n\n## 基本使用\n\n本项目采用 YAML 配置文件管理所有超参数和模型结构。以下是从零开始训练一个简单 ASR 模型的流程。\n\n### 第一步：生成文本编码器 (可选)\n如果你使用项目提供的示例数据 (`tests\u002Fsample_data\u002F`)，可跳过此步。若需自定义数据集，需先生成词汇表或子词模型。\n\n```bash\npython3 util\u002Fgenerate_vocab_file.py --input_file TEXT_FILE \\\n                                    --output_file OUTPUT_FILE \\\n                                    --vocab_size VOCAB_SIZE \\\n                                    --mode MODE\n```\n*   `TEXT_FILE`: 包含所有训练文本的文件。\n*   `MODE`: 可选择 `char` (字符), `word` (单词), 或 `unigram` (子词，需配合 sentencepiece)。\n*   生成的 `.model` 文件需在配置文件中指定为 `vocab_file`。\n\n### 第二步：配置模型\n编辑 `config\u002F` 目录下的 YAML 文件（例如 `config\u002Flibri\u002Fasr_example.yaml`）。\n*   修改数据路径、模型架构、学习率等参数。\n*   **注意**：示例配置未经精细调优，生产环境请根据实际数据调整。\n\n### 第三步：开始训练\n使用配置好的 YAML 文件启动训练，并可通过 TensorBoard 实时监控。\n\n```bash\n# 启动训练 (以 LibriSpeech 示例配置为例)\npython3 main.py --config config\u002Flibri\u002Fasr_example.yaml\n\n# (可选) 启动外部语言模型 (RNN-LM) 训练\npython3 main.py --config config\u002Flibri\u002Flm_example.yaml --lm\n```\n\n**查看训练日志：**\n在新终端窗口运行以下命令打开 TensorBoard：\n```bash\ntensorboard --logdir log\u002F\n```\n> **注意**：TensorBoard 中显示的误差率可能存在偏差，请以测试阶段的结果为准。\n\n### 第四步：测试与评估\n训练完成后，使用测试集评估模型性能（解码）。\n\n```bash\n# 执行解码测试 (建议增加 njobs 以提高速度)\npython3 main.py --config config\u002Flibri\u002Fdecode_example.yaml --test --njobs 8\n```\n\n解码结果默认保存在 `result\u002F\u003Cname>\u002F` 目录下：\n*   `output.csv`: 存储最佳识别结果。\n*   `beam.csv`: 存储 Beam Search 过程中的顶部候选结果。\n\n**计算字错率 (CER) 或词错率 (WER)：**\n```bash\npython3 eval.py --file result\u002Fasr_example_sd0_dev_output.csv\n```","某初创团队正在为医疗问诊系统开发定制化的语音转文字引擎，需处理大量带有专业术语的医生口述录音。\n\n### 没有 End-to-end-ASR-Pytorch 时\n- 团队需从零搭建复杂的序列到序列模型，手动编写编码器、注意力机制及 CTC 损失函数，开发周期长达数月且极易出错。\n- 缺乏统一的配置文件管理，每次调整超参数或切换模型结构（如从纯 Attention 改为混合 CTC-Attention）都需要修改大量底层代码，实验迭代效率极低。\n- 训练过程如同“黑盒”，无法直观查看注意力对齐效果或损失曲线，难以定位模型为何无法识别特定的医学术语。\n- 解码策略单一，仅支持简单的贪婪搜索，导致在噪音环境下识别准确率大幅下降，且难以集成外部语言模型进行优化。\n\n### 使用 
End-to-end-ASR-Pytorch 后\n- 直接复用基于 Listen, Attend and Spell 架构的成熟代码，利用内置的多种编码器和注意力模块，将核心模型搭建时间缩短至几天。\n- 通过 YAML 文件即可灵活定义模型结构和超参数，轻松实现混合 CTC-Attention 模型的快速切换与对比实验，大幅提升研发迭代速度。\n- 集成 TensorBoard 可视化面板，实时监控训练进度并观察注意力对齐图，迅速发现并修正模型对长尾医学词汇的关注偏差。\n- 启用束搜索（Beam Search）及联合 CTC-注意力解码功能，并结合 RNN 语言模型进行联合推理，显著提升了复杂医疗场景下的识别鲁棒性。\n\nEnd-to-end-ASR-Pytorch 通过提供模块化、可配置且可视化的全链路解决方案，让团队能以最低成本快速构建出高精度的垂直领域语音识别系统。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlexander-H-Liu_End-to-end-ASR-Pytorch_cd6102c9.png","Alexander-H-Liu","Alexander H. Liu","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FAlexander-H-Liu_a0cb7ebd.jpg",null,"https:\u002F\u002Falexander-h-liu.github.io\u002F","https:\u002F\u002Fgithub.com\u002FAlexander-H-Liu",[82],{"name":83,"color":84,"percentage":85},"Python","#3572A5",100,1214,315,"2026-04-06T16:03:48","MIT","未说明","必需（训练自有模型时极度重要），具体型号和显存大小未说明，需支持 PyTorch CUDA 后端","未说明（文中强调训练时 RAM 和 GPU 显存空间极度重要，解码时增加线程数会消耗更多 RAM）",{"notes":94,"python":95,"dependencies":96},"1. 训练自有模型对计算能力（高端 GPU）和内存空间（RAM\u002FGPU 显存）要求极高。2. 解码过程默认不使用批处理，可通过增加工作线程数（njobs）加速，但会以消耗更多 RAM 为代价。3. TensorBoard 报告的错误率存在偏差，需运行测试阶段以获取模型真实性能。4. 若使用 CTC 损失函数出现 NaN，需确保预测长度大于标签长度，或设置 zero_infinity=True。5. 使用 CuDNN 作为 CTC 后端可能不稳定。","3",[97,98,99,100,101],"torch","torchaudio","sentencepiece","tensorboard","pyyaml",[21],"2026-03-27T02:49:30.150509","2026-04-11T17:40:52.301131",[106,111,116,121,126,131,136],{"id":107,"question_zh":108,"answer_zh":109,"source_url":110},29745,"如何在 LibriSpeech 数据集上训练模型？为什么我的模型无法收敛？","作者已确认该代码可以在 LibriSpeech 上运行。如果遇到无法收敛的问题，请尝试使用项目的 dev 分支（https:\u002F\u002Fgithub.com\u002FAlexander-H-Liu\u002FListen-Attend-and-Spell-Pytorch\u002Ftree\u002Fdev），该分支包含了针对 LibriSpeech 测试和确认的代码。注意，解码过程和文档可能仍在开发中。此外，作者最初是在中文语料库上进行测试的。","https:\u002F\u002Fgithub.com\u002FAlexander-H-Liu\u002FEnd-to-end-ASR-Pytorch\u002Fissues\u002F9",{"id":112,"question_zh":113,"answer_zh":114,"source_url":115},29746,"在 TIMIT、LibriSpeech 或 WSJ 数据集上训练模型需要多长时间？","根据维护者的回复：\n1. 
TIMIT 模型：在单张 1080Ti 显卡上大约需要 5~7 小时。\n2. LibriSpeech 模型：取决于所选的训练集大小，在 100 小时到 960 小时的数据集上，训练时间约为 1 天到 5 天。\n3. WSJ 模型：作者表示会尽力添加支持，建议查看最新的示例配置文件。","https:\u002F\u002Fgithub.com\u002FAlexander-H-Liu\u002FEnd-to-end-ASR-Pytorch\u002Fissues\u002F12",{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},29747,"运行时报错 'FileNotFoundError: No such file or directory: spm_train' 如何解决？","该错误通常意味着 SentencePiece 未正确安装或未添加到系统路径中。即使使用了 pip 安装 sentencepiece 包，如果终端无法识别 `spm_train` 命令，说明安装有问题。请确保 SentencePiece 二进制文件已正确安装并可在终端直接调用。如果问题依旧，可能需要检查环境变量或重新编译安装 SentencePiece。","https:\u002F\u002Fgithub.com\u002FAlexander-H-Liu\u002FEnd-to-end-ASR-Pytorch\u002Fissues\u002F25",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},29748,"测试模型时遇到缺少位置参数 'init_adadelta' 的错误怎么办？","这是一个已知问题，在最近的更新中测试脚本遗漏了该参数。解决方法是修改 `test_asr.py` 文件中 `Solver` 类的 `set_model` 方法，添加以下代码以正确初始化模型：\n\ninit_adadelta = self.config['hparas']['optimizer'] == 'Adadelta'\nself.model = ASR(self.feat_dim, self.vocab_size, init_adadelta, **self.config['model'])\n\n确保在实例化 ASR 模型时传入 `init_adadelta` 参数。","https:\u002F\u002Fgithub.com\u002FAlexander-H-Liu\u002FEnd-to-end-ASR-Pytorch\u002Fissues\u002F53",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},29749,"如何使用该代码对单个音频文件进行推理（识别）？","该仓库主要是一个研究框架，旨在预处理固定数据集并训练模型进行比较，因此默认不提供直接对单个音频文件进行推理的脚本。\n解决方案有两种：\n1. 自行编写脚本：利用仓库中已有的数据转换函数来实现单音频推理。\n2. 
变通方法：创建一个文件夹，其结构与训练数据类似，放入你的单个音频文件，运行 `preprocess_corpus.py` 对该文件夹进行预处理，然后对该文件夹运行测试脚本。","https:\u002F\u002Fgithub.com\u002FAlexander-H-Liu\u002FEnd-to-end-ASR-Pytorch\u002Fissues\u002F33",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},29750,"为什么目标字符串填充（padding）使用的是 SOS token (0) 而不是专门的填充 token？这样不会影响训练吗？","这是有意为之的设计。作者确实使用 0 同时作为 `\u003Csos>`（句首标记）、填充 token 以及 CTC 的 `\u003Cblank>` 标记。\n这样做是安全的，因为在定义损失函数时设置了 `ignore_index=0`：\ntorch.nn.CrossEntropyLoss(ignore_index=0, reduction='none')\n这意味着在计算损失时，索引为 0 的位置会被忽略，不会作为目标参与梯度更新。因此，基于注意力的解码器永远不会学习到输出 0，无论将 0 视为 `\u003Csos>` 还是填充标记都不会产生负面影响。","https:\u002F\u002Fgithub.com\u002FAlexander-H-Liu\u002FEnd-to-end-ASR-Pytorch\u002Fissues\u002F22",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},29751,"Attention 类中获取 context 向量的方式是否最优？是否有更高效的实现？","是的，有更优雅且高效的方式。根据论文公式，context 向量可以通过批量矩阵乘法（batch matrix multiplication）直接获得。\n原代码逻辑已被优化，建议使用 `torch.bmm(attention_score, listener_feature)` 来计算 context。如果 decoder_state 的维度是 `(batch_size, 1, state_dim)`，可以通过 `.squeeze(dim=1)` 调整维度。维护者确认已在项目重构中修复并采用了这种更高效的方法。","https:\u002F\u002Fgithub.com\u002FAlexander-H-Liu\u002FEnd-to-end-ASR-Pytorch\u002Fissues\u002F8",[]]