[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-mravanelli--pytorch-kaldi":3,"tool-mravanelli--pytorch-kaldi":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",151314,2,"2026-04-11T23:32:58",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
# The PyTorch-Kaldi Speech Recognition Toolkit
<img src="https://oss.gittoolsai.com/images/mravanelli_pytorch-kaldi_readme_29d0b7fd6555.png" width="220" align="left">
PyTorch-Kaldi is an open-source repository for developing state-of-the-art DNN/HMM speech recognition systems.
The DNN part is managed by PyTorch, while feature extraction, label computation, and decoding are performed with the Kaldi toolkit.

This repository contains the last version of the PyTorch-Kaldi toolkit (PyTorch-Kaldi-v1.0). To take a look at the previous version (PyTorch-Kaldi-v0.1), [click here](https://bitbucket.org/mravanelli/pytorch-kaldi-v0.0/src/master/).

If you use this code or part of it, please cite the following paper:

*M. Ravanelli, T. Parcollet, Y. Bengio, "The PyTorch-Kaldi Speech Recognition Toolkit", [arXiv](https://arxiv.org/abs/1811.07453)*

```
@inproceedings{pytorch-kaldi,
title    = {The PyTorch-Kaldi Speech Recognition Toolkit},
author    = {M. Ravanelli and T. Parcollet and Y. Bengio},
booktitle    = {In Proc. of ICASSP},
year    = {2019}
}
```

The toolkit is released under a **Creative Commons Attribution 4.0 International license**. You can copy, distribute, and modify the code for research, commercial, and non-commercial purposes. We only ask you to cite the paper referenced above.

To improve the transparency and replicability of speech recognition results, we give users the possibility to release their PyTorch-Kaldi models within this repository. Feel free to contact us (or to open a pull request) for that. Moreover, if your paper uses PyTorch-Kaldi, you can also advertise it in this repository.

[See a short introductory video on the PyTorch-Kaldi Toolkit](https://www.youtube.com/watch?v=VDQaf0SS4K0&t=2s)

## SpeechBrain
We are happy to announce that the SpeechBrain project (https://speechbrain.github.io/) is now public! We strongly encourage users to migrate to [SpeechBrain](https://speechbrain.github.io/). It is a much better project that already supports several speech processing tasks, such as speech recognition, speaker recognition, SLU, speech enhancement, speech separation, multi-microphone signal processing, and many others.

The goal is to develop a *single*, *flexible*, and *user-friendly* toolkit that can be used to easily develop state-of-the-art speech systems for speech recognition (both end-to-end and HMM-DNN), speaker recognition, speech separation, multi-microphone signal processing (e.g., beamforming), self-supervised learning, and many others.

The project will be led by Mila and is sponsored by Samsung, Nvidia, and Dolby.
SpeechBrain will also benefit from the collaboration and expertise of other companies such as Facebook/PyTorch, IBM Research, and FluentAI.

We are actively looking for collaborators. Feel free to contact us at speechbrainproject@gmail.com if you are interested in collaborating.

Thanks to our sponsors, we are also able to hire interns working at Mila on the SpeechBrain project. The ideal candidate is a PhD student with experience with PyTorch and speech technologies (send your CV to speechbrainproject@gmail.com).

The development of SpeechBrain will require some months before having a working repository.
Meanwhile, we will continue providing support for the pytorch-kaldi project.

Stay tuned!


## Table of Contents
* [Introduction](#introduction)
* [Prerequisites](#prerequisites)
* [How to install](#how-to-install)
* [Recent Updates](#recent-updates)
* [Tutorials:](#timit-tutorial)
  * [TIMIT tutorial](#timit-tutorial)
  * [Librispeech tutorial](#librispeech-tutorial)
* [Toolkit Overview:](#overview-of-the-toolkit-architecture)
  * [Toolkit architecture](#overview-of-the-toolkit-architecture)
  * [Configuration files](#description-of-the-configuration-files)
* [FAQs:](#how-can-i-plug-in-my-model)
  * [How can I plug-in my model?](#how-can-i-plug-in-my-model)
  * [How can I tune the hyperparameters?](#how-can-i-tune-the-hyperparameters)
  * [How can I use my own dataset?](#how-can-i-use-my-own-dataset)
  * [How can I plug-in my own features?](#how-can-i-plug-in-my-own-features)
  * [How can I transcribe my own audio files?](#how-can-i-transcribe-my-own-audio-files)
  * [Batch size, learning rate, and dropout scheduler](#batch-size-learning-rate-and-dropout-scheduler)
  * [How can I contribute to the project?](#how-can-i-contribute-to-the-project)
* [EXTRA:](#speech-recognition-from-the-raw-waveform-with-sincnet)
  * [Speech recognition from the raw waveform with SincNet](#speech-recognition-from-the-raw-waveform-with-sincnet)
  * [Joint training between speech enhancement and ASR](#joint-training-between-speech-enhancement-and-asr)
  * [Distant Speech Recognition with DIRHA](#distant-speech-recognition-with-dirha)
  * [Training an autoencoder](#training-an-autoencoder)
* [References](#references)


## Introduction
The PyTorch-Kaldi project aims to bridge the gap between the Kaldi and PyTorch toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these toolkits; it embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed to naturally plug in user-defined acoustic models. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to work properly both locally and on HPC clusters.

Some features of the new version of the PyTorch-Kaldi toolkit:

- Easy interface with Kaldi.
- Easy plug-in of user-defined models.
- Several pre-implemented models (MLP, CNN, RNN, LSTM, GRU, Li-GRU, SincNet).
- Natural implementation of complex models based on multiple features, labels, and neural architectures.
- Easy and flexible configuration files.
- Automatic recovery from the last processed chunk.
- Automatic chunking and context expansion of the input features.
- Multi-GPU training.
- Designed to work locally or on HPC clusters.
- Tutorials on the TIMIT and Librispeech datasets.

## Prerequisites
1. If not already done, install Kaldi (http://kaldi-asr.org/). As suggested during the installation, do not forget to add the path of the Kaldi binaries to $HOME/.bashrc.
For instance, make sure that .bashrc contains the following paths:
```
export KALDI_ROOT=/home/mirco/kaldi-trunk
PATH=$PATH:$KALDI_ROOT/tools/openfst
PATH=$PATH:$KALDI_ROOT/src/featbin
PATH=$PATH:$KALDI_ROOT/src/gmmbin
PATH=$PATH:$KALDI_ROOT/src/bin
PATH=$PATH:$KALDI_ROOT/src/nnetbin
export PATH
```
Remember to change the KALDI_ROOT variable to match your own path. As a first test of the installation, open a bash shell, type "copy-feats" or "hmm-info", and make sure no errors appear.

2. If not already done, install PyTorch (http://pytorch.org/). We tested our code on PyTorch 1.0 and PyTorch 0.4. An older version of PyTorch is likely to raise errors. To check your installation, type "python" and, once in the console, type "import torch" and make sure no errors appear.

3. We recommend running the code on a GPU machine. Make sure that the CUDA libraries (https://developer.nvidia.com/cuda-downloads) are installed and working correctly. We tested our system on CUDA 9.0, 9.1, and 8.0. Make sure that python is installed (the code is tested with python 2.7 and python 3.7). Even though not mandatory, we suggest using Anaconda (https://anaconda.org/anaconda/python).

## Recent updates

**19 Feb. 2019: updates:**
- It is now possible to dynamically change the batch size, learning rate, and dropout factors during training. We thus implemented a scheduler that supports the following formalism within the config files:
```
batch_size_train = 128*12 | 64*10 | 32*2
```
The line above means: do 12 epochs with a batch size of 128, 10 epochs with a batch size of 64, and 2 epochs with a batch size of 32. A similar formalism can be used for learning rate and dropout scheduling. [See this section for more information](#batch-size-learning-rate-and-dropout-scheduler).
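As an illustration, here is a minimal Python sketch (not the toolkit's actual parser) of how such a `value*epochs | value*epochs | ...` specification can be expanded into one value per epoch; the same expansion applies to learning rate and dropout schedules:
```
def expand_schedule(spec):
    """Expand e.g. '128*12 | 64*10 | 32*2' into one value per epoch."""
    per_epoch = []
    for block in spec.split("|"):
        value, n_epochs = block.strip().split("*")
        # values kept as floats so the same code works for lr and dropout
        per_epoch += [float(value)] * int(n_epochs)
    return per_epoch

batch_sizes = expand_schedule("128*12 | 64*10 | 32*2")
assert len(batch_sizes) == 24                            # 12 + 10 + 2 epochs
print(batch_sizes[0], batch_sizes[12], batch_sizes[-1])  # 128.0 64.0 32.0
```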
**5 Feb. 2019: updates:**
1. Our toolkit now supports parallel data loading (i.e., the next chunk is stored in memory while processing the current chunk). This allows a significant speed-up.
2. When performing monophone regularization, users can now set "dnn_lay = N_lab_out_mono". This way, the number of monophones is automatically inferred by our toolkit.
3. We integrated the kaldi-io toolkit from the [kaldi-io-for-python](https://github.com/vesis84/kaldi-io-for-python) project into data_io.py.
4. We provide a better hyperparameter setting for SincNet ([see this section](#speech-recognition-from-the-raw-waveform-with-sincnet)).
5. We released some baselines with the DIRHA dataset ([see this section](#distant-speech-recognition-with-dirha)). We also provide some configuration examples for a simple autoencoder ([see this section](#training-an-autoencoder)) and for a system that jointly trains a speech enhancement and a speech recognition module ([see this section](#joint-training-between-speech-enhancement-and-asr)).
6. We fixed some minor bugs.

**Notes on the next version:**
In the next version, we plan to further extend the functionality of our toolkit, supporting more models and feature formats. The goal is to make our toolkit suitable for other speech-related tasks such as end-to-end speech recognition, speaker identification, keyword spotting, speech separation, speech activity detection, speech enhancement, etc. If you would like to propose some novel functionality, please give us your feedback by [filling in this survey](https://docs.google.com/forms/d/12jd-QP5m8NAJVpiypvtVGy1n_d2iuWaLozXq5hsg4yA/edit?usp=sharing).



## How to install
To install PyTorch-Kaldi, follow these steps:

1. Make sure all the software recommended in the "Prerequisites" section is installed and working correctly.
2. Clone the PyTorch-Kaldi repository:
```
git clone https://github.com/mravanelli/pytorch-kaldi
```
3. Go into the project folder and install the needed packages with:
```
pip install -r requirements.txt
```


## TIMIT tutorial
In the following, we provide a short tutorial on the PyTorch-Kaldi toolkit based on the popular TIMIT dataset.

1. Make sure you have the TIMIT dataset. If not, it can be downloaded from the LDC website (https://catalog.ldc.upenn.edu/LDC93S1).

2. Make sure the Kaldi and PyTorch installations are fine. Also make sure that your Kaldi paths are working (you should add the Kaldi paths to the .bashrc as reported in the section "Prerequisites"). For instance, type "copy-feats" and "hmm-info" and make sure no errors appear.

3. Run the Kaldi s5 baseline of TIMIT. This step is necessary to compute the features and labels later used to train the PyTorch neural network. We recommend running the full TIMIT s5 recipe (including the DNN training):

```
cd kaldi/egs/timit/s5
./run.sh
./local/nnet/run_dnn.sh
```

This way, all the necessary files are created and the user can directly compare the results obtained by Kaldi with those achieved with our toolkit.

4. Compute the alignments (i.e., the phone-state labels) for test and dev data with the following commands (go into $KALDI_ROOT/egs/timit/s5). If you want to use tri3 alignments, type:
```
steps/align_fmllr.sh --nj 4 data/dev data/lang exp/tri3 exp/tri3_ali_dev

steps/align_fmllr.sh --nj 4 data/test data/lang exp/tri3 exp/tri3_ali_test
```

If you want to use dnn alignments (as suggested), type:
```
steps/nnet/align.sh --nj 4 data-fmllr-tri3/train data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali

steps/nnet/align.sh --nj 4 data-fmllr-tri3/dev data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_dev

steps/nnet/align.sh --nj 4 data-fmllr-tri3/test data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_test
```

5. We start this tutorial with a very simple MLP network trained on MFCC features. Before launching the experiment, take a look at the configuration file *cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg*. See the [Description of the configuration files](#description-of-the-configuration-files) for a detailed description of all its fields.

6. Change the config file according to your paths.
In particular:
- Set "fea_lst" to the path of your MFCC training list (which should be in $KALDI_ROOT/egs/timit/s5/data/train/feats.scp).
- Add your path (e.g., $KALDI_ROOT/egs/timit/s5/data/train/utt2spk) into "--utt2spk=ark:".
- Add your CMVN transformation (e.g., $KALDI_ROOT/egs/timit/s5/mfcc/cmvn_train.ark).
- Add the folder where the labels are stored (e.g., $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali for training and $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_dev for dev data).

To avoid errors, make sure that all the paths in the cfg file exist. **Please avoid using paths containing bash variables, since paths are read literally and are not automatically expanded** (e.g., use /home/mirco/kaldi-trunk/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali instead of $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali).

7. Run the ASR experiment:
```
python run_exp.py cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg
```

This script starts a full ASR experiment and performs the training, validation, forward, and decoding steps. A progress bar shows the evolution of all the aforementioned phases. The script *run_exp.py* progressively creates the following files in the output directory:

- *res.res*: a file that summarizes training and validation performance across the various epochs.
- *log.log*: a file that contains possible errors and warnings.
- *conf.cfg*: a copy of the configuration file.
- *model.svg*: a picture that shows the considered model and how the various neural networks are connected. This is really useful for debugging models that are more complex than this one (e.g., models based on multiple neural networks).
- The folder *exp_files* contains several files that summarize the evolution of training and validation over the various epochs. For instance, the *.info files report chunk-specific information such as the chunk loss, error, and training time. The *.cfg files are the chunk-specific configuration files (see the general architecture for more details), while the *.lst files report the list of features used to train each specific chunk.
- At the end of training, a directory called *generated outputs* containing plots of the loss and errors during the various training epochs is created.

**Note that you can stop the experiment at any time.** If you run the script again, it will automatically start from the last chunk correctly processed. The training could take a couple of hours, depending on the available GPU. Note also that if you would like to change some parameters of the configuration file (e.g., n_chunks=, fea_lst=, batch_size_train=, ...), you must specify a different output folder (output_folder=).

**Debug:** If you run into some errors, we suggest the following checks:
1. Take a look at the standard output.
2. If it is not helpful, take a look at the log.log file.
3. Take a look at the function run_nn in the core.py library. Add some prints in the various parts of the function to isolate the problem and figure out the issue.


8. At the end of training, the phone error rate (PER%) is appended to the res.res file. To see more details on the decoding results, you can go into "decoding_test" in the output folder and take a look at the various files created.
For this specific example, we obtained the following *res.res* file:


```
ep=000 tr=['TIMIT_tr'] loss=3.398 err=0.721 valid=TIMIT_dev loss=2.268 err=0.591 lr_architecture1=0.080000 time(s)=86
ep=001 tr=['TIMIT_tr'] loss=2.137 err=0.570 valid=TIMIT_dev loss=1.990 err=0.541 lr_architecture1=0.080000 time(s)=87
ep=002 tr=['TIMIT_tr'] loss=1.896 err=0.524 valid=TIMIT_dev loss=1.874 err=0.516 lr_architecture1=0.080000 time(s)=87
ep=003 tr=['TIMIT_tr'] loss=1.751 err=0.494 valid=TIMIT_dev loss=1.819 err=0.504 lr_architecture1=0.080000 time(s)=88
ep=004 tr=['TIMIT_tr'] loss=1.645 err=0.472 valid=TIMIT_dev loss=1.775 err=0.494 lr_architecture1=0.080000 time(s)=89
ep=005 tr=['TIMIT_tr'] loss=1.560 err=0.453 valid=TIMIT_dev loss=1.773 err=0.493 lr_architecture1=0.080000 time(s)=88
.........
ep=020 tr=['TIMIT_tr'] loss=0.968 err=0.304 valid=TIMIT_dev loss=1.648 err=0.446 lr_architecture1=0.002500 time(s)=89
ep=021 tr=['TIMIT_tr'] loss=0.965 err=0.304 valid=TIMIT_dev loss=1.649 err=0.446 lr_architecture1=0.002500 time(s)=90
ep=022 tr=['TIMIT_tr'] loss=0.960 err=0.302 valid=TIMIT_dev loss=1.652 err=0.447 lr_architecture1=0.001250 time(s)=88
ep=023 tr=['TIMIT_tr'] loss=0.959 err=0.301 valid=TIMIT_dev loss=1.651 err=0.446 lr_architecture1=0.000625 time(s)=88
%WER 18.1 | 192 7215 | 84.0 11.9 4.2 2.1 18.1 99.5 | -0.583 | /home/mirco/pytorch-kaldi-new/exp/TIMIT_MLP_basic5/decode_TIMIT_test_out_dnn1/score_6/ctm_39phn.filt.sys
```

The achieved PER(%) is 18.1%. Note that there could be some variability in the results, due to different initializations on different machines. We believe that averaging the performance obtained with different initialization seeds (i.e., change the field *seed* in the config file) is crucial for TIMIT, since the natural performance variability might completely hide the experimental evidence. We noticed a standard deviation of about 0.2% for the TIMIT experiments.
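If you want to post-process these results, here is a small sketch (a hypothetical helper, not part of the toolkit) that extracts the validation loss and error per epoch from such a file:
```
import re

def parse_res(path):
    """Return (epoch, valid_loss, valid_err) tuples from a res.res file."""
    rows = []
    # matches the epoch lines shown above; the final %WER line is skipped
    pattern = re.compile(r"ep=(\d+).*valid=\S+ loss=([\d.]+) err=([\d.]+)")
    with open(path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                rows.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return rows

history = parse_res("exp/TIMIT_MLP_basic5/res.res")  # path from the example cfg
best_ep, _, best_err = min(history, key=lambda r: r[2])
print("best validation error %.3f at epoch %d" % (best_err, best_ep))
```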
If you want to change the features, you first have to compute them with the Kaldi toolkit. To compute fbank features, you have to open *$KALDI_ROOT/egs/timit/s5/run.sh* and compute them with the following lines:
```
feadir=fbank

for x in train dev test; do
  steps/make_fbank.sh --cmd "$train_cmd" --nj $feats_nj data/$x exp/make_fbank/$x $feadir
  steps/compute_cmvn_stats.sh data/$x exp/make_fbank/$x $feadir
done
```

Then, change the aforementioned configuration file with the new feature list.
If you have already run the full TIMIT Kaldi recipe, you can directly find the fmllr features in *$KALDI_ROOT/egs/timit/s5/data-fmllr-tri3*.
If you feed the neural network with such features, you should expect a substantial performance improvement, due to the adoption of speaker adaptation.

In the TIMIT_baselines folder, we propose several other examples of possible TIMIT baselines. Similarly to the previous example, you can run them by simply typing:
```
python run_exp.py $cfg_file
```
There are some examples with recurrent (TIMIT_RNN*, TIMIT_LSTM*, TIMIT_GRU*, TIMIT_LiGRU*) and CNN architectures (TIMIT_CNN*). We also propose a more advanced model (TIMIT_DNN_liGRU_DNN_mfcc+fbank+fmllr.cfg) where we used a combination of feed-forward and recurrent neural networks fed by a concatenation of mfcc, fbank, and fmllr features. Note that the latter configuration file corresponds to the best architecture described in the reference paper. As you might see from the above-mentioned configuration files, we improve the ASR performance by including some tricks such as monophone regularization (i.e., we jointly estimate both context-dependent and context-independent targets). The following table reports the results obtained by running the latter systems (average PER%):

| Model  | mfcc | fbank | fMLLR |
| ------ | -----| ------| ------|
|  Kaldi DNN Baseline | -----| ------| 18.5 |
|  MLP  | 18.2 | 18.7 | 16.7 |
|  RNN  | 17.7 | 17.2 | 15.9 |
|  SRU  | -----| 16.6 | -----|
|LSTM| 15.1  | 14.3  |14.5  |
|GRU| 16.0 | 15.2|  14.9 |
|li-GRU| **15.5**  | **14.9**|  **14.2** |

Results show that, as expected, fMLLR features outperform MFCC and FBANK coefficients, thanks to the speaker adaptation process. Recurrent models significantly outperform the standard MLP, especially the LSTM, GRU, and Li-GRU architectures, which effectively address the vanishing gradient problem through multiplicative gates. The best result, *PER=14.2%*, is obtained with the [Li-GRU model](https://arxiv.org/pdf/1803.10225.pdf) [2,3], which is based on a single gate and thus saves 33% of the computations over a standard GRU.

The best results are actually obtained with a more complex architecture that combines MFCC, FBANK, and fMLLR features (see *cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg*). To the best of our knowledge, the **PER=13.8%** achieved by the latter system yields the best-published performance on the TIMIT test set.

The Simple Recurrent Unit (SRU) is an efficient and highly parallelizable recurrent model. Its ASR performance is worse than that of the standard LSTM, GRU, and Li-GRU models, but it is significantly faster. SRU is implemented [here](https://github.com/taolei87/sru) and described in the following paper:

T. Lei, Y. Zhang, S. I. Wang, H. Dai, Y. Artzi, "Simple Recurrent Units for Highly Parallelizable Recurrence", Proc. of EMNLP 2018. [arXiv](https://arxiv.org/pdf/1709.02755.pdf)

To run experiments with this model, use the config file *cfg/TIMIT_baselines/TIMIT_SRU_fbank.cfg*. Beforehand, you should install the model with ```pip install sru``` and uncomment "import sru" in *neural_networks.py*.


You can directly compare your results with ours by going [here](https://bitbucket.org/mravanelli/pytorch-kaldi-exp-timit/src/master/). In this external repository, you can find all the folders containing the generated files.

## Librispeech tutorial
The steps to run PyTorch-Kaldi on the Librispeech dataset are similar to those reported above for TIMIT. The following tutorial is based on the *100h subset*, but it can be easily extended to the full dataset (960h).

1. Run the Kaldi recipe for Librispeech at least until Stage 13 (included).
2. Copy the exp/tri4b/trans.* files into exp/tri4b/decode_tgsmall_train_clean_100/:
```
mkdir exp/tri4b/decode_tgsmall_train_clean_100 && cp exp/tri4b/trans.* exp/tri4b/decode_tgsmall_train_clean_100/
```
3. Compute the fmllr features by running the following script:

```
. ./cmd.sh ## You'll want to change cmd.sh to something that will work on your system.
. ./path.sh ## Source the tools/utils (import the queue.pl)

gmmdir=exp/tri4b

for chunk in train_clean_100 dev_clean test_clean; do
    dir=fmllr/$chunk
    steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
        --transform-dir $gmmdir/decode_tgsmall_$chunk \
            $dir data/$chunk $gmmdir $dir/log $dir/data || exit 1

    compute-cmvn-stats --spk2utt=ark:data/$chunk/spk2utt scp:fmllr/$chunk/feats.scp ark:$dir/data/cmvn_speaker.ark
done
```

4. Compute the alignments using:
```
# alignments on train_clean_100, dev_clean, and test_clean
steps/align_fmllr.sh --nj 30 data/train_clean_100 data/lang exp/tri4b exp/tri4b_ali_clean_100
steps/align_fmllr.sh --nj 10 data/dev_clean data/lang exp/tri4b exp/tri4b_ali_dev_clean_100
steps/align_fmllr.sh --nj 10 data/test_clean data/lang exp/tri4b exp/tri4b_ali_test_clean_100
```

5. Run the experiments with the following command:
```
  python run_exp.py cfg/Librispeech_baselines/libri_MLP_fmllr.cfg
```

If you would like to use a recurrent model, you can use *libri_RNN_fmllr.cfg*, *libri_LSTM_fmllr.cfg*, *libri_GRU_fmllr.cfg*, or *libri_liGRU_fmllr.cfg*. The training of recurrent models might take some days (depending on the adopted GPU). The performance obtained with the tgsmall graph is reported in the following table:

| Model  | WER% |
| ------ | -----|
|  MLP  |  9.6 |
|LSTM   |  8.6  |
|GRU     | 8.6 |
|li-GRU| 8.6 |

These results are obtained without lattice rescoring (i.e., using only the *tgsmall* graph). You can improve the performance by adding lattice rescoring in this way (run it from the *kaldi_decoding_script* folder of PyTorch-Kaldi):
```
data_dir=/data/milatmp1/ravanelm/librispeech/s5/data/
dec_dir=/u/ravanelm/pytorch-Kaldi-new/exp/libri_fmllr/decode_test_clean_out_dnn1/
out_dir=/u/ravanelm/pytorch-kaldi-new/exp/libri_fmllr/

steps/lmrescore_const_arpa.sh  $data_dir/lang_test_{tgsmall,fglarge} \
          $data_dir/test_clean $dec_dir $out_dir/decode_test_clean_fglarge   || exit 1;
```
The final results obtained using rescoring (*fglarge*) are reported in the following table:

| Model  | WER% |
| ------ | -----|
|  MLP  |  6.5 |
|LSTM   |  6.4  |
|GRU     | 6.3 |
|li-GRU| **6.2**  |


You can take a look at the results obtained [here](https://bitbucket.org/mravanelli/pytorch-kaldi-exp-librispeech/src/master/).




## Overview of the toolkit architecture
The main script to run an ASR experiment is **run_exp.py**. This python script performs the training, validation, forward, and decoding steps. Training is performed over several epochs that progressively process all the training material with the considered neural network.
After each training epoch, a validation step is performed to monitor the system performance on *held-out* data. At the end of training, the forward phase is performed by computing the posterior probabilities of the specified test dataset. The posterior probabilities are normalized by their priors (using a count file) and stored into an ark file.
A decoding step is then performed to retrieve the final sequence of words uttered by the speaker in the test sentences.

The *run_exp.py* script takes as input a global config file (e.g., *cfg/TIMIT_MLP_mfcc.cfg*) that specifies all the options needed to run a full experiment. The code *run_exp.py* calls another function, **run_nn** (see the core.py library), that performs training, validation, and forward operations on each chunk of data.
The function *run_nn* takes as input a chunk-specific config file (e.g., *exp/TIMIT_MLP_mfcc/exp_files/train_TIMIT_tr+TIMIT_dev_ep000_ck00.cfg*) that specifies all the parameters needed to run a single-chunk experiment. The run_nn function outputs some info files (e.g., *exp/TIMIT_MLP_mfcc/exp_files/train_TIMIT_tr+TIMIT_dev_ep000_ck00.info*) that summarize the losses and errors of the processed chunk.

The results are summarized in the *res.res* files, while errors and warnings are redirected to the *log.log* file.


## Description of the configuration files
There are two types of config files (global and chunk-specific cfg files). They are both in *INI* format and are read, processed, and modified with the *configparser* library of python.
The global file contains several sections that specify all the main steps of a speech recognition experiment (training, validation, forward, and decoding).
The structure of the config file is described in a prototype file (see for instance *proto/global.proto*) that not only lists all the required sections and fields but also specifies the type of each possible field. For instance, *N_ep=int(1,inf)* means that the field *N_ep* (i.e., the number of training epochs) must be an integer ranging from 1 to inf. Similarly, *lr=float(0,inf)* means that the lr field (i.e., the learning rate) must be a float ranging from 0 to inf. Any attempt to write a config file that does not comply with these specifications will raise an error.
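As an illustration, here is a minimal sketch (not the toolkit's actual validator) of how a value read with *configparser* can be checked against a spec such as *int(1,inf)*:
```
import configparser

def check_int_spec(spec, value):
    """Check a config value against a proto spec such as 'int(1,inf)'."""
    lo, hi = spec[len("int("):-1].split(",")
    v = int(value)
    if v < int(lo) or (hi != "inf" and v > int(hi)):
        raise ValueError("value %d outside allowed range [%s,%s]" % (v, lo, hi))
    return v

cfg = configparser.ConfigParser()
cfg.read("cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg")
n_ep = check_int_spec("int(1,inf)", cfg["exp"]["n_epochs_tr"])
```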
Let's now open a config file (e.g., *cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg*) and describe its main sections:

```
[cfg_proto]
cfg_proto = proto/global.proto
cfg_proto_chunk = proto/global_chunk.proto
```
The current version of the config file first specifies the paths of the global and chunk-specific prototype files in the section *[cfg_proto]*.

```
[exp]
cmd = 
run_nn_script = run_nn
out_folder = exp/TIMIT_MLP_basic5
seed = 1234
use_cuda = True
multi_gpu = False
save_gpumem = False
n_epochs_tr = 24
```
The section [exp] contains some important fields, such as the output folder (*out_folder*) and the path of the chunk-specific processing script *run_nn* (by default, this function should be implemented in the core.py library). The field *n_epochs_tr* specifies the number of training epochs. The other options (*use_cuda*, *multi_gpu*, and *save_gpumem*) can be enabled by the user. The field *cmd* can be used to append a command to run the script on an HPC cluster.

```
[dataset1]
data_name = TIMIT_tr
fea = fea_name=mfcc
    fea_lst=quick_test/data/train/feats_mfcc.scp
    fea_opts=apply-cmvn --utt2spk=ark:quick_test/data/train/utt2spk  ark:quick_test/mfcc/train_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
    cw_left=5
    cw_right=5

lab = lab_name=lab_cd
    lab_folder=quick_test/dnn4_pretrain-dbn_dnn_ali
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=quick_test/data/train/
    lab_graph=quick_test/graph

n_chunks = 5

[dataset2]
data_name = TIMIT_dev
fea = fea_name=mfcc
    fea_lst=quick_test/data/dev/feats_mfcc.scp
    fea_opts=apply-cmvn --utt2spk=ark:quick_test/data/dev/utt2spk  ark:quick_test/mfcc/dev_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
    cw_left=5
    cw_right=5

lab = lab_name=lab_cd
    lab_folder=quick_test/dnn4_pretrain-dbn_dnn_ali_dev
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=quick_test/data/dev/
    lab_graph=quick_test/graph
n_chunks = 1

[dataset3]
data_name = TIMIT_test
fea = fea_name=mfcc
    fea_lst=quick_test/data/test/feats_mfcc.scp
    fea_opts=apply-cmvn --utt2spk=ark:quick_test/data/test/utt2spk  ark:quick_test/mfcc/test_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
    cw_left=5
    cw_right=5

lab = lab_name=lab_cd
    lab_folder=quick_test/dnn4_pretrain-dbn_dnn_ali_test
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=quick_test/data/test/
    lab_graph=quick_test/graph

n_chunks = 1
```

The config file contains a number of sections (*[dataset1]*, *[dataset2]*, *[dataset3]*, ...) that describe all the corpora used for the ASR experiment. The fields of the *[dataset\*]* sections describe all the features and labels considered in the experiment.
The features, for instance, are specified in the field *fea:*, where *fea_name* contains the name given to the feature, *fea_lst* is the list of features (in the scp Kaldi format), *fea_opts* allows users to specify how to process the features (e.g., applying CMVN or adding the derivatives), while *cw_left* and *cw_right* set the characteristics of the context window (i.e., the number of left and right frames to append). Note that the current version of the PyTorch-Kaldi toolkit supports the definition of multiple feature streams. Indeed, as shown in *cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg*, multiple feature streams (e.g., mfcc, fbank, fmllr) can be employed. The sketch below illustrates what the context-window fields do.
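As a toy illustration (this is not the toolkit's internal code), a context window with *cw_left=5* and *cw_right=5* turns a (T, F) feature matrix into a (T, F*11) matrix:
```
import numpy as np

def splice(feats, cw_left, cw_right):
    """Append cw_left past and cw_right future frames to every frame."""
    T, F = feats.shape
    # replicate the edge frames so the first/last frames also get a full window
    padded = np.pad(feats, ((cw_left, cw_right), (0, 0)), mode="edge")
    window = cw_left + 1 + cw_right
    return np.stack([padded[t:t + window].reshape(-1) for t in range(T)])

mfcc = np.random.randn(100, 39).astype(np.float32)  # 100 frames, 39 dims
spliced = splice(mfcc, cw_left=5, cw_right=5)
assert spliced.shape == (100, 39 * 11)
```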
Similarly, the *lab* field contains some sub-fields. For instance, *lab_name* refers to the name given to the label, while *lab_folder* contains the folder where the alignments generated by the Kaldi recipe are stored. *lab_opts* allows the user to specify some options on the considered alignments. For example, *lab_opts="ali-to-pdf"* extracts standard context-dependent phone-state labels, while *lab_opts=ali-to-phones --per-frame=true* can be used to extract monophone targets. *lab_count_file* is used to specify the file that contains the counts of the considered phone states.
These counts are important in the forward phase, where the posterior probabilities computed by the neural network are divided by their priors. PyTorch-Kaldi allows users either to specify an external count file or to retrieve it automatically (using *lab_count_file=auto*). Users can also specify *lab_count_file=none* if the count file is not strictly needed, e.g., when the labels correspond to an output not used to generate the posterior probabilities used in the forward phase (see for instance the monophone targets in *cfg/TIMIT_baselines/TIMIT_MLP_mfcc.cfg*). *lab_data_folder*, instead, corresponds to the data folder created during the Kaldi data preparation. It contains several files, including the text file eventually used for the computation of the final WER. The last sub-field, *lab_graph*, is the path of the Kaldi graph used to generate the labels.

The full dataset is usually large and cannot fit into GPU/RAM memory. It should thus be split into several chunks. PyTorch-Kaldi automatically splits the dataset into the number of chunks specified in *n_chunks*. The number of chunks might depend on the specific dataset. In general, we suggest processing speech chunks of about 1 or 2 hours (depending on the available memory).

```
[data_use]
train_with = TIMIT_tr
valid_with = TIMIT_dev
forward_with = TIMIT_test
```

This section specifies how the data listed in the *[dataset\*]* sections are used within the *run_exp.py* script.
The first line means that we perform training with the data called *TIMIT_tr*. Note that this dataset name must appear in one of the dataset sections; otherwise, the config parser will raise an error. Similarly, the second and third lines specify the data used for the validation and forward phases, respectively.

```
[batches]
batch_size_train = 128
max_seq_length_train = 1000
increase_seq_length_train = False
start_seq_len_train = 100
multply_factor_seq_len_train = 2
batch_size_valid = 128
max_seq_length_valid = 1000
```

*batch_size_train* defines the number of training examples in the mini-batch. The field *max_seq_length_train* truncates sentences longer than the specified value. When training recurrent models on very long sentences, out-of-memory issues might arise. With this option, we allow users to mitigate such memory problems by truncating long sentences. Moreover, it is possible to progressively grow the maximum sentence length during training by setting *increase_seq_length_train=True*. If enabled, the training starts with the maximum sentence length specified in *start_seq_len_train* (e.g., *start_seq_len_train=100*). After each epoch, the maximum sentence length is multiplied by *multply_factor_seq_len_train* (e.g., *multply_factor_seq_len_train=2*).
We have observed that this simple strategy generally improves the system performance, since it encourages the model to first focus on short-term dependencies and learn longer-term ones only at a later stage.
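With the values above, the maximum sentence length would evolve as in this small sketch (assuming, as a simplification, that the growth is capped by *max_seq_length_train*):
```
def max_len_for_epoch(epoch, start_seq_len_train=100,
                      multply_factor_seq_len_train=2,
                      max_seq_length_train=1000):
    # length doubles each epoch until the configured maximum is reached
    return min(start_seq_len_train * multply_factor_seq_len_train ** epoch,
               max_seq_length_train)

print([max_len_for_epoch(e) for e in range(6)])
# [100, 200, 400, 800, 1000, 1000]
```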
Similarly, *batch_size_valid* and *max_seq_length_valid* specify the number of examples in the mini-batches and the maximum sentence length for the dev dataset.

```
[architecture1]
arch_name = MLP_layers1
arch_proto = proto/MLP.proto
arch_library = neural_networks
arch_class = MLP
arch_pretrain_file = none
arch_freeze = False
arch_seq_model = False
dnn_lay = 1024,1024,1024,1024,N_out_lab_cd
dnn_drop = 0.15,0.15,0.15,0.15,0.0
dnn_use_laynorm_inp = False
dnn_use_batchnorm_inp = False
dnn_use_batchnorm = True,True,True,True,False
dnn_use_laynorm = False,False,False,False,False
dnn_act = relu,relu,relu,relu,softmax
arch_lr = 0.08
arch_halving_factor = 0.5
arch_improvement_threshold = 0.001
arch_opt = sgd
opt_momentum = 0.0
opt_weight_decay = 0.0
opt_dampening = 0.0
opt_nesterov = False
```

The sections *[architecture\*]* are used to specify the architectures of the neural networks involved in the ASR experiments. The field *arch_name* specifies the name of the architecture. Since different neural networks can depend on different sets of hyperparameters, the user has to add the path of a proto file that contains the list of hyperparameters in the field *arch_proto*. For example, the prototype file for a standard MLP model contains the following fields:
```
[proto]
library=path
class=MLP
dnn_lay=str_list
dnn_drop=float_list(0.0,1.0)
dnn_use_laynorm_inp=bool
dnn_use_batchnorm_inp=bool
dnn_use_batchnorm=bool_list
dnn_use_laynorm=bool_list
dnn_act=str_list
```

Similarly to the other prototype files, each line defines a hyperparameter with its value type. All the hyperparameters defined in the proto file must appear in the global configuration file under the corresponding *[architecture\*]* section.
The field *arch_library* specifies where the model is coded (e.g., *neural_networks.py*), while *arch_class* indicates the name of the class where the architecture is implemented (e.g., if we set *arch_class=MLP*, we will do *from neural_networks import MLP*).

The field *arch_pretrain_file* can be used to pre-train the neural network with a previously trained architecture, while *arch_freeze* should be set to *False* if you want to update the parameters of the architecture during training and to *True* to keep the parameters fixed (i.e., frozen) during training. The field *arch_seq_model* indicates whether the architecture is sequential (e.g., RNNs) or non-sequential (e.g., a feed-forward MLP or CNN). The way PyTorch-Kaldi processes the input batches is different in the two cases. For recurrent neural networks (*arch_seq_model=True*), the sequence of features is not randomized (to preserve the order of the elements of each sequence), while for feed-forward models (*arch_seq_model=False*) we randomize the features (this usually helps to improve the performance). In the case of multiple architectures, sequential processing is used if at least one of the employed architectures is marked as sequential (*arch_seq_model=True*); the snippet below illustrates the two input shapes.
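Illustrative shapes only (the feature dimensionality here assumes 39-dimensional MFCCs with an 11-frame context window):
```
import torch

time_steps, batches, n_in = 200, 8, 429          # 429 = 39 * 11
x_seq = torch.randn(time_steps, batches, n_in)   # input of a sequential model (arch_seq_model=True)
x_ff = x_seq.reshape(-1, n_in)                   # flattened input of a feed-forward model
print(x_seq.shape, x_ff.shape)                   # (200, 8, 429) and (1600, 429)
```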
Note that the hyperparameters starting with "arch_" and "opt_" are mandatory and must be present for every architecture specified in the config file. The other hyperparameters (e.g., dnn_*) are specific to the considered architecture (they depend on how the class MLP is actually implemented by the user) and can define the number and typology of hidden layers, batch and layer normalizations, and other parameters.
Other important parameters are related to the optimization of the considered architecture. For instance, *arch_lr* is the learning rate, while *arch_halving_factor* is used to implement learning rate annealing. In particular, when the relative performance improvement on the dev set between two consecutive epochs is smaller than the value specified in *arch_improvement_threshold* (e.g., *arch_improvement_threshold=0.001*), we multiply the learning rate by *arch_halving_factor* (e.g., *arch_halving_factor=0.5*). The field *arch_opt* specifies the type of optimization algorithm. We currently support SGD, Adam, and RMSprop. The other parameters are specific to the considered optimization algorithm (see the PyTorch documentation for the exact meaning of all the optimization-specific hyperparameters).
Note that the different architectures defined in *[architecture\*]* can have different optimization hyperparameters and can even use different optimization algorithms.

```
[model]
model_proto = proto/model.proto
model = out_dnn1=compute(MLP_layers1,mfcc)
    loss_final=cost_nll(out_dnn1,lab_cd)
    err_final=cost_err(out_dnn1,lab_cd)
```

The way all the various features and architectures are combined is specified in this section with a very simple and intuitive meta-language.
The field *model:* describes how features and architectures are connected to generate a set of posterior probabilities as output. The line *out_dnn1=compute(MLP_layers1,mfcc)* means "*feed the architecture called MLP_layers1 with the features called mfcc and store the output in the variable out_dnn1*".
From the neural network output *out_dnn1*, the error and the loss functions are computed using the labels called *lab_cd*, which have to be previously defined in the *[dataset\*]* sections. The *err_final* and *loss_final* fields are mandatory sub-fields that define the final output of the model.

A much more complex example (discussed here just to highlight the potential of the toolkit) is reported in *cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg*:
```
[model]
model_proto=proto/model.proto
model:conc1=concatenate(mfcc,fbank)
      conc2=concatenate(conc1,fmllr)
      out_dnn1=compute(MLP_layers_first,conc2)
      out_dnn2=compute(liGRU_layers,out_dnn1)
      out_dnn3=compute(MLP_layers_second,out_dnn2)
      out_dnn4=compute(MLP_layers_last,out_dnn3)
      out_dnn5=compute(MLP_layers_last2,out_dnn3)
      loss_mono=cost_nll(out_dnn5,lab_mono)
      loss_mono_w=mult_constant(loss_mono,1.0)
      loss_cd=cost_nll(out_dnn4,lab_cd)
      loss_final=sum(loss_cd,loss_mono_w)
      err_final=cost_err(out_dnn4,lab_cd)
```
In this case, we first concatenate the mfcc, fbank, and fmllr features and then feed an MLP. The output of the MLP is fed into a recurrent neural network (specifically a Li-GRU model). We then have another MLP layer (*MLP_layers_second*) followed by two softmax classifiers (i.e., *MLP_layers_last* and *MLP_layers_last2*). The first one estimates standard context-dependent states, while the second one estimates monophone targets. The final cost function is a weighted sum of these two predictions. In this way, we implement monophone regularization, which turned out to be useful for improving the ASR performance.
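To make the meta-language concrete, here is a rough PyTorch rendering of the graph above (illustrative only: the `nets` dictionary stands in for the modules the config instantiates, and the network outputs are assumed to be log-probabilities):
```
import torch
import torch.nn.functional as F

def forward_pass(nets, mfcc, fbank, fmllr, lab_cd, lab_mono):
    # conc1=concatenate(mfcc,fbank); conc2=concatenate(conc1,fmllr)
    conc2 = torch.cat([mfcc, fbank, fmllr], dim=-1)
    out_dnn1 = nets["MLP_layers_first"](conc2)
    out_dnn2 = nets["liGRU_layers"](out_dnn1)
    out_dnn3 = nets["MLP_layers_second"](out_dnn2)
    out_dnn4 = nets["MLP_layers_last"](out_dnn3)    # log-probs over cd states
    out_dnn5 = nets["MLP_layers_last2"](out_dnn3)   # log-probs over monophones
    loss_cd = F.nll_loss(out_dnn4, lab_cd)          # loss_cd=cost_nll(out_dnn4,lab_cd)
    loss_mono = F.nll_loss(out_dnn5, lab_mono)      # loss_mono=cost_nll(out_dnn5,lab_mono)
    loss_final = loss_cd + 1.0 * loss_mono          # loss_final=sum(loss_cd,loss_mono_w)
    err_final = (out_dnn4.argmax(dim=-1) != lab_cd).float().mean()
    return loss_final, err_final
```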
The full model can be considered as a single big computational graph, where all the basic architectures used in the [model] section are jointly trained. For each mini-batch, the input features are propagated through the full model, and the final cost is computed using the specified labels. The gradient of the cost function with respect to all the learnable parameters of the architecture is then computed. All the parameters of the employed architectures are then updated together using the algorithm specified in the *[architecture\*]* sections.
```
[forward]
forward_out = out_dnn1
normalize_posteriors = True
normalize_with_counts_from = lab_cd
save_out_file = True
require_decoding = True
```

The forward section first defines which output to forward (it must be defined in the model section). If *normalize_posteriors=True*, these posteriors are normalized by their priors (using a count file). If *save_out_file=True*, the posterior file (usually a very big ark file) is stored, while if *save_out_file=False*, this file is deleted when no longer needed.
*require_decoding* is a boolean that specifies whether the specified output needs to be decoded. The field *normalize_with_counts_from* sets which counts are used to normalize the posterior probabilities.

```
[decoding]
decoding_script_folder = kaldi_decoding_scripts/
decoding_script = decode_dnn.sh
decoding_proto = proto/decoding.proto
min_active = 200
max_active = 7000
max_mem = 50000000
beam = 13.0
latbeam = 8.0
acwt = 0.2
max_arcs = -1
skip_scoring = false
scoring_script = local/score.sh
scoring_opts = "--min-lmwt 1 --max-lmwt 10"
norm_vars = False
```

The decoding section reports the decoding parameters, i.e., for the step that converts the sequence of context-dependent probabilities provided by the DNN into a sequence of words. The field *decoding_script_folder* specifies the folder where the decoding script is stored. The *decoding_script* field is the script used for decoding (e.g., *decode_dnn.sh*), which should be in the *decoding_script_folder* specified before. The field *decoding_proto* reports all the parameters needed by the considered decoding script.

To make the code more flexible, the config parameters can also be specified on the command line. For example, you can run:
```
 python run_exp.py quick_test/example_newcode.cfg --optimization,lr=0.01 --batches,batch_size=4
```
The script will replace the learning rate in the specified cfg file with the specified lr value. The modified config file is then stored in *out_folder/config.cfg*.

The script *run_exp.py* automatically creates chunk-specific config files that are used by the *run_nn* function to perform single-chunk training. The structure of chunk-specific cfg files is very similar to that of the global one. The main difference is a field *to_do={train, valid, forward}* that specifies the type of processing to perform on the feature chunk specified in the field *fea*.

*Why proto files?*
Different neural networks, optimization algorithms, and HMM decoders might depend on different sets of hyperparameters. To address this issue, our current solution is based on the definition of some prototype files (for the global, chunk, and architecture config files).
In general, this approach allows a more transparent check of the fields specified in the global config file. Moreover, it allows users to easily add new parameters without changing any line of the python code. For instance, to add a user-defined model, a new proto file (e.g., *user-model.proto*) that specifies the hyperparameters must be written. Then, the user only has to write a class (e.g., user-model in *neural_networks.py*) that implements the architecture.

## FAQs
## How can I plug-in my model
The toolkit is designed to allow users to easily plug in their own acoustic models. To add a customized neural model, do the following steps:
1. Go into the proto folder and create a new proto file (e.g., *proto/myDNN.proto*). The proto file is used to specify the list of hyperparameters of your model that will later be set in the configuration file. To get an idea of the information to add to your proto file, you can take a look at the *MLP.proto* file:

```
[proto]
dnn_lay=str_list
dnn_drop=float_list(0.0,1.0)
dnn_use_laynorm_inp=bool
dnn_use_batchnorm_inp=bool
dnn_use_batchnorm=bool_list
dnn_use_laynorm=bool_list
dnn_act=str_list
```
2. The parameter *dnn_lay* must be a list of strings, *dnn_drop* (i.e., the dropout factor of each layer) is a list of floats ranging from 0.0 to 1.0, and *dnn_use_laynorm_inp* and *dnn_use_batchnorm_inp* are booleans that enable or disable layer or batch normalization of the input. *dnn_use_batchnorm* and *dnn_use_laynorm* are lists of booleans that decide, layer by layer, whether batch/layer normalization has to be used.
The parameter *dnn_act* is again a list of strings that sets the activation function of each layer. Since every model is based on its own set of hyperparameters, different models have different prototype files. For instance, you can take a look at *GRU.proto* and see that the hyperparameter list is different from that of a standard MLP. Similarly to the previous examples, you should add your list of hyperparameters here and save the file.

3. Write a PyTorch class implementing your model.
Open the library *neural_networks.py* and look at some of the models already implemented. For simplicity, you can start by taking a look at the class MLP. The classes have two mandatory methods: `__init__` and **forward**. The first one is used to initialize the architecture, the second specifies the list of computations to perform.
The method `__init__` takes as input two variables that are automatically computed within the *run_nn* function. **inp_dim** is simply the dimensionality of the neural network input, while **options** is a dictionary containing all the parameters specified in the *[architecture\*]* section of the configuration file.
For instance, you can access the hyperparameters of the various layers in this way:
```options['dnn_lay'].split(',')```.
As you can see from the MLP class, the initialization method defines and initializes all the parameters of the neural network. The forward method takes as input a tensor **x** (i.e., the input data) and outputs the tensor resulting from the computations. If your model is a sequence model (i.e., if there is at least one architecture with arch_seq_model=true in the cfg file), x is a tensor with shape (time_steps, batches, N_in); otherwise, it is a (batches, N_in) matrix. The method **forward** defines the list of computations that transform the input tensor into a corresponding output tensor. The output must have the sequential format (time_steps, batches, N_out) for recurrent models and the non-sequential format (batches, N_out) for feed-forward models.
Similarly to the already-implemented models, the user should write a new class (e.g., myDNN) that implements the customized model:
```
class myDNN(nn.Module):

    def __init__(self, options, inp_dim):
        super(myDNN, self).__init__()
        # initialize the parameters here

    def forward(self, x):
        # do some computations: out = f(x)
        return out
```
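Below is a more complete, self-contained sketch under the conventions just described. The layer sizes are illustrative; `options` values arrive as comma-separated strings from the cfg file, placeholders such as N_out_lab_cd are assumed to have already been resolved to integers, and exposing an *out_dim* attribute follows the pattern of the models in *neural_networks.py* (an assumption, not a documented requirement):
```
import torch.nn as nn
import torch.nn.functional as F

class myDNN(nn.Module):

    def __init__(self, options, inp_dim):
        super(myDNN, self).__init__()
        # hyperparameters arrive as comma-separated strings from the cfg file
        hidden = [int(d) for d in options["dnn_lay"].split(",")]
        dims = [inp_dim] + hidden
        self.layers = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(len(hidden))
        )
        self.out_dim = hidden[-1]  # assumed convention: expose output size

    def forward(self, x):
        # x: (batches, N_in), since this sketch is a feed-forward model
        # (arch_seq_model=False)
        for layer in self.layers[:-1]:
            x = F.relu(layer(x))
        return F.log_softmax(self.layers[-1](x), dim=-1)
```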
4. Create a configuration file.
Now that you have defined your model and the list of its hyperparameters, you can create a configuration file. To create your own configuration file, you can take a look at an existing config file (e.g., for simplicity, consider *cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg*). After defining the adopted datasets with their related features and labels, the configuration file has some sections called *[architecture\*]*. Each architecture implements a different neural network. In *cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg* we only have *[architecture1]*, since the acoustic model is composed of a single neural network. To add your own neural network, you have to write an architecture section (e.g., *[architecture1]*) in the following way:
```
[architecture1]
arch_name= mynetwork (this is the name you would like to use to refer to this architecture within the following model section)
arch_proto=proto/myDNN.proto (here is the name of the proto file defined before)
arch_library=neural_networks (this is the name of the library where myDNN is implemented)
arch_class=myDNN (this must be the name of the class you have implemented)
arch_pretrain_file=none (with this you can specify whether you want to pre-train your model)
arch_freeze=False (set False if you want to update the parameters of your model)
arch_seq_model=False (set False for feed-forward models, True for recurrent models)
```
Then, you have to specify proper values for all the hyperparameters specified in *proto/myDNN.proto*. For *MLP.proto*, we have:
```
dnn_lay=1024,1024,1024,1024,1024,N_out_lab_cd
dnn_drop=0.15,0.15,0.15,0.15,0.15,0.0
dnn_use_laynorm_inp=False
dnn_use_batchnorm_inp=False
dnn_use_batchnorm=True,True,True,True,True,False
dnn_use_laynorm=False,False,False,False,False,False
dnn_act=relu,relu,relu,relu,relu,softmax
```
Then, add the following parameters related to the optimization of your own architecture. You can use standard sgd, adam, or rmsprop here (see cfg/TIMIT_baselines/TIMIT_LSTM_mfcc.cfg for an example with rmsprop):
```
arch_lr=0.08
arch_halving_factor=0.5
arch_improvement_threshold=0.001
arch_opt=sgd
opt_momentum=0.0
opt_weight_decay=0.0
opt_dampening=0.0
opt_nesterov=False
```

5. Save the configuration file into the cfg folder (e.g., *cfg/myDNN_exp.cfg*).

6. Run the experiment with:
```
python run_exp.py cfg/myDNN_exp.cfg
```

7. To debug the model, you can first take a look at the standard output. The config file is automatically parsed by *run_exp.py*, which raises errors in case of possible problems. You can also take a look at the *log.log* file to see additional information on the possible errors.


When implementing a new model, an important debug test consists of running an overfitting experiment (to make sure that the model is able to overfit a tiny dataset). If the model is not able to overfit, there is a major bug to solve.

8. Hyperparameter tuning.
In deep learning, it is often important to play with the hyperparameters to find the proper setting for your model. This activity is usually computationally intensive and time-consuming, but it is often necessary when introducing new architectures. To help with hyperparameter tuning, we developed a utility that implements a random search over the hyperparameters (see the next section for more details).


## How can I tune the hyperparameters
Hyperparameter tuning is often needed in deep learning to search for proper neural architectures. To help with tuning the hyperparameters within PyTorch-Kaldi, we have implemented a simple utility that performs a random search. In particular, the script *tune_hyperparameters.py* generates a set of random configuration files and can be run in this way:
```
python tune_hyperparameters.py cfg/TIMIT_MLP_mfcc.cfg exp/TIMIT_MLP_mfcc_tuning 10 arch_lr=randfloat(0.001,0.01) batch_size_train=randint(32,256) dnn_act=choose_str{relu,relu,relu,relu,softmax|tanh,tanh,tanh,tanh,softmax}
```
The first parameter is the reference cfg file that we would like to modify, while the second one is the folder where the random configuration files are saved. The third parameter is the number of random config files that we would like to generate. Then comes the list of all the hyperparameters that we want to change. For instance, *arch_lr=randfloat(0.001,0.01)* will replace the field *arch_lr* with a random float ranging from 0.001 to 0.01, *batch_size_train=randint(32,256)* will replace *batch_size_train* with a random integer between 32 and 256, and so on.
Once the config files are created, they can be run sequentially or in parallel with:
```
python run_exp.py $cfg_file
```
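For illustration, a hypothetical sampler (not the actual code of *tune_hyperparameters.py*) for the three spec formats shown above could look like this:
```
import random
import re

def sample_spec(spec):
    """Sample one value from randfloat(a,b), randint(a,b), or choose_str{...|...}."""
    m = re.fullmatch(r"randfloat\(([^,]+),([^)]+)\)", spec)
    if m:
        return random.uniform(float(m.group(1)), float(m.group(2)))
    m = re.fullmatch(r"randint\(([^,]+),([^)]+)\)", spec)
    if m:
        return random.randint(int(m.group(1)), int(m.group(2)))
    m = re.fullmatch(r"choose_str\{(.+)\}", spec)
    if m:
        return random.choice(m.group(1).split("|"))
    raise ValueError("unknown spec: " + spec)

print(sample_spec("randfloat(0.001,0.01)"))
print(sample_spec("randint(32,256)"))
print(sample_spec("choose_str{relu,relu,softmax|tanh,tanh,softmax}"))
```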
If the model is not able to overfit, it means that there is a major bug to solve.\n\n8. Hyperparameter tuning.\nIn deep learning, it is often important to play with the hyperparameters to find the proper setting for your model. This activity is usually computationally expensive and time-consuming but is often necessary when introducing new architectures. To help with hyperparameter tuning, we developed a utility that implements a random search over the hyperparameters (see the next section for more details).\n\n\n## How can I tune the hyperparameters\nHyperparameter tuning is often needed in deep learning to search for proper neural architectures. To help tune the hyperparameters within PyTorch-Kaldi, we have implemented a simple utility that performs a random search. In particular, the script *tune_hyperparameters.py* generates a set of random configuration files and can be run in this way:\n```\npython tune_hyperparameters.py cfg\u002FTIMIT_MLP_mfcc.cfg exp\u002FTIMIT_MLP_mfcc_tuning 10 arch_lr=randfloat(0.001,0.01) batch_size_train=randint(32,256) dnn_act=choose_str{relu,relu,relu,relu,softmax|tanh,tanh,tanh,tanh,softmax}\n```\nThe first parameter is the reference cfg file that we would like to modify, while the second one is the folder where the random configuration files are saved. The third parameter is the number of random config files that we would like to generate. Then comes the list of all the hyperparameters that we want to change. For instance, *arch_lr=randfloat(0.001,0.01)* will replace the field *arch_lr* with a random float ranging from 0.001 to 0.01, *batch_size_train=randint(32,256)* will replace *batch_size_train* with a random integer between 32 and 256, and so on.\nOnce the config files are created, they can be run sequentially or in parallel with:\n```\npython run_exp.py $cfg_file\n```\n\n## How can I use my own dataset\nPyTorch-Kaldi can be used with any speech dataset. To use your own dataset, the steps to take are similar to those discussed in the TIMIT\u002FLibrispeech tutorials. In general, what you have to do is the following:\n1. Run the Kaldi recipe with your dataset. Please see the Kaldi website for more information on how to perform data preparation.\n2. Compute the alignments on training, validation, and test data.\n3. Write a PyTorch-Kaldi config file *$cfg_file*.\n4. Run the config file with ```python run_exp.py $cfg_file```.\n\n## How can I plug-in my own features\nThe current version of PyTorch-Kaldi supports input features stored in the Kaldi ark format. If the user wants to perform experiments with customized features, the latter must be converted into the ark format. Take a look into the Kaldi-io-for-python git repository (https:\u002F\u002Fgithub.com\u002Fvesis84\u002Fkaldi-io-for-python) for a detailed description of how to convert numpy arrays into ark files, as sketched below. \nMoreover, you can take a look into our utility called save_raw_fea.py. This script generates Kaldi ark files containing raw features, which are later used to train neural networks fed directly by the raw waveform (see the section about processing audio with SincNet).
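Here is the minimal numpy-to-ark sketch referenced above, assuming the *kaldi_io* module from the Kaldi-io-for-python repository is installed (the utterance id, matrix shape, and file name are illustrative):\n```\nimport numpy as np\nimport kaldi_io\n\n# Write per-utterance feature matrices (frames x feature_dim) to a Kaldi ark file.\nfeats = {'utt1': np.random.randn(300, 40).astype(np.float32)}\nwith open('custom_fea.ark', 'wb') as f:\n    for key, mat in feats.items():\n        kaldi_io.write_mat(f, mat, key=key)\n```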
## How can I transcript my own audio files\nThe current version of PyTorch-Kaldi supports the standard production process of using a pre-trained PyTorch-Kaldi acoustic model to transcribe one or multiple .wav files. It is important to understand that you must have a trained PyTorch-Kaldi model. While you don't need labels or alignments anymore, PyTorch-Kaldi still needs several files to transcribe a new audio file:\n1. The features and the feature list *feats.scp* (with .ark files, see #how-can-i-plug-in-my-own-features)\n2. The decoding graph (usually created with mkgraph.sh during a previous model training stage, such as triphone models). This graph is not needed if you're not decoding.\n\nOnce you have all these files, you can start adding your dataset section to the global configuration file. The easiest way is to copy the *cfg* file used to train your acoustic model and modify it by adding a new *[dataset]*:\n```\n[dataset4]\ndata_name = myWavFile\nfea = fea_name=fbank\n  fea_lst=myWavFilePath\u002Fdata\u002Ffeats.scp\n  fea_opts=apply-cmvn --utt2spk=ark:myWavFilePath\u002Fdata\u002Futt2spk  ark:myWavFilePath\u002Fcmvn_test.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |\n  cw_left=5\n  cw_right=5\n\nlab = lab_name=none\n  lab_data_folder=myWavFilePath\u002Fdata\u002F\n  lab_graph=myWavFilePath\u002Fexp\u002Ftri3\u002Fgraph\nn_chunks=1\n\n[data_use]\ntrain_with = TIMIT_tr\nvalid_with = TIMIT_dev\nforward_with = myWavFile\n```\nThe key string for your audio file transcription is *lab_name=none*. The *none* tag asks PyTorch-Kaldi to enter a *production* mode that only does the forward propagation and decoding, without any labels. You don't need TIMIT_tr and TIMIT_dev to be on your production server, since PyTorch-Kaldi will skip them and go directly to the forward phase of the dataset given in the *forward_with* field. As you can see, the global *fea* field requires the exact same parameters as a standard training or testing dataset, while the *lab* field only requires two parameters. Note that *lab_data_folder* is simply the same path as the one used in *fea_lst*. Finally, you still need to specify the number of chunks you want to create to process this file (1 hour = 1 chunk).\u003Cbr \u002F> \n**WARNINGS** \u003Cbr \u002F>\nIn your standard .cfg, you might have used keywords such as *N_out_lab_cd* that cannot be used anymore. Indeed, in a production scenario, you don't want to have the training data on your machine. Therefore, all the *variables* that were in your .cfg file must be replaced by their true values. To replace all the *N_out_{mono,lab_cd}* variables, you can take a look at the output of:\n```\nhmm-info \u002Fpath\u002Fto\u002Fthe\u002Ffinal.mdl\u002Fused\u002Fto\u002Fgenerate\u002Fthe\u002Ftraining\u002Fali\n```\nThen, if you normalize posteriors (check the forward section of your .cfg) as:\n```\nnormalize_posteriors = True\nnormalize_with_counts_from = lab_cd\n```\nyou must replace *lab_cd* with:\n```\nnormalize_posteriors = True\nnormalize_with_counts_from = \u002Fpath\u002Fto\u002Fali_train_pdf.counts\n```\nThis normalization step is crucial for HMM-DNN speech recognition. DNNs, in fact, provide posterior probabilities, while HMMs are generative models that work with likelihoods. To derive the required likelihoods, one can simply divide the posteriors by the prior probabilities. To create this *ali_train_pdf.counts* file, you can do the following:\n```\nalidir=\u002Fpath\u002Fto\u002Fthe\u002Fexp\u002Ftri_ali (change it to your path to the exp with the alignments)\nnum_pdf=$(hmm-info $alidir\u002Ffinal.mdl | awk '\u002Fpdfs\u002F{print $4}')\nlabels_tr_pdf=\"ark:ali-to-pdf $alidir\u002Ffinal.mdl \\\"ark:gunzip -c $alidir\u002Fali.*.gz |\\\" ark:- |\"\nanalyze-counts --verbose=1 --binary=false --counts-dim=$num_pdf \"$labels_tr_pdf\" ali_train_pdf.counts\n```\net voilà ! 
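In a nutshell, the division by the priors happens in the log domain and amounts to the following arithmetic (a toy numpy illustration of the formula, not the toolkit's actual code; the random counts vector stands in for *ali_train_pdf.counts*):\n```\nimport numpy as np\n\nrng = np.random.default_rng(0)\nn_frames, n_pdf = 5, 4\nposteriors = rng.dirichlet(np.ones(n_pdf), size=n_frames)  # rows sum to 1, like softmax outputs\ncounts = rng.integers(1, 1000, size=n_pdf).astype(float)   # per-pdf frame counts from the alignments\npriors = counts \u002F counts.sum()                             # prior probability of each pdf state\nlog_likelihoods = np.log(posteriors) - np.log(priors)      # log p(x|s) = log p(s|x) - log p(s) + const\nprint(log_likelihoods.shape)                               # (5, 4), passed to the HMM decoder\n```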
In a production scenario, you might need to transcribe a huge number of audio files, and you don't want to create as many .cfg files as there are recordings. To this end, after creating the initial production .cfg file (you can leave the paths blank), you can call the run_exp.py script with specific arguments referring to your different .wav features:\n```\npython run_exp.py cfg\u002FTIMIT_baselines\u002FTIMIT_MLP_fbank_prod.cfg --dataset4,fea,0,fea_lst=\"myWavFilePath\u002Fdata\u002Ffeats.scp\" --dataset4,lab,0,lab_data_folder=\"myWavFilePath\u002Fdata\u002F\" --dataset4,lab,0,lab_graph=\"myWavFilePath\u002Fexp\u002Ftri3\u002Fgraph\u002F\"\n```\n\nThis command will internally alter the configuration file with your specified paths and run on your defined features. Note that passing long arguments to the run_exp.py script requires a specific notation. *--dataset4* specifies the name of the created section, *fea* is the name of the higher-level field, and *fea_lst* or *lab_graph* is the name of the lowest-level field you want to change. The *0* indicates which occurrence of the lowest-level field you want to alter; indeed, some configuration files may contain multiple *lab_graph* entries per dataset! Therefore, *0* indicates the first occurrence, *1* the second, and so on. Paths MUST be encapsulated in \" \" to be interpreted as full strings! Note that you need to alter the *data_name* and *forward_with* fields if you don't want the transcriptions of different .wav files to erase each other (decoding files are stored according to the *data_name* field): ``` --dataset4,data_name=MyNewName --data_use,forward_with=MyNewName ```.\n\n## Batch size, learning rate, and dropout scheduler\nIn order to give users more flexibility, the latest version of PyTorch-Kaldi supports scheduling of the batch size, max_seq_length_train, learning rate, and dropout factor.\nThis means that it is now possible to change these values during training. To support this feature, we implemented the following formalism within the config files:\n```\nbatch_size_train = 128*12 | 64*10 | 32*2\n```\nIn this case, our batch size will be 128 for the first 12 epochs, 64 for the following 10 epochs, and 32 for the last two epochs. Here \"*\" means \"for N times\", while \"|\" is used to indicate a change of the batch size. Note that if the user simply sets ```batch_size_train = 128```, the batch size is kept fixed during all the training epochs by default.\n\nA similar formalism can be used to perform learning rate scheduling:\n```\narch_lr = 0.08*10|0.04*5|0.02*3|0.01*2|0.005*2|0.0025*2\n```\nIf the user simply sets ```arch_lr = 0.08```, the learning rate is instead annealed with the *new-bob* procedure used in the previous version of the toolkit: we start from the specified learning rate and multiply it by a halving factor every time the improvement on the validation dataset is smaller than the threshold specified in the field *arch_improvement_threshold*. \n\nThe dropout factor can also be changed during training with the following formalism:\n\n```\ndnn_drop = 0.15*12|0.20*12,0.15,0.15*10|0.20*14,0.15,0.0\n```\nWith the line above, we can set a different dropout rate for different layers and for different epochs. \nFor instance, the first hidden layer will have a dropout rate of 0.15 for the first 12 epochs and 0.20 for the following 12. The dropout factor of the second layer, instead, will remain constant at 0.15 over all the training. The same formalism is used for all the layers. Note that \"|\" indicates a change of the dropout factor within the same layer, while \",\" separates different layers.
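To make the notation concrete, the following stand-alone sketch (not the toolkit's actual parser) expands such a specification into one value per epoch:\n```\n# Toy expansion of the scheduler notation: \"0.08*10|0.04*5\" -> [0.08]*10 + [0.04]*5.\n# A bare value (no \"*\" or \"|\") stays constant; here it is returned as a single entry.\n# Values are returned as floats for simplicity.\ndef expand_schedule(spec):\n    values = []\n    for part in spec.split('|'):\n        value, _, times = part.strip().partition('*')\n        values += [float(value)] * (int(times) if times else 1)\n    return values\n\nprint(expand_schedule('128*12 | 64*10 | 32*2'))  # 24 entries: 128 (x12), 64 (x10), 32 (x2)\n```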
You can take a look at config files where batch sizes, learning rates, and dropout factors are changed, for example here:\n```\ncfg\u002FTIMIT_baselines\u002FTIMIT_mfcc_basic_flex.cfg\n```\nor here:\n```\ncfg\u002FTIMIT_baselines\u002FTIMIT_liGRU_fmllr_lr_schedule.cfg\n```\n\n\n\n\n## How can I contribute to the project\nThe project is still in its initial phase, and we invite all potential contributors to participate. We hope to build a community of developers large enough to progressively maintain, improve, and expand the functionalities of our current toolkit. For instance, it would be helpful to report any bug or suggestion to improve the current version of the code. People can also contribute by adding new neural models, which can enrich the set of currently implemented architectures.\n\n\n\n\n## [EXTRA]\n## Speech recognition from the raw waveform with SincNet\n\n[Take a look into our video introduction to SincNet](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=mXQBObRGUgk&feature=youtu.be)\n\nSincNet is a convolutional neural network recently proposed to process raw audio waveforms. In particular, SincNet encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only the low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end that only depends on a few parameters with a clear physical meaning.\n\nFor a more detailed description of the SincNet model, please refer to the following papers:\n\n- *M. Ravanelli, Y. Bengio, \"Speaker Recognition from raw waveform with SincNet\", in Proc. of SLT 2018 [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1808.00158)*\n\n- *M. Ravanelli, Y. Bengio, \"Interpretable Convolutional Filters with SincNet\", in Proc. of NIPS@IRASL 2018 [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.09725)*\n\nTo use this model for speech recognition on TIMIT, take the following steps:\n1. Follow the steps described in the “TIMIT tutorial”.\n2. Save the raw waveform into the Kaldi ark format. To do so, you can use the save_raw_fea.py utility in our repository. The script saves the input signals into a binary Kaldi archive, keeping the alignments with the pre-computed labels. You have to run it for all the data chunks (e.g., train, dev, test). You can also specify the length of the speech chunk (*sig_wlen=200 # ms*) composing each frame.\n3. Open the *cfg\u002FTIMIT_baselines\u002FTIMIT_SincNet_raw.cfg* file, change your paths, and run:\n```    \npython .\u002Frun_exp.py cfg\u002FTIMIT_baselines\u002FTIMIT_SincNet_raw.cfg\n```    \n\n4. With this architecture, we have obtained a **PER(%)=17.1%**. A standard CNN fed with the same features gives us a **PER(%)=18.1%**. Please, see [here](https:\u002F\u002Fbitbucket.org\u002Fmravanelli\u002Fpytorch-kaldi-exp-timit\u002Fsrc\u002Fmaster\u002F) to take a look into our results. 
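For intuition about the parametrization described above, the band-pass kernel can be written as the difference of two sinc low-pass filters, so that only the two cutoff frequencies would be learnable. This is a toy numpy sketch of that idea, not the toolkit's actual SincNet code (cutoffs are normalized frequencies; the kernel size is arbitrary):\n```\nimport numpy as np\n\n# Ideal low-pass impulse response with normalized cutoff f: h[n] = 2f * sinc(2f * n).\ndef sinc_bandpass(f_low, f_high, kernel_size=101):\n    t = np.arange(kernel_size) - (kernel_size - 1) \u002F 2\n    low_pass = lambda f: 2 * f * np.sinc(2 * f * t)\n    window = np.hamming(kernel_size)            # smooth the truncation\n    return (low_pass(f_high) - low_pass(f_low)) * window\n\nkernel = sinc_bandpass(0.05, 0.15)  # one filter of the learnable filter-bank front-end\nprint(kernel.shape)                 # (101,)\n```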
Our results with SincNet outperform those obtained with MFCCs and FBANKs fed into standard feed-forward networks.\n\nIn the following table, we compare the results of SincNet with those of other feed-forward neural networks:\n\n| Model  | PER(\\%) | \n| ------ | -----|\n|  MLP -fbank  | 18.7 | \n|  MLP -mfcc  | 18.2 | \n|  CNN -raw  | 18.1 | \n|SincNet -raw | **17.2**  | \n\n\n## Joint training between speech enhancement and ASR\nIn this section, we show how to use PyTorch-Kaldi to jointly train a cascade of speech enhancement and speech recognition neural networks. The speech enhancement network has the goal of improving the quality of the speech signal by minimizing the MSE between clean and noisy features. The enhanced features then feed another neural network that predicts context-dependent phone states.\n\nIn the following, we report a toy-task example based on a reverberated version of TIMIT, which is only intended to show how users should set the config file to train such a combination of neural networks. Even though some implementation details (and the adopted datasets) are different, this tutorial is inspired by this paper:\n\n- *M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, \"Batch-normalized joint training for DNN-based distant speech recognition\", in Proceedings of SLT 2016 [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1703.08471)*\n\n\nTo run the system, do the following steps:\n\n1- Make sure you have the standard clean version of TIMIT available.\n\n2- Run the *Kaldi s5* baseline of TIMIT. This step is necessary to compute the clean features (that will be the labels of the speech enhancement system) and the alignments (that will be the labels of the speech recognition system). We recommend running the full timit s5 recipe (including the DNN training).\n\n3- The standard TIMIT recipe uses MFCC features. In this tutorial, instead, we use FBANK features. To compute FBANK features, run the following script in *$KALDI_ROOT\u002Fegs\u002FTIMIT\u002Fs5*:\n```    \nfeadir=fbank\n\nfor x in train dev test; do\n  steps\u002Fmake_fbank.sh --cmd \"$train_cmd\" --nj $feats_nj data\u002F$x exp\u002Fmake_fbank\u002F$x $feadir\n  steps\u002Fcompute_cmvn_stats.sh data\u002F$x exp\u002Fmake_fbank\u002F$x $feadir\ndone\n```    \nNote that we use 40 FBANKs here, while Kaldi uses 23 FBANKs by default. To compute 40-dimensional features, go into \"$KALDI_ROOT\u002Fegs\u002FTIMIT\u002Fconf\u002Ffbank.conf\" and change the number of considered output filters.\n\n\n4- Go to [this external repository](https:\u002F\u002Fgithub.com\u002Fmravanelli\u002FpySpeechRev\u002Fblob\u002Fmaster\u002FREADME.md) and follow the steps to generate a reverberated version of TIMIT starting from the clean one. Note that this is just a *toy task* that is only helpful to show how to set up a joint-training system.\n\n5- Compute the FBANK features for the TIMIT_rev dataset. To do so, you can copy the scripts in *$KALDI_ROOT\u002Fegs\u002FTIMIT\u002F* into *$KALDI_ROOT\u002Fegs\u002FTIMIT_rev\u002F*. Please also copy the data folder. Note that the audio files in the TIMIT_rev folders are saved in the standard *WAV* format, while TIMIT is released in the *SPHERE* format. To bypass this issue, open the files *data\u002Ftrain\u002Fwav.scp*, *data\u002Fdev\u002Fwav.scp*, and *data\u002Ftest\u002Fwav.scp* and delete the part about *SPHERE* reading (e.g., *\u002Fhome\u002Fmirco\u002Fkaldi-trunk\u002Ftools\u002Fsph2pipe_v2.5\u002Fsph2pipe -f wav*). You also have to change the paths from the standard TIMIT to the reverberated one (e.g., replace \u002FTIMIT\u002F with \u002FTIMIT_rev\u002F), and remember to remove the final pipe symbol “|”.
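For instance, a *wav.scp* entry would change roughly as follows (the utterance id and data paths are purely illustrative):\n```\n# before: SPHERE file decoded on the fly through a pipe\nfaem0_si1392 \u002Fhome\u002Fmirco\u002Fkaldi-trunk\u002Ftools\u002Fsph2pipe_v2.5\u002Fsph2pipe -f wav \u002Fdata\u002FTIMIT\u002Ftrain\u002Fdr2\u002Ffaem0\u002Fsi1392.wav |\n# after: plain WAV file read directly\nfaem0_si1392 \u002Fdata\u002FTIMIT_rev\u002Ftrain\u002Fdr2\u002Ffaem0\u002Fsi1392.wav\n```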
Save the changes and run the computation of the fbank features in this way:\n\n``` \nfeadir=fbank\n\nfor x in train dev test; do\n  steps\u002Fmake_fbank.sh --cmd \"$train_cmd\" --nj $feats_nj data\u002F$x exp\u002Fmake_fbank\u002F$x $feadir\n  steps\u002Fcompute_cmvn_stats.sh data\u002F$x exp\u002Fmake_fbank\u002F$x $feadir\ndone\n```\nRemember to change the $KALDI_ROOT\u002Fegs\u002FTIMIT_rev\u002Fconf\u002Ffbank.conf file in order to compute 40 features rather than the 23 FBANKs of the default configuration.\n\n6- Once the features are computed, open the following config file: \n\n```\ncfg\u002FTIMIT_baselines\u002FTIMIT_rev\u002FTIMIT_joint_training_liGRU_fbank.cfg\n``` \n\nRemember to change the paths according to where the data are stored on your machine. As you can see, we consider two types of features. The *fbank_rev* features are computed from the TIMIT_rev dataset, while the *fbank_clean* features are derived from the standard TIMIT dataset and are used as targets for the speech enhancement neural network. \nAs you can see in the *[model]* section of the config file, we have the cascade between the networks doing speech enhancement and speech recognition. The speech recognition architecture jointly estimates both context-dependent and monophone targets (thus using the so-called monophone regularization). \nTo run an experiment, type the following command:\n``` \npython run_exp.py  cfg\u002FTIMIT_baselines\u002FTIMIT_rev\u002FTIMIT_joint_training_liGRU_fbank.cfg\n``` \n\n7- Results\nWith this configuration file, you should obtain a **Phone Error Rate (PER)=28.1%**. Note that some oscillation around this performance is natural and is due to the different initializations of the neural parameters.\n\nYou can take a closer look into our results [here](https:\u002F\u002Fbitbucket.org\u002Fmravanelli\u002Fpytorch-kaldi-exp-timit\u002Fsrc\u002Fmaster\u002FTIMIT_rev\u002FTIMIT_joint_training_liGRU_fbank\u002F).\n\n## Distant Speech Recognition with DIRHA\nIn this tutorial, we use the DIRHA-English dataset to perform a distant speech recognition experiment. The DIRHA-English dataset is a **multi-microphone speech corpus** developed under the EC project DIRHA. The corpus is composed of both real and simulated sequences recorded with 32 sample-synchronized microphones in a domestic environment. The database contains signals of different characteristics in terms of noise and reverberation, making it suitable for various multi-microphone signal processing and distant speech recognition tasks. The part of the dataset currently released is composed of 6 native US speakers (3 males, 3 females) uttering 409 Wall Street Journal sentences. The training data have been created using a realistic data contamination approach, which is based on contaminating the clean-speech *wsj-5k* sentences with high-quality multi-microphone impulse responses measured in the targeted environment. For more details on this dataset, please refer to the following papers:\n\n- *M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, M. Omologo, \"The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments\", in Proceedings of ASRU 2015. [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.02560)*\n\n- *M. Ravanelli, P. Svaizer, M. 
Omologo, \"Realistic Multi-Microphone Data Simulation for Distant Speech Recognition\", in Proceedings of Interspeech 2016. [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1711.09470)*\n\nIn this tutorial, we use the aforementioned simulated data for training (using the LA6 microphone), while testing is performed on the real recordings (LA6). This task is very realistic, but also very challenging. The speech signals are characterized by a reverberation time of about 0.7 seconds. Non-stationary domestic noises (such as vacuum cleaners, steps, phone rings, etc.) are also present in the real recordings.\n\n\nLet’s now start with the practical tutorial.\n\n1- If not available, [download the DIRHA dataset from the LDC website](https:\u002F\u002Fcatalog.ldc.upenn.edu\u002FLDC2018S01). LDC releases the full dataset for a small fee.\n\n2- [Go to this external repository](https:\u002F\u002Fgithub.com\u002FSHINE-FBK\u002FDIRHA_English_wsj). As reported in this repository, you have to generate the contaminated WSJ dataset with the provided MATLAB script. Then, you can run the proposed Kaldi baseline to have features and labels ready for our pytorch-kaldi toolkit.\n\n3- Open the following configuration file:\n``` \ncfg\u002FDIRHA_baselines\u002FDIRHA_liGRU_fmllr.cfg\n``` \nThis configuration file implements a simple RNN model based on a Light Gated Recurrent Unit (Li-GRU). We use fMLLR features as input. Change the paths and run the following command:\n\n``` \npython run_exp.py cfg\u002FDIRHA_baselines\u002FDIRHA_liGRU_fmllr.cfg\n``` \n\n4- Results:\nThe aforementioned system should provide a **Word Error Rate (WER%)=23.2%**. \nYou can find the results obtained by us [here](https:\u002F\u002Fbitbucket.org\u002Fmravanelli\u002Fpytorch-kaldi-dirha-exp\u002F). \n\nUsing the other configuration files in the *cfg\u002FDIRHA_baselines* folder, you can perform experiments with different setups. With the provided configuration files, you can obtain the following results:\n\n| Model  | WER(\\%) | \n| ------ | -----|\n|  MLP  | 26.1 | \n|GRU| 25.3 | \n|Li-GRU| **23.8**  | \n\n## Training an autoencoder\nThe current version of the repository is mainly designed for speech recognition experiments. We are actively working on a new version, which is much more flexible and can manage inputs\u002Foutputs different from Kaldi features\u002Flabels. Even with the current version, however, it is possible to implement other systems, such as an autoencoder.\n\nAn autoencoder is a neural network whose inputs and outputs are the same. The middle layer normally contains a bottleneck that forces our representations to compress the information of the input. In this tutorial, we provide a toy example based on the TIMIT dataset. For instance, see the following configuration file:\n``` \ncfg\u002FTIMIT_baselines\u002FTIMIT_MLP_fbank_autoencoder.cfg\n``` \nOur inputs are the standard 40-dimensional fbank coefficients, gathered using a context window of 11 frames (i.e., the total dimensionality of our input is 440). A feed-forward neural network (called MLP_encoder) encodes our features into a 100-dimensional representation. The decoder (called MLP_decoder) is fed by the learned representations and tries to reconstruct the input. The system is trained with the **Mean Squared Error (MSE)** metric.\nNote that in the [Model] section we added the line “err_final=cost_err(dec_out,lab_cd)” at the end. The current version of the toolkit, in fact, by default requires that at least one label is specified (we will remove this limitation in the next version). 
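Conceptually, the trained system corresponds to something like the following plain-PyTorch sketch (illustrative only; the dimensions come from the description above, while the exact layer structure is an assumption):\n```\nimport torch\nimport torch.nn as nn\n\n# 440-dim input (40 fbanks x 11-frame context window), 100-dim bottleneck.\nencoder = nn.Sequential(nn.Linear(440, 100), nn.ReLU())\ndecoder = nn.Linear(100, 440)\n\nx = torch.randn(128, 440)                # a batch of context-window features\nreconstruction = decoder(encoder(x))\nloss = nn.functional.mse_loss(reconstruction, x)\nprint(loss.item())\n```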
You can train the system by running the following command:\n``` \npython run_exp.py cfg\u002FTIMIT_baselines\u002FTIMIT_MLP_fbank_autoencoder.cfg\n``` \nThe results should look like this:\n\n``` \nep=000 tr=['TIMIT_tr'] loss=0.139 err=0.999 valid=TIMIT_dev loss=0.076 err=1.000 lr_architecture1=0.080000 lr_architecture2=0.080000 time(s)=41\nep=001 tr=['TIMIT_tr'] loss=0.098 err=0.999 valid=TIMIT_dev loss=0.062 err=1.000 lr_architecture1=0.080000 lr_architecture2=0.080000 time(s)=39\nep=002 tr=['TIMIT_tr'] loss=0.091 err=0.999 valid=TIMIT_dev loss=0.058 err=1.000 lr_architecture1=0.040000 lr_architecture2=0.040000 time(s)=39\nep=003 tr=['TIMIT_tr'] loss=0.088 err=0.999 valid=TIMIT_dev loss=0.056 err=1.000 lr_architecture1=0.020000 lr_architecture2=0.020000 time(s)=38\nep=004 tr=['TIMIT_tr'] loss=0.087 err=0.999 valid=TIMIT_dev loss=0.055 err=0.999 lr_architecture1=0.010000 lr_architecture2=0.010000 time(s)=39\nep=005 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.005000 lr_architecture2=0.005000 time(s)=39\nep=006 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.002500 lr_architecture2=0.002500 time(s)=39\nep=007 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.001250 lr_architecture2=0.001250 time(s)=39\nep=008 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=0.999 lr_architecture1=0.000625 lr_architecture2=0.000625 time(s)=41\nep=009 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=0.999 lr_architecture1=0.000313 lr_architecture2=0.000313 time(s)=38\n```\nYou should only consider the \"loss=\" field. The \"err=\" field does not contain useful information in this case (for the aforementioned reason).\nYou can take a look into the generated features by typing the following command:\n\n``` \ncopy-feats ark:exp\u002FTIMIT_MLP_fbank_autoencoder\u002Fexp_files\u002Fforward_TIMIT_test_ep009_ck00_enc_out.ark  ark,t:- | more\n``` \n\n## References\n[1] M. Ravanelli, T. Parcollet, Y. Bengio, \"The PyTorch-Kaldi Speech Recognition Toolkit\", [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.07453)\n\n[2] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, \"Improving speech recognition by revising gated recurrent units\", in Proceedings of Interspeech 2017. [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.00641)\n\n[3] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, \"Light Gated Recurrent Units for Speech Recognition\", in IEEE Transactions on Emerging Topics in Computational Intelligence. [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.10225)\n\n[4] M. Ravanelli, \"Deep Learning for Distant Speech Recognition\", PhD Thesis, Unitn 2017. [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.06086)\n\n[5] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, C. Trabelsi, R. De Mori, Y. Bengio, \"Quaternion Recurrent Neural Networks\", in Proceedings of ICLR 2019 [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1806.04418)\n\n[6] T. Parcollet, M. Morchid, G. Linarès, R. 
De Mori, \"Bidirectional Quaternion Long-Short Term Memory Recurrent Neural Networks for Speech Recognition\", in Proceedings of ICASSP 2019 [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.02566)\n\n\n","# PyTorch-Kaldi 语音识别工具包\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmravanelli_pytorch-kaldi_readme_29d0b7fd6555.png\" width=\"220\" img align=\"left\">\nPyTorch-Kaldi 是一个用于开发最先进 DNN\u002FHMM 语音识别系统的开源仓库。其中，DNN 部分由 PyTorch 管理，而特征提取、标签计算和解码则使用 Kaldi 工具包完成。\n\n本仓库包含 PyTorch-Kaldi 工具包的最新版本（PyTorch-Kaldi-v1.0）。如需查看上一版本（PyTorch-Kaldi-v0.1），[请点击此处](https:\u002F\u002Fbitbucket.org\u002Fmravanelli\u002Fpytorch-kaldi-v0.0\u002Fsrc\u002Fmaster\u002F)。\n\n如果您使用了此代码或其部分内容，请引用以下论文：\n\n*M. Ravanelli, T. Parcollet, Y. Bengio, “The PyTorch-Kaldi Speech Recognition Toolkit”, [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.07453)*\n\n```\n@inproceedings{pytorch-kaldi,\ntitle    = {The PyTorch-Kaldi Speech Recognition Toolkit},\nauthor    = {M. Ravanelli and T. Parcollet and Y. Bengio},\nbooktitle    = {In Proc. of ICASSP},\nyear    = {2019}\n}\n```\n\n该工具包采用 **知识共享署名 4.0 国际许可协议** 发布。您可以出于研究、商业及非商业目的复制、分发或修改代码。我们仅要求您引用上述论文。\n\n为提高语音识别结果的透明度和可复现性，我们允许用户在此仓库中发布其 PyTorch-Kaldi 模型。如有意发布，请随时与我们联系（或提交拉取请求）。此外，若您的论文使用了 PyTorch-Kaldi，也可在本仓库中进行宣传。\n\n[观看 PyTorch-Kaldi 工具包简短介绍视频](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=VDQaf0SS4K0&t=2s)\n\n## SpeechBrain \n我们很高兴地宣布，SpeechBrain 项目（https:\u002F\u002Fspeechbrain.github.io\u002F）现已公开！我们强烈建议用户迁移到 [Speechbrain](https:\u002F\u002Fspeechbrain.github.io\u002F)。这是一个更优秀的项目，已支持多种语音处理任务，如语音识别、说话人识别、SLU、语音增强、语音分离、多麦克风信号处理等。\n\n我们的目标是开发一个 *单一*、*灵活*且 *用户友好* 的工具包，可用于轻松构建最先进的语音系统，涵盖端到端和 HMM-DNN 语音识别、说话人识别、语音分离、多麦克风信号处理（如波束成形）、自监督学习等多种任务。\n\n该项目由 Mila 主导，并得到三星、Nvidia 和杜比的支持。SpeechBrain 还将受益于 Facebook\u002FPyTorch、IBMResearch 和 FluentAI 等公司的合作与专业知识。\n\n我们正在积极寻找合作伙伴。如果您有意合作，请随时通过 speechbrainproject@gmail.com 与我们联系。\n\n得益于赞助商的支持，我们还能够招聘在 Mila 从事 SpeechBrain 项目的实习生。理想的候选人应为具有 PyTorch 和语音技术经验的博士生（请将简历发送至 speechbrainproject@gmail.com）。\n\nSpeechBrain 的开发还需要几个月时间才能形成可用的仓库。在此期间，我们将继续为 PyTorch-Kaldi 项目提供支持。\n\n敬请期待！\n\n\n## 目录\n* [简介](#introduction)\n* [先决条件](#prerequisites)\n* [安装方法](#how-to-install)\n* [近期更新](#recent-updates)\n* [教程：](#timit-tutorial)\n  * [TIMIT 教程](#timit-tutorial)\n  * [Librispeech 教程](#librispeech-tutorial)\n* [工具包概述：](#overview-of-the-toolkit-architecture)\n  * [工具包架构](#overview-of-the-toolkit-architecture)\n  * [配置文件说明](#description-of-the-configuration-files)\n* [常见问题解答：](#how-can-i-plug-in-my-model)\n  * [如何插入我的模型？](#how-can-i-plug-in-my-model)\n  * [如何调整超参数？](#how-can-i-tune-the-hyperparameters)\n  * [如何使用自己的数据集？](#how-can-i-use-my-own-dataset)\n  * [如何插入自己的特征？](#how-can-i-plug-in-my-own-features)\n  * [如何转录自己的音频文件？](#how-can-i-transcript-my-own-audio-files)\n  * [批量大小、学习率和丢弃率调度器](#Batch-size,-learning-rate,-and-dropout-scheduler)\n  * [如何为项目做出贡献？](#how-can-i-contribute-to-the-project)\n* [附加内容：](#speech-recognition-from-the-raw-waveform-with-sincnet)  \n  * [使用 SincNet 从原始波形进行语音识别](#speech-recognition-from-the-raw-waveform-with-sincnet)\n  * [语音增强与 ASR 的联合训练](#joint-training-between-speech-enhancement-and-asr)\n  * [使用 DIRHA 进行远场语音识别](#distant-speech-recognition-with-dirha)\n  * [训练自编码器](#training-an-autoencoder)\n* [参考文献](#references)\n\n\n## 简介\nPyTorch-Kaldi 项目旨在弥合 Kaldi 和 PyTorch 工具包之间的差距，同时继承 Kaldi 的高效性和 PyTorch 的灵活性。PyTorch-Kaldi 不仅仅是一个简单的接口，它还内置了多项实用功能，用于开发现代语音识别器。例如，代码专门设计为可自然地插入用户自定义的声学模型。作为替代方案，用户也可以利用多个预置的神经网络，并通过直观的配置文件进行定制。PyTorch-Kaldi 
支持多路特征和标签流，以及神经网络的组合，从而实现复杂的神经架构。该工具包随丰富的文档一起公开发布，专为本地环境或高性能计算集群运行而设计。\n\n新版 PyTorch-Kaldi 工具包的一些特性：\n\n- 与 Kaldi 的易用接口。\n- 易于插入用户自定义模型。\n- 多个预置模型（MLP、CNN、RNN、LSTM、GRU、Li-GRU、SincNet）。\n- 基于多路特征、标签和复杂神经架构的自然实现。\n- 简单灵活的配置文件。\n- 自动从上次处理的片段恢复。\n- 自动对输入特征进行分块和上下文扩展。\n- 多 GPU 训练。\n- 适用于本地环境或 HPC 集群。\n- 提供 TIMIT 和 Librispeech 数据集的教程。\n\n## 前置条件\n1. 如果尚未完成，请安装 Kaldi（http:\u002F\u002Fkaldi-asr.org\u002F）。按照安装说明，别忘了将 Kaldi 二进制文件的路径添加到 $HOME\u002F.bashrc 中。例如，确保 .bashrc 包含以下路径：\n```\nexport KALDI_ROOT=\u002Fhome\u002Fmirco\u002Fkaldi-trunk\nPATH=$PATH:$KALDI_ROOT\u002Ftools\u002Fopenfst\nPATH=$PATH:$KALDI_ROOT\u002Fsrc\u002Ffeatbin\nPATH=$PATH:$KALDI_ROOT\u002Fsrc\u002Fgmmbin\nPATH=$PATH:$KALDI_ROOT\u002Fsrc\u002Fbin\nPATH=$PATH:$KALDI_ROOT\u002F\u002Fsrc\u002Fnnetbin\nexport PATH\n```\n请记得将 KALDI_ROOT 变量替换为您自己的路径。作为首次测试以检查安装是否成功，请打开一个 bash 终端，输入“copy-feats”或“hmm-info”，并确保没有出现任何错误。\n\n2. 如果尚未完成，请安装 PyTorch（http:\u002F\u002Fpytorch.org\u002F）。我们已在 PyTorch 1.0 和 PyTorch 0.4 上测试过我们的代码。使用较旧版本的 PyTorch 很可能会引发错误。要检查您的安装情况，输入“python”，进入控制台后输入“import torch”，并确保没有出现任何错误。\n\n3. 我们建议在配备 GPU 的机器上运行代码。请确保已安装并正确配置 CUDA 库（https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads）。我们已在 CUDA 9.0、9.1 和 8.0 上测试过我们的系统。请确认 Python 已安装（代码已在 Python 2.7 和 Python 3.7 上测试过）。虽然不是强制要求，但我们建议使用 Anaconda（https:\u002F\u002Fanaconda.org\u002Fanaconda\u002Fpython）。\n\n## 最新更新\n\n**2019年2月19日：更新内容：**\n- 现在可以在训练过程中动态调整批量大小、学习率和 dropout 概率。为此，我们在配置文件中实现了一个调度器，支持以下格式：\n```\nbatch_size_train = 128*12 | 64*10 | 32*2\n```\n上述语句表示：先用 128 个批次训练 12 个 epoch，再用 64 个批次训练 10 个 epoch，最后用 32 个批次训练 2 个 epoch。类似的方式也可用于学习率和 dropout 的调度。[更多信息请参见本节](#batch-size,-learning-rate,-and-dropout-scheduler)。\n\n**2019年2月5日：更新内容：**\n1. 我们的工具包现在支持并行数据加载（即在处理当前数据块的同时，将下一个数据块加载到内存中），从而显著提升速度。\n2. 在进行单音素正则化时，用户现在可以设置“dnn_lay = N_lab_out_mono”。这样，单音素的数量将由我们的工具包自动推断出来。\n3. 我们将 [kaldi-io-for-python](https:\u002F\u002Fgithub.com\u002Fvesis84\u002Fkaldi-io-for-python) 项目的 kaldi-io 工具包集成到了 data_io-py 中。\n4. 我们为 SincNet 提供了更优的超参数设置（[参见本节](#speech-recognition-from-the-raw-waveform-with-sincnet)）。\n5. 我们发布了基于 DIRHA 数据集的一些基准模型（[参见本节](#distant-speech-recognition-with-dirha)）。此外，我们还提供了一些简单自编码器的配置示例（[参见本节](#training-an-autoencoder)）以及联合训练语音增强和语音识别模块的系统配置示例（[参见本节](#joint-training-between-speech-enhancement-and-asr)）。\n6. 我们修复了一些小 bug。\n\n**关于下一版本的说明：**\n在下一版本中，我们计划进一步扩展工具包的功能，支持更多模型和特征格式。目标是使我们的工具包适用于其他语音相关任务，如端到端语音识别、说话人辨识、关键词检测、语音分离、语音活动检测、语音增强等。如果您希望提出一些新颖的功能，请通过填写这份问卷向我们反馈（https:\u002F\u002Fdocs.google.com\u002Fforms\u002Fd\u002F12jd-QP5m8NAJVpiypvtVGy1n_d2iuWaLozXq5hsg4yA\u002Fedit?usp=sharing）。\n\n\n\n## 安装方法\n要安装 PyTorch-Kaldi，请按照以下步骤操作：\n\n1. 确保“前置条件”部分中推荐的所有软件均已安装且正常工作。\n2. 克隆 PyTorch-Kaldi 仓库：\n```\ngit clone https:\u002F\u002Fgithub.com\u002Fmravanelli\u002Fpytorch-kaldi\n```\n3. 进入项目文件夹，并使用以下命令安装所需的依赖包：\n```\npip install -r requirements.txt\n```\n\n\n## TIMIT 教程\n以下我们将基于流行的 TIMIT 数据集，提供一份 PyTorch-Kaldi 工具包的简短教程。\n\n1. 确保您已拥有 TIMIT 数据集。如果没有，可以从 LDC 网站下载（https:\u002F\u002Fcatalog.ldc.upenn.edu\u002FLDC93S1）。\n\n2. 确保 Kaldi 和 PyTorch 的安装均无误。同时，请确认您的 KALDI 路径当前可用（请按照“前置条件”部分中的说明将 Kaldi 路径添加到 .bashrc 文件中）。例如，输入“copy-feats”和“hmm-info”，并确保没有出现任何错误。\n\n3. 运行 Kaldi 的 TIMIT s5 基准流程。此步骤对于计算后续用于训练 PyTorch 神经网络的特征和标签至关重要。我们建议运行完整的 timit s5 流程（包括 DNN 训练）：\n```\ncd kaldi\u002Fegs\u002Ftimit\u002Fs5\n.\u002Frun.sh\n.\u002Flocal\u002Fnnet\u002Frun_dnn.sh\n```\n\n这样会生成所有必要的文件，用户可以直接比较 Kaldi 的结果与我们工具包的结果。\n\n4. 
使用以下命令为测试和开发数据计算对齐信息（即音素状态标签）（请进入 $KALDI_ROOT\u002Fegs\u002Ftimit\u002Fs5 目录）。如果要使用 tri3 对齐信息，输入：\n```\nsteps\u002Falign_fmllr.sh --nj 4 data\u002Fdev data\u002Flang exp\u002Ftri3 exp\u002Ftri3_ali_dev\n\nsteps\u002Falign_fmllr.sh --nj 4 data\u002Ftest data\u002Flang exp\u002Ftri3 exp\u002Ftri3_ali_test\n```\n\n如果要使用 dnn 对齐信息（推荐方式），输入：\n```\nsteps\u002Fnnet\u002Falign.sh --nj 4 data-fmllr-tri3\u002Ftrain data\u002Flang exp\u002Fdnn4_pretrain-dbn_dnn exp\u002Fdnn4_pretrain-dbn_dnn_ali\n\nsteps\u002Fnnet\u002Falign.sh --nj 4 data-fmllr-tri3\u002Fdev data\u002Flang exp\u002Fdnn4_pretrain-dbn_dnn exp\u002Fdnn4_pretrain-dbn_dnn_ali_dev\n\nsteps\u002Fnnet\u002Falign.sh --nj 4 data-fmllr-tri3\u002Ftest data\u002Flang exp\u002Fdnn4_pretrain-dbn_dnn exp\u002Fdnn4_pretrain-dbn_dnn_ali_test\n```\n\n5. 本教程从一个基于 mfcc 特征的简单 MLP 网络开始。在启动实验之前，请查看配置文件 *cfg\u002FTIMIT_baselines\u002FTIMIT_MLP_mfcc_basic.cfg*。有关其所有字段的详细说明，请参阅[配置文件说明](#description-of-the-configuration-files)。\n\n6. 根据您的实际路径修改配置文件。具体来说：\n- 将“fea_lst”设置为您 mfcc 训练列表的路径（该列表应位于 $KALDI_ROOT\u002Fegs\u002Ftimit\u002Fs5\u002Fdata\u002Ftrain\u002Ffeats.scp 中）。\n- 将您的路径（例如 $KALDI_ROOT\u002Fegs\u002Ftimit\u002Fs5\u002Fdata\u002Ftrain\u002Futt2spk）添加到“--utt2spk=ark:”中。\n- 添加您的 CMVN 变换文件路径，例如 $KALDI_ROOT\u002Fegs\u002Ftimit\u002Fs5\u002Fmfcc\u002Fcmvn_train.ark。\n- 添加存储标签的文件夹路径（例如 $KALDI_ROOT\u002Fegs\u002Ftimit\u002Fs5\u002Fexp\u002Fdnn4_pretrain-dbn_dnn_ali 用于训练数据，$KALDI_ROOT\u002Fegs\u002Ftimit\u002Fs5\u002Fexp\u002Fdnn4_pretrain-dbn_dnn_ali_dev 用于开发数据）。\n\n为避免错误，请确保配置文件中的所有路径都存在。**请勿使用包含 bash 变量的路径，因为这些路径会被原样读取，而不会自动展开**（例如，应使用 \u002Fhome\u002Fmirco\u002Fkaldi-trunk\u002Fegs\u002Ftimit\u002Fs5\u002Fexp\u002Fdnn4_pretrain-dbn_dnn_ali，而不是 $KALDI_ROOT\u002Fegs\u002Ftimit\u002Fs5\u002Fexp\u002Fdnn4_pretrain-dbn_dnn_ali）。\n\n7. 运行 ASR 实验：\n```\npython run_exp.py cfg\u002FTIMIT_baselines\u002FTIMIT_MLP_mfcc_basic.cfg\n```\n\n该脚本会启动一个完整的 ASR 实验，依次执行训练、验证、前向传播和解码步骤。进度条会显示上述各个阶段的进展。脚本 `run_exp.py` 会在输出目录中逐步生成以下文件：\n\n- *res.res*：总结不同验证轮次下的训练和验证性能。\n- *log.log*：记录可能的错误和警告信息。\n- *conf.cfg*：配置文件的副本。\n- *model.svg*：展示所用模型结构及各神经网络连接方式的图片。对于比当前模型更复杂的架构（例如基于多个神经网络的模型），此图非常有助于调试。\n- 文件夹 *exp_files* 包含多个文件，用于总结各轮次训练和验证的进展。例如，*.info* 文件报告每个批次的损失、误差及训练时间等信息；*.cfg* 文件是对应批次的配置文件（详见整体架构）；而 *.lst* 文件则列出用于训练该批次的具体特征列表。\n- 训练结束后，会生成一个名为 *generated outputs* 的目录，其中包含各训练轮次的损失与误差曲线图。\n\n**请注意，您可以随时停止实验。** 如果再次运行该脚本，它将自动从上次正确处理的批次继续执行。训练时间可能需要几小时，具体取决于可用 GPU 的性能。另外，如果您希望修改配置文件中的某些参数（如 n_chunks=、fea_lst=、batch_size_train= 等），则必须指定不同的输出目录（output_folder=）。\n\n**调试提示：** 如果遇到错误，建议按以下步骤排查：\n1. 检查标准输出。\n2. 若无帮助，请查看 log.log 文件。\n3. 检查 core.py 库中的 run_nn 函数，在函数的不同部分添加打印语句，以定位问题并找出原因。\n\n8. 
训练结束后，电话错误率 (PER%) 会被追加到 res.res 文件中。要查看解码结果的更多细节，可以进入输出目录中的 “decoding_test” 文件夹，查看生成的各种文件。对于本示例，我们得到了如下 res.res 文件：\n\n```\nep=000 tr=['TIMIT_tr'] loss=3.398 err=0.721 valid=TIMIT_dev loss=2.268 err=0.591 lr_architecture1=0.080000 time(s)=86\nep=001 tr=['TIMIT_tr'] loss=2.137 err=0.570 valid=TIMIT_dev loss=1.990 err=0.541 lr_architecture1=0.080000 time(s)=87\nep=002 tr=['TIMIT_tr'] loss=1.896 err=0.524 valid=TIMIT_dev loss=1.874 err=0.516 lr_architecture1=0.080000 time(s)=87\nep=003 tr=['TIMIT_tr'] loss=1.751 err=0.494 valid=TIMIT_dev loss=1.819 err=0.504 lr_architecture1=0.080000 time(s)=88\nep=004 tr=['TIMIT_tr'] loss=1.645 err=0.472 valid=TIMIT_dev loss=1.775 err=0.494 lr_architecture1=0.080000 time(s)=89\nep=005 tr=['TIMIT_tr'] loss=1.560 err=0.453 valid=TIMIT_dev loss=1.773 err=0.493 lr_architecture1=0.080000 time(s)=88\n.........\nep=020 tr=['TIMIT_tr'] loss=0.968 err=0.304 valid=TIMIT_dev loss=1.648 err=0.446 lr_architecture1=0.002500 time(s)=89\nep=021 tr=['TIMIT_tr'] loss=0.965 err=0.304 valid=TIMIT_dev loss=1.649 err=0.446 lr_architecture1=0.002500 time(s)=90\nep=022 tr=['TIMIT_tr'] loss=0.960 err=0.302 valid=TIMIT_dev loss=1.652 err=0.447 lr_architecture1=0.001250 time(s)=88\nep=023 tr=['TIMIT_tr'] loss=0.959 err=0.301 valid=TIMIT_dev loss=1.651 err=0.446 lr_architecture1=0.000625 time(s)=88\n%WER 18.1 | 192 7215 | 84.0 11.9 4.2 2.1 18.1 99.5 | -0.583 | \u002Fhome\u002Fmirco\u002Fpytorch-kaldi-new\u002Fexp\u002FTIMIT_MLP_basic5\u002Fdecode_TIMIT_test_out_dnn1\u002Fscore_6\u002Fctm_39phn.filt.sys\n```\n\n最终获得的 PER(%) 为 18.1%。需要注意的是，由于不同机器上的随机初始化差异，结果可能会有所波动。我们认为，对不同随机种子下得到的性能进行平均（即更改配置文件中的 seed 字段）对于 TIMIT 数据集至关重要，因为自然的性能波动可能会掩盖实验结果。我们在 TIMIT 实验中观察到约 0.2% 的标准偏差。\n\n如果想要更换特征，首先需要用 Kaldi 工具包计算这些特征。要计算 fbank 特征，需打开 $KALDI_ROOT\u002Fegs\u002Ftimit\u002Fs5\u002Frun.sh，并使用以下命令进行计算：\n```\nfeadir=fbank\n\nfor x in train dev test; do\n  steps\u002Fmake_fbank.sh --cmd \"$train_cmd\" --nj $feats_nj data\u002F$x exp\u002Fmake_fbank\u002F$x $feadir\n  steps\u002Fcompute_cmvn_stats.sh data\u002F$x exp\u002Fmake_fbank\u002F$x $feadir\ndone\n```\n\n然后，将上述配置文件更新为新的特征列表。\n如果您已经运行过完整的 TIMIT Kaldi 流程，可以直接在 $KALDI_ROOT\u002Fegs\u002Ftimit\u002Fs5\u002Fdata-fmllr-tri3 中找到 fmllr 特征。\n若将这些特征输入神经网络，由于采用了说话人自适应技术，预计性能会有显著提升。\n\n在 TIMIT_baseline 文件夹中，我们还提供了其他几种可能的 TIMIT 基线示例。与前面的例子类似，您只需输入以下命令即可运行：\n```\npython run_exp.py $cfg_file\n```\n\n其中包含一些基于循环神经网络（TIMIT_RNN*、TIMIT_LSTM*、TIMIT_GRU*、TIMIT_LiGRU*）和 CNN 架构（TIMIT_CNN*）的示例。此外，我们还提出了一种更高级的模型（TIMIT_DNN_liGRU_DNN_mfcc+fbank+fmllr.cfg），该模型结合了前馈神经网络和循环神经网络，并使用 mfcc、fbank 和 fmllr 特征的拼接作为输入。请注意，后一种配置文件对应于参考论文中描述的最佳架构。从上述配置文件可以看出，我们通过引入单音素正则化等技巧来提升 ASR 性能（即同时估计上下文相关和上下文无关的目标）。下表报告了运行这些系统后得到的平均 PER% 结果：\n\n| 模型         | mfcc   | fbank  | fMLLR  |\n| ------------ | ------ | ------ | ------ |\n| Kaldi DNN 基线 | -----  | ------ | 18.5  |\n| MLP          | 18.2   | 18.7   | 16.7   |\n| RNN          | 17.7   | 17.2   | 15.9   |\n| SRU          | -----  | 16.6   | -----  |\n| LSTM         | 15.1   | 14.3   | 14.5   |\n| GRU          | 16.0   | 15.2   | 14.9   |\n| Li-GRU       | **15.5** | **14.9** | **14.2** |\n\n结果显示，正如预期，fMLLR 特征由于其说话人自适应特性，性能明显优于 MFCC 和 Fbank 特征。循环神经网络模型的表现显著优于传统的 MLP 模型，尤其是在使用 LSTM、GRU 和 Li-GRU 架构时，这些架构通过乘性门控机制有效缓解了梯度消失问题。最佳结果 *PER=14.2%* 是由 [Li-GRU 模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.10225.pdf) [2,3] 实现的，该模型仅使用一个门控单元，因此相比标准 GRU 节省了 33% 
的计算量。\n\n实际上，最佳结果是通过一种更复杂的架构获得的，该架构结合了MFCC、FBANK和fMLLR特征（参见*cfg\u002FTIMI_baselines\u002FTIMIT_mfcc_fbank_fmllr_liGRU_best.cfg*）。据我们所知，后一种系统达到的**PER=13.8%**是在TIMIT测试集上已发表的最佳性能。\n\n简单循环单元（SRU）是一种高效且高度可并行化的循环模型。其在自动语音识别任务上的表现不如标准的LSTM、GRU和Li-GRU模型，但速度显著更快。SRU已在[这里](https:\u002F\u002Fgithub.com\u002Ftaolei87\u002Fsru)实现，并在以下论文中进行了描述：\n\nT. Lei, Y. Zhang, S. I. Wang, H. Dai, Y. Artzi, “用于高度可并行化循环的简单循环单元”，EMNLP 2018会议论文。[arXiv](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1709.02755.pdf)\n\n要使用该模型进行实验，请使用配置文件*cfg\u002FTIMIT_baselines\u002FTIMIT_SRU_fbank.cfg*。在此之前，您需要通过```pip install sru```安装该模型，并在*neural_networks.py*中取消注释“import sru”。\n\n您可以通过访问[这里](https:\u002F\u002Fbitbucket.org\u002Fmravanelli\u002Fpytorch-kaldi-exp-timit\u002Fsrc\u002Fmaster\u002F)直接将您的结果与我们的结果进行比较。在这个外部仓库中，您可以找到所有包含生成文件的文件夹。\n\n\n\n## Librispeech教程\n在Librispeech数据集上运行PyTorch-Kaldi的步骤与上述TIMIT的步骤类似。以下教程基于*100小时子集*，但可以轻松扩展到完整数据集（960小时）。\n\n1. 运行Kaldi的Librispeech配方，至少执行到第13阶段（含）。\n2. 将exp\u002Ftri4b\u002Ftrans.*文件复制到exp\u002Ftri4b\u002Fdecode_tgsmall_train_clean_100\u002F：\n```\nmkdir exp\u002Ftri4b\u002Fdecode_tgsmall_train_clean_100 && cp exp\u002Ftri4b\u002Ftrans.* exp\u002Ftri4b\u002Fdecode_tgsmall_train_clean_100\u002F\n```\n3. 通过运行以下脚本计算fmllr特征：\n```\n. .\u002Fcmd.sh ## 您可能需要将cmd.sh修改为适合您系统的设置。\n. .\u002Fpath.sh ## 导入tools\u002Futils（导入queue.pl）\n\ngmmdir=exp\u002Ftri4b\n\nfor chunk in train_clean_100 dev_clean test_clean; do\n    dir=fmllr\u002F$chunk\n    steps\u002Fnnet\u002Fmake_fmllr_feats.sh --nj 10 --cmd \"$train_cmd\" \\\n        --transform-dir $gmmdir\u002Fdecode_tgsmall_$chunk \\\n            $dir data\u002F$chunk $gmmdir $dir\u002Flog $dir\u002Fdata || exit 1\n\n    compute-cmvn-stats --spk2utt=ark:data\u002F$chunk\u002Fspk2utt scp:fmllr\u002F$chunk\u002Ffeats.scp ark:$dir\u002Fdata\u002Fcmvn_speaker.ark\ndone\n```\n\n4. 使用以下命令计算对齐信息：\n```\n# 对dev_clean和test_clean的对齐\nsteps\u002Falign_fmllr.sh --nj 30 data\u002Ftrain_clean_100 data\u002Flang exp\u002Ftri4b exp\u002Ftri4b_ali_clean_100\nsteps\u002Falign_fmllr.sh --nj 10 data\u002Fdev_clean data\u002Flang exp\u002Ftri4b exp\u002Ftri4b_ali_dev_clean_100\nsteps\u002Falign_fmllr.sh --nj 10 data\u002Ftest_clean data\u002Flang exp\u002Ftri4b exp\u002Ftri4b_ali_test_clean_100\n```\n\n5. 
使用以下命令运行实验：\n```\n  python run_exp.py cfg\u002FLibrispeech_baselines\u002Flibri_MLP_fmllr.cfg\n```\n\n如果您想使用循环模型，可以使用*libri_RNN_fmllr.cfg*、*libri_LSTM_fmllr.cfg*、*libri_GRU_fmllr.cfg*或*libri_liGRU_fmllr.cfg*。训练循环模型可能需要几天时间（取决于所使用的GPU）。使用tgsmall图得到的性能如表所示：\n\n| 模型  | WER% | \n| ------ | -----| \n|  MLP  |  9.6 |\n|LSTM   |  8.6  |\n|GRU     | 8.6 | \n|li-GRU| 8.6 |\n\n这些结果是在未添加音素网格重打分的情况下获得的（即仅使用*tgsmall*图）。您可以通过以下方式添加音素网格重打分来提高性能（从PyTorch-Kaldi的*kaldi_decoding_script*文件夹中运行）：\n```\ndata_dir=\u002Fdata\u002Fmilatmp1\u002Fravanelm\u002Flibrispeech\u002Fs5\u002Fdata\u002F\ndec_dir=\u002Fu\u002Fravanelm\u002Fpytorch-Kaldi-new\u002Fexp\u002Flibri_fmllr\u002Fdecode_test_clean_out_dnn1\u002F\nout_dir=\u002Fu\u002Fravanelm\u002Fpytorch-kaldi-new\u002Fexp\u002Flibri_fmllr\u002F\n\nsteps\u002Flmrescore_const_arpa.sh  $data_dir\u002Flang_test_{tgsmall,fglarge} \\\n          $data_dir\u002Ftest_clean $dec_dir $out_dir\u002Fdecode_test_clean_fglarge   || exit 1;\n```\n\n使用重打分（*fglarge*）得到的最终结果如下表所示：\n\n| 模型  | WER% |  \n| ------ | -----| \n|  MLP  |  6.5 |\n|LSTM   |  6.4  |\n|GRU     | 6.3 | \n|li-GRU| **6.2**  |\n\n\n您可以在此处查看获得的结果：[bitbucket.org\u002Fmravanelli\u002Fpytorch-kaldi-exp-librispeech](https:\u002F\u002Fbitbucket.org\u002Fmravanelli\u002Fpytorch-kaldi-exp-librispeech\u002Fsrc\u002Fmaster\u002F)。\n\n\n\n\n## 工具包架构概述\n运行ASR实验的主要脚本是**run_exp.py**。该Python脚本执行训练、验证、前向传播和解码步骤。训练过程会经历多个epoch，逐步处理所有训练数据，并应用所选的神经网络。\n每个训练epoch结束后，都会进行验证步骤，以监控系统在*保留数据*上的性能。训练结束后，通过计算指定测试数据集的后验概率来进行前向传播阶段。这些后验概率会根据先验概率（使用计数文件）进行归一化，并存储到一个ark文件中。随后进行解码步骤，以获取测试句子中说话者所说的最终词序列。\n\n*run_exp.py*脚本会接收一个全局配置文件（例如*cfg\u002FTIMIT_MLP_mfcc.cfg*），该文件指定了运行完整实验所需的所有选项。*run_exp.py*代码会调用另一个函数**run_nn**（见core.py库），该函数会对每一块数据执行训练、验证和前向传播操作。\n*run_nn*函数会接收一个特定于数据块的配置文件（例如exp\u002FTIMIT_MLP_mfcc\u002Fexp_files\u002Ftrain_TIMIT_tr+TIMIT_dev_ep000_ck00.cfg*），该文件指定了运行单个数据块实验所需的所有参数。*run_nn*函数会输出一些信息文件（例如*exp\u002FTIMIT_MLP_mfcc\u002Fexp_files\u002Ftrain_TIMIT_tr+TIMIT_dev_ep000_ck00.info*），总结所处理数据块的损失和错误。\n\n结果会被汇总到*res.res*文件中，而错误和警告则会被重定向到*log.log*文件中。\n\n\n## 配置文件说明：\n配置文件分为两种类型：全局配置文件和特定于数据块的配置文件。它们都采用INI格式，通过Python的configparser库进行读取、处理和修改。\n全局配置文件包含多个部分，指定了语音识别实验的所有主要步骤（训练、验证、前向传播和解码）。\n配置文件的结构在一个原型文件中进行了描述（例如*proto\u002Fglobal.proto*），该文件不仅列出了所有必需的部分和字段，还指定了每个字段的类型。例如，*N_ep=int(1,inf)*表示*N_ep*字段（即训练epoch数）必须是一个从1到无穷大的整数。同样，*lr=float(0,inf)*表示lr字段（即学习率）必须是一个从0到无穷大的浮点数。任何不符合这些规范的配置文件写入尝试都会引发错误。\n\n现在让我们打开一个配置文件（例如*cfg\u002FTIMIT_baselines\u002FTIMIT_MLP_mfcc_basic.cfg*），并描述其中的主要部分：\n\n```\n[cfg_proto]\ncfg_proto = proto\u002Fglobal.proto\ncfg_proto_chunk = proto\u002Fglobal_chunk.proto\n```\n当前版本的配置文件首先在 *[cfg_proto]* 部分中指定了全局和分块专用原型文件的路径。\n\n```\n[exp]\ncmd = \nrun_nn_script = run_nn\nout_folder = exp\u002FTIMIT_MLP_basic5\nseed = 1234\nuse_cuda = True\nmulti_gpu = False\nsave_gpumem = False\nn_epochs_tr = 24\n```\n[exp] 部分包含一些重要字段，例如输出文件夹（*out_folder*）以及分块处理脚本 *run_nn* 的路径（默认情况下，该函数应在 core.py 库中实现）。字段 *N_epochs_tr* 指定了所选的训练轮数。其他选项如 use_cuda、multi_gpu 和 save_gpumem 可由用户启用。字段 *cmd* 可用于附加命令，以便在 HPC 集群上运行脚本。\n\n```\n[dataset1]\ndata_name = TIMIT_tr\nfea = fea_name=mfcc\n    fea_lst=quick_test\u002Fdata\u002Ftrain\u002Ffeats_mfcc.scp\n    fea_opts=apply-cmvn --utt2spk=ark:quick_test\u002Fdata\u002Ftrain\u002Futt2spk  ark:quick_test\u002Fmfcc\u002Ftrain_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |\n    cw_left=5\n    cw_right=5\n    \nlab = lab_name=lab_cd\n    lab_folder=quick_test\u002Fdnn4_pretrain-dbn_dnn_ali\n    lab_opts=ali-to-pdf\n    lab_count_file=auto\n    
lab_data_folder=quick_test\u002Fdata\u002Ftrain\u002F\n    lab_graph=quick_test\u002Fgraph\n    \nn_chunks = 5\n\n[dataset2]\ndata_name = TIMIT_dev\nfea = fea_name=mfcc\n    fea_lst=quick_test\u002Fdata\u002Fdev\u002Ffeats_mfcc.scp\n    fea_opts=apply-cmvn --utt2spk=ark:quick_test\u002Fdata\u002Fdev\u002Futt2spk  ark:quick_test\u002Fmfcc\u002Fdev_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |\n    cw_left=5\n    cw_right=5\n    \nlab = lab_name=lab_cd\n    lab_folder=quick_test\u002Fdnn4_pretrain-dbn_dnn_ali_dev\n    lab_opts=ali-to-pdf\n    lab_count_file=auto\n    lab_data_folder=quick_test\u002Fdata\u002Fdev\u002F\n    lab_graph=quick_test\u002Fgraph\nn_chunks = 1\n\n[dataset3]\ndata_name = TIMIT_test\nfea = fea_name=mfcc\n    fea_lst=quick_test\u002Fdata\u002Ftest\u002Ffeats_mfcc.scp\n    fea_opts=apply-cmvn --utt2spk=ark:quick_test\u002Fdata\u002Ftest\u002Futt2spk  ark:quick_test\u002Fmfcc\u002Ftest_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |\n    cw_left=5\n    cw_right=5\n    \nlab = lab_name=lab_cd\n    lab_folder=quick_test\u002Fdnn4_pretrain-dbn_dnn_ali_test\n    lab_opts=ali-to-pdf\n    lab_count_file=auto\n    lab_data_folder=quick_test\u002Fdata\u002Ftest\u002F\n    lab_graph=quick_test\u002Fgraph\n    \nn_chunks = 1\n```\n\n配置文件包含多个部分（*[dataset1]*、*[dataset2]*、*[dataset3]* 等），用于描述 ASR 实验中使用的所有语料库。*[dataset\\*]* 部分中的字段描述了实验中考虑的所有特征和标签。\n例如，特征是在 *fea:* 字段中指定的，其中 *fea_name* 包含特征的名称，*fea_lst* 是特征列表（以 Kaldi 的 scp 格式表示），*fea_opts* 允许用户指定如何处理这些特征（例如进行 CMVN 或添加导数），而 *cw_left* 和 *cw_right* 则设置了上下文窗口的特性（即要附加的左右帧数）。需要注意的是，当前版本的 PyTorch-Kaldi 工具包支持定义多个特征流。事实上，如 *cfg\u002FTIMIT_baselines\u002FTIMIT_mfcc_fbank_fmllr_liGRU_best.cfg* 所示，可以采用多种特征流（例如 mfcc、fbank、fmllr）。\n\n同样地，*lab* 部分也包含一些子字段。例如，*lab_name* 指的是标签的名称，而 *lab_folder* 则包含了存储由 Kaldi 配方生成的对齐结果的文件夹。*lab_opts* 允许用户指定与所考虑对齐相关的选项。例如，*lab_opts=\"ali-to-pdf\"* 会提取标准的上下文相关音素状态标签，而 *lab_opts=ali-to-phones --per-frame=true* 则可用于提取单音素目标。*lab_count_file* 用于指定包含所考虑音素状态计数的文件。\n这些计数在前向传播阶段非常重要，在该阶段神经网络计算出的后验概率会除以其先验概率。PyTorch-Kaldi 允许用户指定外部计数文件，也可以自动获取（使用 *lab_count_file=auto*）。如果不需要严格使用计数文件，例如当标签对应于不用于生成前向传播中使用的后验概率的输出时（参见 *cfg\u002FTIMIT_baselines\u002FTIMIT_MLP_mfcc.cfg* 中的单音素目标），用户还可以指定 *lab_count_file=none*。而 *lab_data_folder* 则对应于 Kaldi 数据准备过程中创建的数据文件夹，其中包含若干文件，包括最终 WER 计算所用的文本文件。最后一个子字段 *lab_graph* 是用于生成标签的 Kaldi 图的路径。\n\n完整的数据集通常较大，无法完全加载到 GPU 或内存中，因此需要将其拆分为多个分块。PyTorch-Kaldi 会自动将数据集拆分为 *N_chunks* 中指定的分块数量。分块的数量可能取决于具体的数据集。一般来说，我们建议将语音分块处理为约 1 至 2 小时的长度（具体取决于可用内存）。\n\n```\n[data_use]\ntrain_with = TIMIT_tr\nvalid_with = TIMIT_dev\nforward_with = TIMIT_test\n```\n\n这一部分说明了 *[datasets\\*]* 部分中列出的数据如何在 *run_exp.py* 脚本中被使用。\n第一行表示我们将使用名为 *TIMIT_tr* 的数据进行训练。请注意，此数据集名称必须出现在其中一个数据集部分中，否则配置解析器将报错。类似地，第二行和第三行分别指定了用于验证和前向传播阶段的数据。\n\n```\n[batches]\nbatch_size_train = 128\nmax_seq_length_train = 1000\nincrease_seq_length_train = False\nstart_seq_len_train = 100\nmultply_factor_seq_len_train = 2\nbatch_size_valid = 128\nmax_seq_length_valid = 1000\n```\n\n*batch_size_train* 用于定义小批量中的训练样本数量。字段 *max_seq_length_train* 会截断超过指定值的句子。在对非常长的句子训练循环模型时，可能会出现内存不足的问题。通过此选项，我们可以允许用户通过截断长句子来缓解此类内存问题。此外，还可以通过设置 *increase_seq_length_train=True* 来逐步增加训练过程中的最大句子长度。如果启用此选项，训练将从 *start_seq_len_train* 中指定的最大句子长度开始（例如 *start_seq_len_train=100*）。每完成一个 epoch，最大句子长度就会乘以 *multply_factor_seq_len_train*（例如 *multply_factor_seq_len_train=2*）。\n我们观察到，这种简单的策略通常能够提升系统性能，因为它鼓励模型首先关注短期依赖关系，而在后期再学习长期依赖关系。\n\n同样地，*batch_size_valid* 和 *max_seq_length_valid* 分别指定了验证数据集的小批量样本数量以及最大序列长度。\n\n```\n[architecture1]\narch_name = 
MLP_layers1\narch_proto = proto\u002FMLP.proto\narch_library = neural_networks\narch_class = MLP\narch_pretrain_file = none\narch_freeze = False\narch_seq_model = False\ndnn_lay = 1024,1024,1024,1024,N_out_lab_cd\ndnn_drop = 0.15,0.15,0.15,0.15,0.0\ndnn_use_laynorm_inp = False\ndnn_use_batchnorm_inp = False\ndnn_use_batchnorm = True,True,True,True,False\ndnn_use_laynorm = False,False,False,False,False\ndnn_act = relu,relu,relu,relu,softmax\narch_lr = 0.08\narch_halving_factor = 0.5\narch_improvement_threshold = 0.001\narch_opt = sgd\nopt_momentum = 0.0\nopt_weight_decay = 0.0\nopt_dampening = 0.0\nopt_nesterov = False\n```\n\n*[architecture\\*]* 部分用于指定 ASR 实验中所涉及的神经网络架构。字段 *arch_name* 指定架构名称。由于不同的神经网络可能依赖于一组不同的超参数，用户需要在 *proto* 字段中添加包含这些超参数列表的 proto 文件路径。例如，标准 MLP 模型的原型文件包含以下字段：\n```\n[proto]\nlibrary=path\nclass=MLP\ndnn_lay=str_list\ndnn_drop=float_list(0.0,1.0)\ndnn_use_laynorm_inp=bool\ndnn_use_batchnorm_inp=bool\ndnn_use_batchnorm=bool_list\ndnn_use_laynorm=bool_list\ndnn_act=str_list\n```\n\n与其他原型文件类似，每一行定义了一个超参数及其对应的值类型。所有在 proto 文件中定义的超参数都必须出现在全局配置文件的相应 *[architecture\\*]* 部分中。\n\n字段 *arch_library* 指定模型的代码位置（例如 *neural_nets.py*），而 *arch_class* 则表示实现该架构的类名（例如，如果设置 *class=MLP*，则会执行 *from neural_nets.py import MLP*）。\n\n字段 *arch_pretrain_file* 可用于使用先前训练好的架构对神经网络进行预训练；而 *arch_freeze* 如果设为 *False*，则会在训练过程中更新架构参数；若设为 *True*，则在训练期间保持参数固定（即冻结）。*arch_seq_model* 部分指示该架构是序列型（如 RNN）还是非序列型（如前馈 MLP 或 CNN）。PyTorch-Kaldi 在处理输入批次时，这两种情况的处理方式不同。对于循环神经网络（*arch_seq_model=True*），特征序列不会被随机化（以保留序列中的元素顺序），而对于前馈模型（*arch_seq_model=False*），我们会对特征进行随机化（这通常有助于提升性能）。当存在多个架构时，只要其中至少有一个被标记为序列型（*arch_seq_model=True*），就会采用序列处理方式。\n\n需要注意的是，以 “arch_” 和 “opt_” 开头的超参数是必填项，必须出现在配置文件中指定的所有架构中。其他超参数（如 dnn_*）则特定于所考虑的架构（取决于用户实际如何实现 MLP 类），可以定义隐藏层的数量和类型、批归一化和层归一化等参数。\n\n其他重要参数与所考虑架构的优化相关。例如，*arch_lr* 是学习率，而 *arch_halving_factor* 用于实现学习率退火。具体来说，当验证集上连续两轮之间的相对性能提升小于 *arch_improvement_threshold* 所指定的值时（例如，*arch_improvement_threshold*），我们将学习率乘以 *arch_halving_factor*（例如，*arch_halving_factor=0.5*）。字段 *arch_opt* 指定优化算法的类型。目前支持 SGD、Adam 和 Rmsprop。其他参数则特定于所选的优化算法（有关所有优化专用超参数的确切含义，请参阅 PyTorch 文档）。\n\n需要注意的是，在 *[architecture\\*]* 中定义的不同架构可以具有不同的优化超参数，甚至可以使用不同的优化算法。\n\n```\n[model]\nmodel_proto = proto\u002Fmodel.proto\nmodel = out_dnn1=compute(MLP_layers1,mfcc)\n    loss_final=cost_nll(out_dnn1,lab_cd)\n    err_final=cost_err(out_dnn1,lab_cd)\n```    \n\n本部分使用一种非常简单直观的元语言来指定如何将各种特征和架构组合起来。\n字段 *model:* 描述了特征和架构是如何连接在一起，从而生成后验概率输出的。行 *out_dnn1=compute(MLP_layers,mfcc)* 的意思是：“将名为 MLP_layers1 的架构与名为 mfcc 的特征输入结合，并将输出存储到变量 out_dnn1 中。”\n从神经网络输出 *out_dnn1* 中，通过名为 *lab_cd* 的标签计算误差和损失函数，这些标签必须事先在 *[datasets\\*]* 部分中定义好。*err_final* 和 *loss_final* 是必填的子字段，用于定义模型的最终输出。\n\n一个更为复杂的例子（此处仅为了展示工具包的潜力而提及）见于 *cfg\u002FTIMIT_baselines\u002FTIMIT_mfcc_fbank_fmllr_liGRU_best.cfg*：\n```    \n[model]\nmodel_proto=proto\u002Fmodel.proto\nmodel:conc1=concatenate(mfcc,fbank)\n      conc2=concatenate(conc1,fmllr)\n      out_dnn1=compute(MLP_layers_first,conc2)\n      out_dnn2=compute(liGRU_layers,out_dnn1)\n      out_dnn3=compute(MLP_layers_second,out_dnn2)\n      out_dnn4=compute(MLP_layers_last,out_dnn3)\n      out_dnn5=compute(MLP_layers_last2,out_dnn3)\n      loss_mono=cost_nll(out_dnn5,lab_mono)\n      loss_mono_w=mult_constant(loss_mono,1.0)\n      loss_cd=cost_nll(out_dnn4,lab_cd)\n      loss_final=sum(loss_cd,loss_mono_w)     \n      err_final=cost_err(out_dnn4,lab_cd)\n```    \n在这种情况下，我们首先将 mfcc、fbank 和 fmllr 特征拼接在一起，然后将其输入到一个 MLP 中。MLP 的输出再被送入一个循环神经网络（具体来说是 Li-GRU 模型）。随后又经过一层 MLP（*MLP_layers_second*），最后由两个 softmax 分类器（*MLP_layers_last* 和 
*MLP_layers_last2*）进行分类。第一个分类器估计上下文相关的状态，第二个则估计单音素目标。最终的成本函数是这两者预测结果的加权和。通过这种方式实现了单音素正则化，已被证明有助于提升 ASR 性能。\n\n完整模型可以被视为一个大型计算图，其中在“[model]”部分使用的所有基础架构都会被联合训练。对于每个小批量数据，输入特征会通过整个模型进行前向传播，并使用指定的标签计算 cost_final。随后，计算成本函数相对于该架构所有可学习参数的梯度。最后，所采用架构的所有参数将按照“*[architecture\\*]*”章节中指定的算法一起更新。\n```    \n[forward]\nforward_out = out_dnn1\nnormalize_posteriors = True\nnormalize_with_counts_from = lab_cd\nsave_out_file = True\nrequire_decoding = True\n```    \n\n“forward”部分首先定义要传递的输出（必须在“model”部分中定义）。如果 *normalize_posteriors=True*，则这些后验概率会根据先验概率（使用计数文件）进行归一化。如果 *save_out_file=True*，后验概率文件（通常是一个非常大的 ark 文件）会被保存；而如果 *save_out_file=False*，当不再需要时，该文件会被删除。\n\n*require_decoding* 是一个布尔值，用于指定是否需要对指定的输出进行解码。字段 *normalize_with_counts_from* 用于设置用于归一化后验概率的计数文件来源。\n\n```    \n[decoding]\ndecoding_script_folder = kaldi_decoding_scripts\u002F\ndecoding_script = decode_dnn.sh\ndecoding_proto = proto\u002Fdecoding.proto\nmin_active = 200\nmax_active = 7000\nmax_mem = 50000000\nbeam = 13.0\nlatbeam = 8.0\nacwt = 0.2\nmax_arcs = -1\nskip_scoring = false\nscoring_script = local\u002Fscore.sh\nscoring_opts = \"--min-lmwt 1 --max-lmwt 10\"\nnorm_vars = False\n```    \n\n“decoding”部分报告了与解码相关的参数，即从 DNN 提供的上下文相关概率序列转换为单词序列的步骤。“decoding_script_folder”字段指定了存储解码脚本的文件夹。“decoding script”字段是用于解码的脚本（例如 *decode_dnn.sh*），它应该位于之前指定的“decoding_script_folder”中。“decoding_proto”字段列出了所考虑的解码脚本所需的所有参数。\n\n为了使代码更具灵活性，配置参数也可以在命令行中指定。例如，可以运行：\n```    \n python run_exp.py quick_test\u002Fexample_newcode.cfg --optimization,lr=0.01 --batches,batch_size=4\n```    \n脚本会将指定 cfg 文件中的学习率替换为给定的 lr 值。修改后的配置文件随后会保存到 *out_folder\u002Fconfig.cfg* 中。\n\n脚本 *run_exp.py* 会自动创建针对每个数据块的配置文件，这些文件由 *run_nn* 函数用于执行单个数据块的训练。块级配置文件的结构与全局配置文件非常相似。主要区别在于一个字段 *to_do={train, valid, forward}*，它指定了对 *fea* 字段中指定的特征块进行的处理类型。\n\n*为什么使用 proto 文件？*\n不同的神经网络、优化算法和 HMM 解码器可能依赖于不同的超参数集。为了解决这个问题，我们目前的方案是定义一些原型文件（用于全局、块级和架构配置文件）。总体而言，这种方法使得对全局配置文件中指定的字段检查更加透明。此外，它还允许用户轻松添加新参数，而无需更改任何一行 Python 代码。\n\n例如，要添加一个用户自定义的模型，就需要编写一个新的 proto 文件（如 *user-model.prot*o），以指定相应的超参数。然后，用户只需在 *neural_networks.py* 中编写一个类（如 user-model），实现该架构即可。\n\n\n\n## [常见问题解答]\n\n## 如何插入我的模型\n该工具包旨在让用户轻松插入自己的声学模型。要添加自定义的神经网络模型，请按照以下步骤操作：\n\n1. 进入 `proto` 文件夹，创建一个新的 proto 文件（例如：`*proto\u002FmyDNN.proto*`）。proto 文件用于指定模型的超参数列表，这些超参数稍后会被写入配置文件中。为了了解需要在 proto 文件中添加哪些信息，可以查看 `MLP.proto` 文件：\n\n```\n[proto]\ndnn_lay=str_list\ndnn_drop=float_list(0.0,1.0)\ndnn_use_laynorm_inp=bool\ndnn_use_batchnorm_inp=bool\ndnn_use_batchnorm=bool_list\ndnn_use_laynorm=bool_list\ndnn_act=str_list\n```\n\n2. 参数 `dnn_lay` 必须是一个字符串列表；`dnn_drop`（即每一层的 dropout 概率）是一个介于 0.0 和 1.0 之间的浮点数列表；`dnn_use_laynorm_inp` 和 `dnn_use_batchnorm_inp` 是布尔值，分别用于启用或禁用输入端的层归一化和批归一化。`dnn_use_batchnorm` 和 `dnn_use_laynorm` 是布尔值列表，用于逐层决定是否使用批归一化或层归一化。\n\n参数 `dnn_act` 同样是一个字符串列表，用于设置每一层的激活函数。由于每个模型都基于自己的一组超参数，因此不同的模型会有不同的 proto 文件。例如，您可以查看 `GRU.proto`，会发现其超参数列表与标准 MLP 的不同。与前面的例子类似，您应在此处添加自己的超参数列表并保存文件。\n\n3. 
编写一个实现您模型的 PyTorch 类。\n打开 `neural_networks.py` 库，查看其中已实现的一些模型。为简单起见，您可以先从 MLP 类开始。这些类必须包含两个方法：`__init__` 和 `forward`。前者用于初始化网络架构，后者则指定需要执行的计算步骤。\n\n`__init__` 方法接受两个参数，这两个参数会在 `run_nn` 函数中自动计算得出。`inp_dim` 是神经网络输入的维度，而 `options` 是一个字典，包含了配置文件中 `[architecture]` 部分所指定的所有参数。\n\n例如，可以通过以下方式访问各层的 DNN 激活函数：\n```options['dnn_lay'].split(',')```\n\n正如您在 MLP 类中看到的，初始化方法会定义并初始化神经网络的所有参数。`forward` 方法接收一个张量 `x`（即输入数据），并输出另一个包含 `x` 的向量。\n\n如果您的模型是序列模型（即配置文件中至少有一个 `arch_seq_model=true` 的架构），那么 `x` 将是一个形状为 `(time_steps, batches, N_in)` 的张量；否则，它将是一个形状为 `(batches, N_in)` 的矩阵。`forward` 方法定义了将输入张量转换为相应输出张量的计算步骤。对于循环神经网络，输出必须具有序列格式 `(time_steps, batches, N_out)`；而对于前馈网络，则为非序列格式 `(batches, N_out)`。\n\n与已实现的模型类似，用户应编写一个新的类（例如 `myDNN`），以实现自定义模型：\n```\nclass myDNN(nn.Module):\n    \n    def __init__(self, options, inp_dim):\n        super(myDNN, self).__init__()\n             \u002F\u002F 初始化参数\n\n            def forward(self, x):\n                 \u002F\u002F 执行一些计算 out=f(x)\n                  return out\n```\n\n4. 创建配置文件。\n现在您已经定义了自己的模型及其超参数列表，接下来可以创建配置文件。要创建自己的配置文件，您可以参考现有的配置文件（例如，为了简单起见，可以考虑 `cfg\u002FTIMIT_baselines\u002FTIMIT_MLP_mfcc_basic.cfg`）。在定义了所采用的数据集及其相关特征和标签之后，配置文件中会有一些名为 `[architecture*]` 的部分。每个架构实现不同的神经网络。在 `cfg\u002FTIMIT_baselines\u002FTIMIT_MLP_mfcc_basic.cfg` 中，我们只有一个 `[architecture1]`，因为声学模型仅由一个神经网络组成。要添加您自己的神经网络，您需要按照以下方式编写一个架构部分（例如 `[architecture1]`）：\n```\n[architecture1]\narch_name= mynetwork (这是您希望在后续模型部分中用来指代此架构的名称)\narch_proto=proto\u002FmyDNN.proto (此处填写之前定义的 proto 文件名)\narch_library=neural_networks (这是实现 myDNN 的库名)\narch_class=myDNN (这必须是您实现的类名)\narch_pretrain_file=none (通过此选项可指定是否对模型进行预训练)\narch_freeze=False (若希望更新模型参数，则设为 False)\narch_seq_model=False (对于前馈网络设为 False，对于循环网络设为 True)\n```\n\n然后，您需要为 `proto\u002FmyDNN.proto` 中列出的所有超参数指定合适的值。以 `MLP.proto` 为例，我们可以这样设置：\n```\ndnn_lay=1024,1024,1024,1024,1024,N_out_lab_cd\ndnn_drop=0.15,0.15,0.15,0.15,0.15,0.0\ndnn_use_laynorm_inp=False\ndnn_use_batchnorm_inp=False\ndnn_use_batchnorm=True,True,True,True,True,False\ndnn_use_laynorm=False,False,False,False,False,False\ndnn_act=relu,relu,relu,relu,relu,softmax\n```\n\n接着，添加与您自己架构优化相关的参数。您可以在这里使用标准的 SGD、Adam 或 RMSprop（参见 `cfg\u002FTIMIT_baselines\u002FTIMIT_LSTM_mfcc.cfg` 中使用 RMSprop 的示例）：\n```\narch_lr=0.08\narch_halving_factor=0.5\narch_improvement_threshold=0.001\narch_opt=sgd\nopt_momentum=0.0\nopt_weight_decay=0.0\nopt_dampening=0.0\nopt_nesterov=False\n```\n\n5. 将配置文件保存到 `cfg` 文件夹中（例如：`cfg\u002FmyDNN_exp.cfg`）。\n\n6. 运行实验：\n```\npython run_exp.py cfg\u002FmyDNN_exp.cfg\n```\n\n7. 为了调试模型，您可以首先查看标准输出。`run_exp.py` 会自动解析配置文件，并在出现潜在问题时抛出错误。您还可以查看 `log.log` 文件，以获取有关可能错误的更多信息。\n\n在实现新模型时，一项重要的调试测试是进行过拟合实验（以确保模型能够在一个小数据集上过拟合）。如果模型无法过拟合，则说明存在需要解决的重大错误。\n\n8. 
\n\n## 如何调优超参数？\n在深度学习中，通常需要对超参数进行调优，以寻找合适的神经网络架构。为了帮助在 PyTorch-Kaldi 中进行超参数调优，我们实现了一个采用随机搜索方法的简单工具。具体来说，脚本 *tune_hyperparameters.py* 会生成一组随机的配置文件，可以通过以下方式运行：\n```\npython tune_hyperparameters.py cfg\u002FTIMIT_MLP_mfcc.cfg exp\u002FTIMIT_MLP_mfcc_tuning 10 arch_lr=randfloat(0.001,0.01) batch_size_train=randint(32,256) dnn_act=choose_str{relu,relu,relu,relu,softmax|tanh,tanh,tanh,tanh,softmax}\n```\n第一个参数是我们希望修改的参考配置文件，第二个参数是保存随机配置文件的文件夹，第三个参数是要生成的随机配置文件的数量。接下来是一系列需要更改的超参数。例如，*arch_lr=randfloat(0.001,0.01)* 会将 *arch_lr* 字段替换为一个介于 0.001 和 0.01 之间的随机浮点数；*batch_size_train=randint(32,256)* 则会将 *batch_size_train* 替换为 32 到 256 之间的随机整数，以此类推。\n配置文件生成后，可以按顺序或并行地使用以下命令运行：\n```\npython run_exp.py $cfg_file\n```\n\n## 如何使用自己的数据集？\nPyTorch-Kaldi 可以用于任何语音数据集。要使用自己的数据集，步骤与 TIMIT\u002FLibrispeech 教程中介绍的类似。一般来说，你需要执行以下操作：\n1. 使用 Kaldi 的数据处理流程来处理你的数据集。有关数据准备的详细信息，请参阅 Kaldi 官网。\n2. 对训练集、验证集和测试集分别计算对齐结果。\n3. 编写一个 PyTorch-Kaldi 配置文件 *$cfg_file*。\n4. 使用 ```python run_exp.py $cfg_file``` 运行该配置文件。\n\n## 如何接入自定义特征？\n当前版本的 PyTorch-Kaldi 支持以 Kaldi ark 格式存储的输入特征。如果用户希望使用自定义特征进行实验，则必须先将这些特征转换为 ark 格式。有关如何将 NumPy 数组转换为 ark 文件的详细说明，请参阅 Kaldi-io-for-python 的 Git 仓库（https:\u002F\u002Fgithub.com\u002Fvesis84\u002Fkaldi-io-for-python）。\n此外，你还可以查看我们的工具脚本 save_raw_fea.py。该脚本会生成包含原始特征的 Kaldi ark 文件，这些原始特征随后可用于直接以原始波形作为输入来训练神经网络（请参阅关于使用 SincNet 处理音频的部分）。\n\n## 如何转录我自己的音频文件？\n当前版本的 PyTorch-Kaldi 支持使用预训练声学模型转录音频文件的标准生产流程，可以处理单个或多个 .wav 文件。前提是您必须拥有一个已经训练好的 PyTorch-Kaldi 模型。虽然此时不再需要标签或对齐信息，但 PyTorch-Kaldi 仍然需要以下文件才能转录新的音频文件：\n1. 特征及其特征列表 *feats.scp*（包含 .ark 文件，详见上文“如何接入自定义特征”）\n2. 解码图（通常在之前的模型训练中通过 mkgraph.sh 脚本生成，例如三音素模型的解码图）。如果您不进行解码，则无需此图。\n\n准备好所有这些文件后，您可以开始将您的数据集部分添加到全局配置文件中。最简单的方法是复制用于训练声学模型的 *cfg* 文件，并通过添加一个新的 *[dataset]* 部分来修改它：\n```\n[dataset4]\ndata_name = myWavFile\nfea = fea_name=fbank\n  fea_lst=myWavFilePath\u002Fdata\u002Ffeats.scp\n  fea_opts=apply-cmvn --utt2spk=ark:myWavFilePath\u002Fdata\u002F\u002Futt2spk  ark:myWavFilePath\u002Fcmvn_test.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |\n  cw_left=5\n  cw_right=5\n\nlab = lab_name=none\n  lab_data_folder=myWavFilePath\u002Fdata\u002F\n  lab_graph=myWavFilePath\u002Fexp\u002Ftri3\u002Fgraph\nn_chunks=1\n\n[data_use]\ntrain_with = TIMIT_tr\nvalid_with = TIMIT_dev\nforward_with = myWavFile\n```\n\n用于音频文件转录的关键设置是 *lab_name=none*。标签值为 *none* 表示 PyTorch-Kaldi 将进入仅执行前向传播和解码的“生产”模式，而不使用任何标签。您不需要在生产服务器上放置 TIMIT_tr 和 TIMIT_dev 数据集，因为 PyTorch-Kaldi 会跳过这些信息，直接进入 *forward_with* 字段中指定数据集的前向阶段。如您所见，*fea* 字段需要与标准训练或测试数据集完全相同的参数，而 *lab* 字段只需两个参数即可。请注意，*lab_data_folder* 实际上就是 *fea_lst* 所指向的路径。最后，您仍需指定要将该文件分割成多少个块来处理（约 1 小时音频为 1 个块）。\u003Cbr \u002F>\n**警告** \u003Cbr \u002F>\n在您的标准 .cfg 文件中，可能使用了诸如 *N_out_lab_cd* 等关键字，但在生产场景中已无法再使用。这是因为，在生产环境中，您并不希望将训练数据保留在本地机器上，因此 .cfg 文件中所有这类变量都需要替换为其实际值。对于所有 *N_out_{mono,lab_cd}* 的替换，您可以查看以下命令的输出：\n```\nhmm-info \u002Fpath\u002Fto\u002Fthe\u002Ffinal.mdl\u002Fused\u002Fto\u002Fgenerate\u002Fthe\u002Ftraining\u002Fali\n```\n\n然后，如果您按照 .cfg 文件中“forward”部分的设置对后验概率进行归一化：\n```\nnormalize_posteriors = True\nnormalize_with_counts_from = lab_cd\n```\n则需要将其替换为：\n```\nnormalize_posteriors = True\nnormalize_with_counts_from = \u002Fpath\u002Fto\u002Fali_train_pdf.counts\n```\n\n这一步归一化对于 HMM-DNN 语音识别至关重要：DNN 输出的是后验概率，而 HMM 是基于似然的概率生成模型，为了得到所需的似然值，只需将后验概率除以先验概率即可。要创建这个 *ali_train_pdf.counts* 文件，您可以按照以下步骤操作：\n```\nalidir=\u002Fpath\u002Fto\u002Fthe\u002Fexp\u002Ftri_ali (请替换为您存放 ali 文件的路径)\nnum_pdf=$(hmm-info $alidir\u002Ffinal.mdl | awk '\u002Fpdfs\u002F{print $4}')\nlabels_tr_pdf=\"ark:ali-to-pdf $alidir\u002Ffinal.mdl \\\"ark:gunzip -c $alidir\u002Fali.*.gz |\\\" ark:- |\"\nanalyze-counts --verbose=1 --binary=false --counts-dim=$num_pdf \"$labels_tr_pdf\" ali_train_pdf.counts\n```
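\n\n下面用一小段 Python 说明这一步归一化的含义（仅为示意：伪似然 = 后验 ÷ 先验，通常在对数域进行；其中的计数为伪造示例，真实先验来自上面生成的 ali_train_pdf.counts 文件）：\n```\nimport numpy as np\n\n# 伪造的三个状态的计数；真实数值来自 analyze-counts 输出的计数文件\ncounts = np.array([9000.0, 500.0, 500.0])\npriors = counts \u002F counts.sum()                  # 先验概率 p(s)\n\nlog_post = np.log(np.array([[0.2, 0.5, 0.3]]))  # 伪造一帧 DNN 后验 p(s|x)\nscaled_loglike = log_post - np.log(priors)      # log p(x|s) ∝ log p(s|x) - log p(s)\nprint(scaled_loglike)\n```\n解码器随后直接使用这组缩放后的对数似然，这与 *normalize_posteriors=True* 时所做归一化的含义一致。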
\n\n这样就完成了！在生产场景中，您可能需要转录大量的音频文件，而不希望为每个文件都单独创建一个 .cfg 文件。为此，在创建好初始的生产用 .cfg 文件后（路径字段可以留空），您可以调用 run_exp.py 脚本，并传入针对不同 .wav 特征的具体参数：\n```\npython run_exp.py cfg\u002FTIMIT_baselines\u002FTIMIT_MLP_fbank_prod.cfg --dataset4,fea,0,fea_lst=\"myWavFilePath\u002Fdata\u002Ffeats.scp\" --dataset4,lab,0,lab_data_folder=\"myWavFilePath\u002Fdata\u002F\" --dataset4,lab,0,lab_graph=\"myWavFilePath\u002Fexp\u002Ftri3\u002Fgraph\u002F\"\n```\n\n该命令会在内部根据您指定的路径修改配置文件，并对您指定的特征运行实验。需要注意的是，向 run_exp.py 脚本传递长参数时，需要使用特定的格式：*--dataset4* 指定所创建数据集的名称，*fea* 是更高级别的字段名，而 *fea_lst* 或 *lab_graph* 则是您想要更改的最低级别字段名。数字 *0* 用于指示要修改的最低级别字段的具体位置，因为某些配置文件可能为每个数据集包含多个 *lab_graph*：*0* 表示第一个出现的实例，*1* 表示第二个，以此类推。路径必须用双引号括起来，才能被正确解析为完整字符串。此外，如果您不希望不同 .wav 文件的转录结果相互覆盖（解码文件会根据 *data_name* 字段进行存储），则需要同时修改 *data_name* 和 *forward_with* 字段：``` --dataset4,data_name=MyNewName --data_use,forward_with=MyNewName ```。\n\n## 批量大小、学习率和丢弃率调度\n为了给用户提供更大的灵活性，PyTorch-Kaldi 的最新版本支持对批量大小、max_seq_length_train、学习率和丢弃因子进行调度，也就是说，这些参数可以在训练过程中动态调整。为此，我们在配置文件中引入了以下语法：\n```\nbatch_size_train = 128*12 | 64*10 | 32*2\n```\n在这种情况下，前 12 个 epoch 的批量大小为 128，接下来的 10 个 epoch 为 64，最后 2 个 epoch 则为 32。其中“*”表示将该取值重复指定的 epoch 数，“|”用于分隔批量大小发生变化的各个阶段。需要注意的是，如果用户仅设置 `batch_size_train = 128`，则批量大小将在整个训练过程中保持不变。\n\n类似地，也可以通过以下语法实现学习率调度：\n```\narch_lr = 0.08*10|0.04*5|0.02*3|0.01*2|0.005*2|0.0025*2\n```\n如果用户仅设置 `arch_lr = 0.08`，学习率将按照工具包旧版本中使用的 *new-bob* 算法进行退火：从指定的学习率开始，每当验证集上的性能提升低于 `arch_improvement_threshold` 字段中设定的阈值时，就将学习率减半。\n\n此外，丢弃因子也可以在训练过程中按如下方式调整：\n```\ndnn_drop = 0.15*12|0.20*12,0.15,0.15*10|0.20*14,0.15,0.0\n```\n通过上述语法，我们可以为不同层和不同 epoch 设置不同的丢弃率。例如，第一隐藏层在前 12 个 epoch 的丢弃率为 0.15，随后的 12 个 epoch 则为 0.20；而第二隐藏层在整个训练过程中丢弃率始终保持为 0.15。所有层都采用相同的语法，其中“|”表示同一层内丢弃率的变化，“,”则用于分隔不同的层。\n\n您可以通过以下配置文件查看批量大小、学习率和丢弃因子如何变化的例子：\n```\ncfg\u002FTIMIT_baselines\u002FTIMIT_mfcc_basic_flex.cfg\n```\n或者：\n```\ncfg\u002FTIMIT_baselines\u002FTIMIT_liGRU_fmllr_lr_schedule.cfg\n```
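\n\n为帮助理解上述调度语法，这里给出一个示意解析器（假设语法恰如上文所述：“*”表示重复指定的 epoch 数，“|”分隔阶段，逐层的丢弃率则先按“,”切分后再逐层解析；实际行为以工具包源码为准）：\n```\n# 将 '0.08*10|0.04*5' 之类的调度串展开为逐 epoch 的取值列表\ndef expand_schedule(spec, n_epochs):\n    # 纯数值：整个训练过程保持不变（学习率场景下工具包会改用 new-bob 退火）\n    if '*' not in spec and '|' not in spec:\n        return [float(spec)] * n_epochs\n    values = []\n    for stage in spec.split('|'):\n        value, times = stage.split('*')\n        values += [float(value)] * int(times)\n    return values\n\nprint(expand_schedule('128*12 | 64*10 | 32*2', 24))  # 12 个 128、10 个 64、2 个 32\nprint(expand_schedule('0.08', 4))                    # [0.08, 0.08, 0.08, 0.08]\n```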
\n\n\n\n\n## 我如何参与该项目\n目前项目仍处于初期阶段，我们诚挚邀请所有潜在贡献者加入。我们希望构建一个规模足够大的开发者社区，以逐步维护、改进并扩展当前工具包的功能。例如，报告任何 bug 或提出改进建议都将非常有帮助。此外，大家还可以通过添加新的神经网络模型来丰富现有架构库。\n\n\n\n\n## [额外]\n## 使用 SincNet 从原始波形进行语音识别\n\n[观看我们的 SincNet 介绍视频](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=mXQBObRGUgk&feature=youtu.be)\n\nSincNet 是一种最近提出的用于处理原始音频波形的卷积神经网络。其核心思想是利用参数化的 sinc 函数，促使第一层自动发现更具语义意义的滤波器。传统 CNN 需要学习每个滤波器的全部参数，而 SincNet 只需直接从数据中学习每个带通滤波器的低、高两个截止频率。这种归纳偏置提供了一种非常紧凑的方式来构建自定义的滤波器组前端，该前端仅依赖于几个具有明确物理意义的参数。\n\n有关 SincNet 模型的更详细说明，请参阅以下论文：\n\n- *M. Ravanelli, Y. Bengio, “使用 SincNet 从原始波形进行说话人识别”，SLT 2018 会议论文 [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1808.00158)*\n\n- *M. Ravanelli, Y. Bengio, “具有可解释性的卷积滤波器：SincNet”，NIPS@IRASL 2018 会议论文 [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.09725)*\n\n要在 TIMIT 数据集上使用此模型进行语音识别，请按照以下步骤操作：\n1. 按照“TIMIT 教程”中的步骤进行操作。\n2. 将原始波形保存为 Kaldi ark 格式。可以使用我们仓库中的 `save_raw_fea.py` 工具脚本完成此操作。该脚本会将输入信号保存为二进制 Kaldi 归档文件，并保留与预计算标签的对齐信息。请对所有数据块（如 train、dev、test）分别运行此脚本。您还可以通过 `sig_wlen=200 # ms` 参数指定每帧的语音片段长度。\n3. 打开 `cfg\u002FTIMIT_baselines\u002FTIMIT_SincNet_raw.cfg`，修改路径后运行：\n```    \npython .\u002Frun_exp.py cfg\u002FTIMIT_baselines\u002FTIMIT_SincNet_raw.cfg\n```    \n\n4. 使用该架构，我们获得了 **PER(%)=17.2%** 的结果，而使用相同原始波形输入的标准 CNN 则得到 **PER(%)=18.1%**。更多结果请参阅 [此处](https:\u002F\u002Fbitbucket.org\u002Fmravanelli\u002Fpytorch-kaldi-exp-timit\u002Fsrc\u002Fmaster\u002F)。我们的 SincNet 结果也优于使用 MFCC 和 FBANK 特征输入的标准前馈网络所取得的效果。\n\n下表比较了 SincNet 与其他前馈神经网络的结果：\n\n| 模型  | PER(\%) | \n| ------ | -----|\n|  MLP -fbank  | 18.7 | \n|  MLP -mfcc  | 18.2 | \n|  CNN -raw  | 18.1 | \n|SincNet -raw | **17.2**  |
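\n\n在进入下一节之前，用几行 PyTorch 展示 SincNet 滤波器“只学两个截止频率”的参数化思想（概念性示意：省略了窗函数、滤波器归一化与可学习参数注册等细节，完整实现请参考上面引用的论文及其官方代码）：\n```\nimport math\nimport torch\n\ndef sinc(x):\n    # 数值安全的 sinc：sin(x)\u002Fx，并在 x=0 处取 1\n    return torch.where(x == 0, torch.ones_like(x), torch.sin(x) \u002F x)\n\ndef bandpass_kernel(f1, f2, kernel_size=251, fs=16000.0):\n    # 由低\u002F高截止频率 f1、f2（单位 Hz）构造时域带通滤波器：两个低通之差\n    n = torch.arange(-(kernel_size \u002F\u002F 2), kernel_size \u002F\u002F 2 + 1, dtype=torch.float32)\n    low = 2 * f1 \u002F fs * sinc(2 * math.pi * f1 \u002F fs * n)\n    high = 2 * f2 \u002F fs * sinc(2 * math.pi * f2 \u002F fs * n)\n    return high - low\n\nh = bandpass_kernel(torch.tensor(300.0), torch.tensor(3000.0))\nprint(h.shape)  # torch.Size([251])，可直接用作 conv1d 的卷积核\n```\n在 SincNet 中，f1 与 f2 就是第一层卷积唯一需要学习的参数，这正是上文所说“紧凑且物理意义明确”的来源。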
\n\n## 语音增强与自动语音识别的联合训练\n在本节中，我们展示了如何使用 PyTorch-Kaldi 联合训练一个由语音增强网络和语音识别网络组成的级联系统。语音增强的目标是通过最小化干净特征与含噪特征之间的均方误差（MSE），来提升语音信号的质量。增强后的特征随后输入到另一个神经网络中，用于预测上下文相关的音素状态。\n\n接下来，我们将报告一个基于 TIMIT 混响版本的玩具任务示例，该示例仅用于展示用户应如何编写配置文件以训练此类神经网络组合。尽管一些实现细节（以及所采用的数据集）有所不同，但本教程受到以下论文的启发：\n\n- *M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, “基于批归一化的 DNN 远场语音识别联合训练”，SLT 2016 会议论文 [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1703.08471)*\n\n\n要运行该系统，请按照以下步骤操作：\n\n1. 确保您已准备好标准的 TIMIT 干净版本数据。\n\n2. 运行 Kaldi s5 基线流程处理 TIMIT 数据。此步骤对于计算干净特征（将作为语音增强系统的标签）以及对齐信息（将作为语音识别系统的标签）至关重要。建议完整运行 TIMIT s5 流程（包括 DNN 训练）。\n\n3. 标准的 TIMIT 流程使用 MFCC 特征，而在本教程中，我们改用 FBANK 特征。要在 *$KALDI_ROOT\u002Fegs\u002FTIMIT\u002Fs5* 目录下计算 FBANK 特征，请运行以下脚本：\n```    \nfeadir=fbank\n\nfor x in train dev test; do\n  steps\u002Fmake_fbank.sh --cmd \"$train_cmd\" --nj $feats_nj data\u002F$x exp\u002Fmake_fbank\u002F$x $feadir\n  steps\u002Fcompute_cmvn_stats.sh data\u002F$x exp\u002Fmake_fbank\u002F$x $feadir\ndone\n```    \n请注意，我们这里使用 40 维 FBANK 特征，而 Kaldi 默认使用 23 维。若需计算 40 维特征，请打开 *$KALDI_ROOT\u002Fegs\u002FTIMIT\u002Fconf\u002Ffbank.conf* 文件，并修改输出滤波器的数量。\n\n\n4. 访问[此外部仓库](https:\u002F\u002Fgithub.com\u002Fmravanelli\u002FpySpeechRev\u002Fblob\u002Fmaster\u002FREADME.md)，按照说明从 TIMIT 干净版本生成混响版本。请注意，这只是一个“玩具任务”，仅用于演示如何搭建联合训练系统。\n\n5. 为 TIMIT_rev 数据集计算 FBANK 特征。为此，您可以将 *$KALDI_ROOT\u002Fegs\u002FTIMIT\u002F* 中的脚本复制到 *$KALDI_ROOT\u002Fegs\u002FTIMIT_rev\u002F* 目录下，同时复制 data 文件夹。需要注意的是，TIMIT_rev 文件夹中的音频文件采用标准的 WAV 格式，而原始 TIMIT 数据则使用 SPHERE 格式。为解决这一问题，请打开 *data\u002Ftrain\u002Fwav.scp*、*data\u002Fdev\u002Fwav.scp* 和 *data\u002Ftest\u002Fwav.scp* 文件，删除其中读取 SPHERE 格式的部分（例如 *\u002Fhome\u002Fmirco\u002Fkaldi-trunk\u002Ftools\u002Fsph2pipe_v2.5\u002Fsph2pipe -f wav*），并将路径从标准 TIMIT 修改为混响版本（例如将 \u002FTIMIT\u002F 替换为 \u002FTIMIT_rev\u002F）。请务必移除末尾的管道符号“|”。保存更改后，按如下方式运行 FBANK 特征计算：\n``` \nfeadir=fbank\n\nfor x in train dev test; do\n  steps\u002Fmake_fbank.sh --cmd \"$train_cmd\" --nj $feats_nj data\u002F$x exp\u002Fmake_fbank\u002F$x $feadir\n  steps\u002Fcompute_cmvn_stats.sh data\u002F$x exp\u002Fmake_fbank\u002F$x $feadir\ndone\n```\n同时记得修改 *$KALDI_ROOT\u002Fegs\u002FTIMIT_rev\u002Fconf\u002Ffbank.conf* 文件，以计算 40 维特征，而非默认配置中的 23 维 FBANK 特征。\n\n6. 特征计算完成后，打开以下配置文件：\n``` \ncfg\u002FTIMIT_baselines\u002FTIMIT_rev\u002FTIMIT_joint_training_liGRU_fbank.cfg\n``` \n\n请根据您本地数据的存储位置调整路径。如您所见，我们考虑了两类特征：*fbank_rev* 特征来自 TIMIT_rev 数据集，而 *fbank_clean* 特征则来自标准 TIMIT 数据集，用作语音增强网络的目标。在配置文件的 *[model]* 部分，可以看到语音增强与语音识别网络的级联结构。语音识别架构同时估计上下文相关目标和单音素目标（即采用所谓的单音素正则化）。\n要运行实验，请输入以下命令：\n``` \npython run_exp.py  cfg\u002FTIMIT_baselines\u002FTIMIT_rev\u002FTIMIT_joint_training_liGRU_fbank.cfg\n``` \n\n7. 结果\n使用此配置文件，您应获得 **音素错误率 (PER) = 28.1%**。请注意，围绕该性能的小幅波动是正常现象，主要源于神经网络参数初始化的不同。\n\n您可在此处进一步查看我们的实验结果：[链接](https:\u002F\u002Fbitbucket.org\u002Fmravanelli\u002Fpytorch-kaldi-exp-timit\u002Fsrc\u002Fmaster\u002FTIMIT_rev\u002FTIMIT_joint_training_liGRU_fbank\u002F)\n\n## 基于 DIRHA 的远场语音识别\n在本教程中，我们使用 DIRHA-English 数据集进行远场语音识别实验。DIRHA-English 数据集是由欧盟 DIRHA 项目开发的**多麦克风语音语料库**，由在家庭环境中使用 32 个采样同步的麦克风录制的真实和模拟序列组成。数据库包含具有不同噪声和混响特性的信号，因此适用于各种多麦克风信号处理和远场语音识别任务。目前发布的数据集部分由 6 位以美式英语为母语的说话者（3 名男性，3 名女性）朗读的 409 句《华尔街日报》句子构成。训练数据是通过一种拟真的污染方法生成的：利用在目标环境中测量的高质量多麦克风脉冲响应，对干净的 WSJ-5k 语料中的句子进行污染。有关该数据集的更多详细信息，请参阅以下论文：\n\n- *M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, M. Omologo, “DIRHA-English 语料库及用于家庭环境远场语音识别的相关任务”，ASRU 2015 会议论文集。[ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.02560)*\n\n- *M. Ravanelli, P. Svaizer, M. Omologo, “用于远场语音识别的真实多麦克风数据仿真”，Interspeech 2016 会议论文集。[ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1711.09470)*\n\n在本教程中，我们使用上述模拟数据进行训练（采用 LA6 麦克风），而测试则使用真实录音（同样为 LA6）。这项任务非常贴近实际场景，但也极具挑战性：语音信号的混响时间约为 0.7 秒，真实录音中还存在非平稳的家庭噪声（如吸尘器声、脚步声、电话铃声等）。\n\n现在让我们开始实践教程。\n\n1. 如果尚未拥有，可从 LDC 网站[下载 DIRHA 数据集](https:\u002F\u002Fcatalog.ldc.upenn.edu\u002FLDC2018S01)。LDC 以少量费用提供完整数据集。\n\n2. [访问此外部仓库](https:\u002F\u002Fgithub.com\u002FSHINE-FBK\u002FDIRHA_English_wsj)。根据该仓库的说明，您需要使用提供的 MATLAB 脚本生成污染后的 WSJ 数据集。随后，您可以运行其中建议的 Kaldi 基线，以便为我们的 PyTorch-Kaldi 工具包准备好特征和标签。\n\n3. 打开以下配置文件：\n``` \ncfg\u002FDIRHA_baselines\u002FDIRHA_liGRU_fmllr.cfg\n``` \n该配置文件实现了一个基于轻量门控循环单元（Li-GRU）的简单 RNN 模型，输入特征为 fMLLR。请更改路径并运行以下命令：\n\n``` \npython run_exp.py cfg\u002FDIRHA_baselines\u002FDIRHA_liGRU_fmllr.cfg\n``` \n\n4. 结果：\n上述系统应给出**词错误率（WER）= 23.2%**。您可以在[这里](https:\u002F\u002Fbitbucket.org\u002Fmravanelli\u002Fpytorch-kaldi-dirha-exp\u002F)查看我们获得的结果。\n\n通过使用 *cfg\u002FDIRHA_baselines* 文件夹中的其他配置文件，您可以尝试不同的设置，并得到以下结果：\n\n| 模型  | WER(\%) | \n| ------ | -----|\n|  MLP  | 26.1 | \n|GRU| 25.3 | \n|Li-GRU| **23.8**  | \n\n## 训练自编码器\n当前版本的仓库主要用于语音识别实验。我们正在积极开发一个更加灵活的新版本，能够处理不同于 Kaldi 特征\u002F标签的输入输出。不过，即使在当前版本中，也可以实现其他系统，例如自编码器。\n\n自编码器是一种输入和输出相同的神经网络，其中间层通常包含一个瓶颈，迫使网络学习输入信息的压缩表示。在本教程中，我们提供了一个基于 TIMIT 数据集的示例，请查看以下配置文件：\n``` \ncfg\u002FTIMIT_baselines\u002FTIMIT_MLP_fbank_autoencoder.cfg\n``` \n我们的输入是标准的 40 维 fbank 系数，采用 11 帧的上下文窗口提取（即输入的总维度为 440）。一个前馈神经网络（称为 MLP_encoder）将特征编码为 100 维表示。解码器（称为 MLP_decoder）接收学习到的表示，并尝试重建输入。该系统使用**均方误差（MSE）**指标进行训练。\n请注意，在 *[model]* 部分末尾添加了这一行：“err_final=cost_err(dec_out,lab_cd)”。这是因为当前版本的工具包要求至少指定一个标签（我们将在下一版本中取消这一限制）。\n\n您可以通过运行以下命令来训练该系统：\n``` \npython run_exp.py cfg\u002FTIMIT_baselines\u002FTIMIT_MLP_fbank_autoencoder.cfg\n``` \n结果应如下所示：\n\n``` \nep=000 tr=['TIMIT_tr'] loss=0.139 err=0.999 valid=TIMIT_dev loss=0.076 err=1.000 lr_architecture1=0.080000 lr_architecture2=0.080000 time(s)=41\nep=001 tr=['TIMIT_tr'] loss=0.098 err=0.999 valid=TIMIT_dev loss=0.062 err=1.000 lr_architecture1=0.080000 lr_architecture2=0.080000 time(s)=39\nep=002 tr=['TIMIT_tr'] loss=0.091 err=0.999 valid=TIMIT_dev loss=0.058 err=1.000 lr_architecture1=0.040000 lr_architecture2=0.040000 time(s)=39\nep=003 tr=['TIMIT_tr'] loss=0.088 err=0.999 valid=TIMIT_dev loss=0.056 err=1.000 lr_architecture1=0.020000 lr_architecture2=0.020000 time(s)=38\nep=004 tr=['TIMIT_tr'] loss=0.087 err=0.999 valid=TIMIT_dev loss=0.055 err=0.999 lr_architecture1=0.010000 lr_architecture2=0.010000 time(s)=39\nep=005 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.005000 lr_architecture2=0.005000 time(s)=39\nep=006 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.002500 lr_architecture2=0.002500 time(s)=39\nep=007 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.001250 lr_architecture2=0.001250 time(s)=39\nep=008 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=0.999 lr_architecture1=0.000625 lr_architecture2=0.000625 time(s)=41\nep=009 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=0.999 lr_architecture1=0.000313 lr_architecture2=0.000313 time(s)=38\n```\n您只需关注“loss=”字段；由于上述原因，“err=”字段在此情况下并无意义。您可以通过以下命令查看生成的特征：\n\n``` \ncopy-feats ark:exp\u002FTIMIT_MLP_fbank_autoencoder\u002Fexp_files\u002Fforward_TIMIT_test_ep009_ck00_enc_out.ark  ark,t:- | more\n```
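\n\n作为对照，下面用几行 PyTorch 勾勒该教程所描述系统的骨架（440 维输入 → 100 维瓶颈 → 440 维重建，MSE 训练；仅为概念示意，实际实验请直接使用上述 cfg 文件）：\n```\nimport torch\nimport torch.nn as nn\n\nencoder = nn.Sequential(nn.Linear(440, 100), nn.ReLU())  # 对应 MLP_encoder：压缩到 100 维\ndecoder = nn.Linear(100, 440)                            # 对应 MLP_decoder：重建 440 维输入\nmse = nn.MSELoss()\n\nx = torch.randn(32, 440)            # 伪造一批 11 帧上下文窗口的 fbank 特征\nloss = mse(decoder(encoder(x)), x)  # 输入即目标：自编码的 MSE 成本\nloss.backward()\nprint(loss.item())\n```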
\n\n## 参考文献\n[1] M. Ravanelli, T. Parcollet, Y. Bengio，《PyTorch-Kaldi 语音识别工具包》，[ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.07453)\n\n[2] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio，《通过改进门控循环单元提升语音识别性能》，载于 Interspeech 2017 年会议论文集。[ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.00641)\n\n[3] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio，《用于语音识别的轻量级门控循环单元》，载于 IEEE 计算智能新兴领域汇刊。[ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.10225)\n\n[4] M. Ravanelli，《远场语音识别中的深度学习》，博士论文，特伦托大学，2017 年。[ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.06086)\n\n[5] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, C. Trabelsi, R. De Mori, Y. Bengio，《四元数循环神经网络》，载于 ICLR 2019 年会议论文集。[ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1806.04418)\n\n[6] T. Parcollet, M. Morchid, G. Linarès, R. De Mori，《面向语音识别的双向四元数长短时记忆循环神经网络》，载于 ICASSP 2019 年会议论文集。[ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.02566)","# PyTorch-Kaldi 快速上手指南\n\nPyTorch-Kaldi 是一个开源语音识别工具包，旨在结合 **Kaldi** 的高效特征提取与解码能力，以及 **PyTorch** 的灵活深度学习建模能力。它允许用户使用 PyTorch 定义声学模型，同时利用 Kaldi 进行前端处理和后端解码。\n\n> **注意**：本项目维护者强烈建议新用户迁移至新一代项目 [SpeechBrain](https:\u002F\u002Fspeechbrain.github.io\u002F)，该项目功能更全面且更易于使用。本指南仅针对仍需使用 PyTorch-Kaldi 的场景。\n\n## 环境准备\n\n在开始之前，请确保满足以下系统要求和依赖：\n\n### 1. 系统要求\n*   **操作系统**: Linux (推荐 Ubuntu\u002FCentOS)\n*   **硬件**: 推荐使用配备 NVIDIA GPU 的机器以加速训练。\n*   **Python**: 支持 Python 2.7 或 Python 3.7+ (建议使用 Anaconda 管理环境)。\n\n### 2. 前置依赖\n你需要手动安装以下两个核心工具包：\n\n#### A. 安装 Kaldi\nKaldi 用于特征提取、标签计算和解码。\n1.  从官方源码编译安装 Kaldi (参考 [kaldi-asr.org](http:\u002F\u002Fkaldi-asr.org))。\n2.  **关键步骤**：安装完成后，必须将 Kaldi 的二进制文件路径添加到环境变量中。编辑 `~\u002F.bashrc` 文件，添加以下内容（请将 `\u002Fhome\u002Fyour_user\u002Fkaldi-trunk` 替换为你实际的安装路径）：\n\n```bash\nexport KALDI_ROOT=\u002Fhome\u002Fyour_user\u002Fkaldi-trunk\nPATH=$PATH:$KALDI_ROOT\u002Ftools\u002Fopenfst\nPATH=$PATH:$KALDI_ROOT\u002Fsrc\u002Ffeatbin\nPATH=$PATH:$KALDI_ROOT\u002Fsrc\u002Fgmmbin\nPATH=$PATH:$KALDI_ROOT\u002Fsrc\u002Fbin\nPATH=$PATH:$KALDI_ROOT\u002Fsrc\u002Fnnetbin\nexport PATH\n```\n保存后执行 `source ~\u002F.bashrc` 使配置生效。\n*验证安装*：在终端输入 `copy-feats` 或 `hmm-info`，若无报错则说明安装成功。\n\n#### B. 安装 PyTorch 和 CUDA\n1.  确保已安装与你的显卡驱动匹配的 **CUDA** toolkit (测试版本包括 8.0, 9.0, 9.1)。\n2.  安装 PyTorch (测试版本包括 0.4 和 1.0+)。\n    *   **国内加速建议**：推荐使用清华源或阿里源安装 PyTorch。\n    *   示例命令 (使用 pip)：\n    ```bash\n    pip install torch torchvision --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu90\n    # 或者使用国内镜像源\n    pip install torch torchvision -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n*验证安装*：在终端输入 `python`，然后执行 `import torch`，若无报错则说明安装成功。
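\n也可以用下面几行 Python 进一步确认 GPU 能被 PyTorch 识别（输出内容因环境而异）：\n```python\nimport torch\n\nprint(torch.__version__)             # 应为 0.4 或 1.0 系列\nprint(torch.cuda.is_available())     # True 表示 CUDA 可用\nif torch.cuda.is_available():\n    print(torch.cuda.get_device_name(0))  # 第一块 GPU 的型号\n```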
\n\n---\n\n## 安装步骤\n\n完成上述前置依赖后，按以下步骤安装 PyTorch-Kaldi：\n\n1.  **克隆仓库**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Fmravanelli\u002Fpytorch-kaldi\n    ```\n\n2.  **进入目录并安装 Python 依赖**\n    进入项目文件夹，并使用 pip 安装所需的额外 Python 包（如 `kaldi-io` 等）。\n    ```bash\n    cd pytorch-kaldi\n    pip install -r requirements.txt\n    ```\n    *(注：若下载速度慢，可添加 `-i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple` 参数)*\n\n---\n\n## 基本使用示例 (基于 TIMIT 数据集)\n\n以下是最小化的工作流程演示，展示如何运行一个标准的语音识别任务。本示例假设你已拥有 TIMIT 数据集。\n\n### 第一步：准备 Kaldi 数据与对齐\nPyTorch-Kaldi 依赖 Kaldi 生成的特征和对齐文件。首先运行 Kaldi 的标准脚本处理数据。\n\n```bash\n# 进入 Kaldi 的 TIMIT 示例目录 (假设你的 Kaldi 安装在 $KALDI_ROOT)\ncd $KALDI_ROOT\u002Fegs\u002Ftimit\u002Fs5\n\n# 运行基础流程 (提取特征等)\n.\u002Frun.sh\n\n# 运行 DNN 训练脚本 (生成必要的对齐文件)\n.\u002Flocal\u002Fnnet\u002Frun_dnn.sh\n```\n\n### 第二步：生成测试集与开发集的对齐文件\n如果上一步未自动生成测试集对齐，需手动执行（以 tri3 对齐为例）：\n\n```bash\n# 确保当前在 $KALDI_ROOT\u002Fegs\u002Ftimit\u002Fs5 目录下\nsteps\u002Falign_fmllr.sh --nj 4 data\u002Fdev data\u002Flang exp\u002Ftri3 exp\u002Ftri3_ali_dev\nsteps\u002Falign_fmllr.sh --nj 4 data\u002Ftest data\u002Flang exp\u002Ftri3 exp\u002Ftri3_ali_test\n```\n\n### 第三步：配置并运行 PyTorch 训练\n回到 `pytorch-kaldi` 根目录，修改配置文件以指向你的数据路径，然后启动训练。\n\n1.  **编辑配置文件**：\n    找到对应的实验配置文件（例如 `cfg\u002FTIMIT_baselines\u002FTIMIT_MLP_mfcc_basic.cfg`），用文本编辑器打开，确保 `fea_lst`、`lab_folder` 等字段指向正确的 Kaldi 输出路径。\n\n2.  **启动训练**：\n    运行主脚本 `run_exp.py`，指定配置文件路径。\n\n    ```bash\n    # 在项目根目录执行\n    python run_exp.py cfg\u002FTIMIT_baselines\u002FTIMIT_MLP_mfcc_basic.cfg\n    ```\n\n    *程序将自动读取配置，加载数据，并开始使用 PyTorch 训练 MLP 模型。训练过程中会逐 epoch 打印损失值与错误率。*\n\n### 第四步：解码与评估\n训练完成后，`run_exp.py` 会根据配置文件中的 [forward] 与 [decoding] 部分自动完成前向传播，并调用 Kaldi 解码脚本（如 `kaldi_decoding_scripts\u002Fdecode_dnn.sh`）生成最终的词错误率 (WER) 或音素错误率 (PER)，一般无需手动调用解码器。解码与评分结果会写入配置文件中 out_folder 所指定的输出目录，可直接查看：\n\n```bash\n# 查看输出目录中的实验产物 (具体路径取决于配置中的 out_folder)\nls exp\u002FTIMIT_MLP_basic\u002F\n```\n\n通过以上步骤，你即可完成从环境搭建到模型训练的基本闭环。更多复杂模型（如 CNN, LSTM, SincNet）的使用只需更换相应的 `.cfg` 配置文件即可。","某语音识别初创团队正致力于为医疗行业开发高精度的医生问诊录音转写系统，需要在快速迭代深度学习模型的同时，确保解码流程符合工业级标准。\n\n### 没有 pytorch-kaldi 时\n- **框架割裂严重**：研究人员习惯用 PyTorch 构建灵活的 DNN\u002FRNN 模型，但工程团队依赖 Kaldi 进行特征提取和解码，两边代码无法直接互通，需编写大量繁琐的中间接口脚本。\n- **实验复现困难**：每次调整网络结构（如从 CNN 改为 RNN），都需要手动重新对接数据流，导致实验周期长达数周，且容易因格式转换错误引入难以排查的 Bug。\n- **性能瓶颈明显**：由于缺乏统一的混合架构支持，难以利用 Kaldi 成熟的 HMM-GMM 对齐优势来辅助深度神经网络训练，最终模型的词错率（WER）始终无法突破行业基准。\n\n### 使用 pytorch-kaldi 后\n- **无缝混合架构**：pytorch-kaldi 原生实现了“PyTorch 负责模型训练 + Kaldi 负责特征与解码”的分工，团队可直接将最新的 PyTorch 模型嵌入成熟的 Kaldi 工作流中，无需重复造轮子。\n- **研发效率倍增**：通过简单的配置文件即可切换不同的神经网络结构，原本需要两周的模型适配工作缩短至几天，让算法工程师能专注于优化网络本身而非数据管道。\n- **精度显著提升**：借助工具对 DNN\u002FHMM 混合系统的深度优化，团队成功融合了 Kaldi 强大的时序建模能力与 PyTorch 的灵活表达力，在医疗垂直数据集上将识别准确率提升了 15%。\n\npytorch-kaldi 通过打破两大顶级框架的壁垒，让研发团队能以最低成本构建出兼具灵活性与工业级精度的语音识别系统。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmravanelli_pytorch-kaldi_70e699cb.png","mravanelli","Mirco Ravanelli","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmravanelli_cb71639f.jpg","I'm an Assistant Professor at Concordia University and Mila Associate Member working on deep learning for Conversational AI","Concordia University\u002FMila","Montreal",null,"https:\u002F\u002Fsites.google.com\u002Fsite\u002Fmircoravanelli\u002F","https:\u002F\u002Fgithub.com\u002Fmravanelli",[83,87,91],{"name":84,"color":85,"percentage":86},"Python","#3572A5",44,{"name":88,"color":89,"percentage":90},"Perl","#0298c3",36.5,{"name":92,"color":93,"percentage":94},"Shell","#89e051",19.5,2398,445,"2026-04-10T03:33:44",4,"Linux","必需 NVIDIA GPU，具体型号未说明，支持 CUDA 8.0, 9.0, 9.1","未说明",{"notes":103,"python":104,"dependencies":105},"必须预先安装并配置 Kaldi 工具箱，需将 Kaldi 二进制文件路径添加到环境变量 (.bashrc) 中。虽然非强制，但建议使用 Anaconda 管理环境。该工具旨在本地或 HPC 集群上运行。作者强烈建议用户迁移到新的 SpeechBrain 项目。","2.7 或 
3.7",[106,107,108],"PyTorch (0.4 或 1.0)","Kaldi","CUDA libraries",[110,14],"音频",[112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128],"speech-recognition","gru","dnn","kaldi","rnn-model","pytorch","timit","deep-learning","deep-neural-networks","recurrent-neural-networks","multilayer-perceptron-network","lstm","lstm-neural-networks","speech","asr","rnn","dnn-hmm","2026-03-27T02:49:30.150509","2026-04-12T07:53:22.470010",[132,137,142,147,151,156],{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},30314,"运行实验时出现 'training epoch 0, chunk 0 not done' 或 CUDA 'invalid argument' 错误怎么办？","这通常是由于配置文件中的路径设置不正确或标签类别不匹配导致的。请检查并修改配置文件（.cfg）中的 fea_lst, lab_folder, lab_data_folder 和 lab_graph 字段，确保它们指向正确的数据路径。如果出现 CUDA ClassNLLCriterion 断言失败（t >= 0 && t \u003C n_classes），说明标签索引超出了类别范围，需确认对齐文件（alignments）与模型输出的类别数量一致。此外，RNN SincNet 模型对显存要求较高，建议减小 batch size 以避免显存溢出。","https:\u002F\u002Fgithub.com\u002Fmravanelli\u002Fpytorch-kaldi\u002Fissues\u002F35",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},30315,"测试集解码时为什么需要对齐文件（alignments）？如何避免测试集因未对齐导致的解码问题？","虽然开发集可用于验证，但某些解码流程可能依赖对齐信息。如果测试集的对齐不完整会导致部分语句解码失败（出现 [PARTIAL] 标签）。解决方法是拉取最新代码库，并阅读文档中新增的“如何解码我自己的音频文件”（How to Decode My Own Audio File）章节，该部分提供了不依赖完整对齐文件的解码方案。","https:\u002F\u002Fgithub.com\u002Fmravanelli\u002Fpytorch-kaldi\u002Fissues\u002F65",{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},30316,"运行 run_exp.py 时遇到 'ValueError: not enough values to unpack (expected 2, got 0)' 数据读取错误如何调试？","此类错误通常源于数据路径或格式问题。调试步骤如下：\n1. 在配置文件中找到 fea_lst 字段，选择一个 ark 文件。\n2. 手动运行命令：copy-feats ark:your_ark_file.ark ark,t:-，检查是否能正常输出数值。\n3. 如果成功，加入完整选项测试，例如：copy-feats ark:your_ark.ark ark:- | apply-cmvn --utt2spk=ark:data\u002Ftest_dev93\u002Futt2spk ark:data\u002Ftest_dev93\u002Fdata\u002Fcmvn_test_dev93.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark,t:-。\n4. 测试标签文件：gunzip -c your_lab_folder\u002Fali.*.gz | ali-to-pdf path\u002Fto\u002Ffinal.mdl ark:- ark,t:- | more。\n通过逐步执行这些命令可以定位是特征文件还是标签文件的问题。","https:\u002F\u002Fgithub.com\u002Fmravanelli\u002Fpytorch-kaldi\u002Fissues\u002F105",{"id":148,"question_zh":149,"answer_zh":150,"source_url":146},30317,"如何在多 GPU 环境下控制 PyTorch-Kaldi 使用的显卡数量？","程序默认会使用所有可见的 GPU。你可以使用 nvidia-smi 命令（加上 watch 前缀可实时刷新）查看 GPU 使用情况。若要限制使用的 GPU 数量，请设置系统环境变量 CUDA_VISIBLE_DEVICES。例如，只使用前两张卡可设置为 export CUDA_VISIBLE_DEVICES=0,1。",{"id":152,"question_zh":153,"answer_zh":154,"source_url":155},30318,"配置的切片数量（n_chunks）与实际运行的切片数量不一致（如配置 50 却运行到 51）是什么原因？","这种情况通常是因为数据分割逻辑在生成最后一个切片时，剩余数据不足以构成一个完整切片但仍被处理，或者索引计算方式导致多出一个切片。如果程序在运行完预期切片后正常结束（killed normally），通常不影响最终结果。若担心配置文件复用问题，可检查生成的临时配置文件（如 *_ck50.cfg），确认其特征路径是否正确，必要时清理旧的实验文件夹重新运行。","https:\u002F\u002Fgithub.com\u002Fmravanelli\u002Fpytorch-kaldi\u002Fissues\u002F109",{"id":157,"question_zh":158,"answer_zh":159,"source_url":136},30319,"是否支持将 SincNet 与循环神经网络（RNN）结合使用？效果如何？","是的，可以尝试将 SincNet 与 RNN 结合。但需要注意的是，RNN SincNet 模型对显存需求非常大。在实际实验中，可能需要显著减小 batch size 才能在有限的显存资源下运行。具体的实验结果取决于数据集和硬件配置，建议从小批量开始尝试。",[]]