[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-zzw922cn--Automatic_Speech_Recognition":3,"tool-zzw922cn--Automatic_Speech_Recognition":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",148568,2,"2026-04-09T23:34:24",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108111,"2026-04-08T11:23:26",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":78,"owner_email":79,"owner_twitter":74,"owner_website":74,"owner_url":80,"languages":81,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":98,"env_os":99,"env_gpu":100,"env_ram":99,"env_deps":101,"category_tags":108,"github_topics":111,"view_count":32,"oss_zip_url":74,"oss_zip_packed_at":74,"status":17,"created_at":130,"updated_at":131,"faqs":132,"releases":173},6077,"zzw922cn\u002FAutomatic_Speech_Recognition","Automatic_Speech_Recognition","End-to-end Automatic Speech Recognition for Madarian and English in Tensorflow ","Automatic_Speech_Recognition 是一个基于 TensorFlow 框架开发的端到端自动语音识别（ASR）开源项目。它致力于解决将曼达林语（普通话）和英语的音频信号直接转换为文本的核心难题，支持从音素、字符到序列到序列（seq2seq）等多种任务层级。\n\n该项目特别适合人工智能开发者、语音技术研究人员以及高校学生使用。对于希望深入理解 ASR 底层原理或需要构建自定义语音识别模型的技术人员来说，它提供了一套完整的训练、评估及数据预处理流程，并兼容 TIMIT、LibriSpeech 和 WSJ 等主流数据集。\n\n在技术亮点方面，Automatic_Speech_Recognition 不仅实现了经典的 DeepSpeech2 架构，还前瞻性地引入了胶囊网络（Capsule 
Network）和层归一化 RNN（Layer Normalization RNN）以提升模型效率与准确率。此外，项目内置了语言建模模块和简单的 n-gram 模型，支持动态 Dropout 和自动化评估机制。尽管其部分代码依赖较早期的 TensorFlow 版本（如 1.x），但它作为一份详","Automatic_Speech_Recognition 是一个基于 TensorFlow 框架开发的端到端自动语音识别（ASR）开源项目。它致力于解决将曼达林语（普通话）和英语的音频信号直接转换为文本的核心难题，支持从音素、字符到序列到序列（seq2seq）等多种任务层级。\n\n该项目特别适合人工智能开发者、语音技术研究人员以及高校学生使用。对于希望深入理解 ASR 底层原理或需要构建自定义语音识别模型的技术人员来说，它提供了一套完整的训练、评估及数据预处理流程，并兼容 TIMIT、LibriSpeech 和 WSJ 等主流数据集。\n\n在技术亮点方面，Automatic_Speech_Recognition 不仅实现了经典的 DeepSpeech2 架构，还前瞻性地引入了胶囊网络（Capsule Network）和层归一化 RNN（Layer Normalization RNN）以提升模型效率与准确率。此外，项目内置了语言建模模块和简单的 n-gram 模型，支持动态 Dropout 和自动化评估机制。尽管其部分代码依赖较早期的 TensorFlow 版本（如 1.x），但它作为一份详实的学术与实践参考资源，依然对探索语音识别演进历史和技术复现具有重要价值。","# Automatic-Speech-Recognition\nEnd-to-end automatic speech recognition system implemented in TensorFlow.\n\n## Recent Updates\n- [x] **Support TensorFlow r1.0** (2017-02-24)\n- [x] **Support dropout for dynamic rnn** (2017-03-11)\n- [x] **Support running in shell file** (2017-03-11)\n- [x] **Support evaluation every several training epoches automatically** (2017-03-11)\n- [x] **Fix bugs for character-level automatic speech recognition** (2017-03-14)\n- [x] **Improve some function apis for reusable** (2017-03-14)\n- [x] **Add scaling for data preprocessing** (2017-03-15)\n- [x] **Add reusable support for LibriSpeech training** (2017-03-15)\n- [x] **Add simple n-gram model for random generation or statistical use** (2017-03-23)\n- [x] **Improve some code for pre-processing and training** (2017-03-23)\n- [x] **Replace TABs with blanks and add nist2wav converter script** (2017-04-20)\n- [x] **Add some data preparation code** (2017-05-01)\n- [x] **Add WSJ corpus standard preprocessing by s5 recipe** (2017-05-05)\n- [x] **Restructuring of the project. 
Updated train.py for usage convenience** (2017-05-06)\n- [x] **Finish feature modules for TIMIT, Libri and WSJ; support training for LibriSpeech** (2017-05-14)\n- [x] **Remove some unnecessary code** (2017-07-22)\n- [x] **Add DeepSpeech2 implementation code** (2017-07-23)\n- [x] **Fix some bugs** (2017-08-06)\n- [x] **Add Layer Normalization RNN for efficiency** (2017-08-06)\n- [x] **Add Mandarin speech recognition support** (2017-08-06)\n- [x] **Add Capsule Network Model** (2017-12-12)\n- [x] **Release 1.0.0 version** (2017-12-14)\n- [x] **Add Language Modeling Module** (2017-12-25)\n- [x] **Will support TF1.12 soon** (2019-10-17)\n\n## Recommendation\nIf you want to replace the feed-dict operation with TensorFlow's multi-threaded FIFO-queue input pipeline, you can refer to my repo [TensorFlow-Input-Pipeline](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002FTensorFlow-Input-Pipeline) for more example code. My own experience shows that a FIFO-queue input pipeline can improve training speed in some cases.\n\nIf you want to look at the history of speech recognition, I have collected the significant papers in the ASR field since 1981. You can read the paper list in my repo [awesome-speech-recognition-papers](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002Fawesome-speech-recognition-papers); download links are provided for all papers. I will update it every week with new papers on speech recognition, speech synthesis and language modelling. 
I hope that we won't miss any important papers in speech domain.\n\nAll my public repos will be updated in future, thanks for your stars!\n\n## Install and Usage\nCurrently only python 3.5 is supported.\n\nThis project depends on scikit.audiolab, for which you need to have [libsndfile](http:\u002F\u002Fwww.mega-nerd.com\u002Flibsndfile\u002F) installed in your system.\nClone the repository to your preferred directory and install using:\n\u003Cpre>\nsudo pip3 install -r requirements.txt\nsudo python3 setup.py install\n\u003C\u002Fpre>\n\nTo use, simply run the following command:\n\u003Cpre>\npython main\u002Ftimit_train.py [-h] [--mode MODE] [--keep [KEEP]] [--nokeep]\n                      [--level LEVEL] [--model MODEL] [--rnncell RNNCELL]\n                      [--num_layer NUM_LAYER] [--activation ACTIVATION]\n                      [--optimizer OPTIMIZER] [--batch_size BATCH_SIZE]\n                      [--num_hidden NUM_HIDDEN] [--num_feature NUM_FEATURE]\n                      [--num_classes NUM_CLASSES] [--num_epochs NUM_EPOCHS]\n                      [--lr LR] [--dropout_prob DROPOUT_PROB]\n                      [--grad_clip GRAD_CLIP] [--datadir DATADIR]\n                      [--logdir LOGDIR]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --mode MODE           set whether to train or test\n  --keep [KEEP]         set whether to restore a model, when test mode, keep\n                        should be set to True\n  --nokeep\n  --level LEVEL         set the task level, phn, cha, or seq2seq, seq2seq will\n                        be supported soon\n  --model MODEL         set the model to use, DBiRNN, BiRNN, ResNet..\n  --rnncell RNNCELL     set the rnncell to use, rnn, gru, lstm...\n  --num_layer NUM_LAYER\n                        set the layers for rnn\n  --activation ACTIVATION\n                        set the activation to use, sigmoid, tanh, relu, elu...\n  --optimizer OPTIMIZER\n                        set the 
optimizer to use, sgd, adam...\n  --batch_size BATCH_SIZE\n                        set the batch size\n  --num_hidden NUM_HIDDEN\n                        set the hidden size of rnn cell\n  --num_feature NUM_FEATURE\n                        set the size of input feature\n  --num_classes NUM_CLASSES\n                        set the number of output classes\n  --num_epochs NUM_EPOCHS\n                        set the number of epochs\n  --lr LR               set the learning rate\n  --dropout_prob DROPOUT_PROB\n                        set probability of dropout\n  --grad_clip GRAD_CLIP\n                        set the threshold of gradient clipping\n  --datadir DATADIR     set the data root directory\n  --logdir LOGDIR       set the log directory\n\n\u003C\u002Fpre>\nInstead of configuration in command line, you can also set the arguments above in [timit\\_train.py](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002FAutomatic_Speech_Recognition\u002Fblob\u002Fmaster\u002Fmain\u002Ftimit_train.py) in practice.\n\nBesides, you can also run `main\u002Frun.sh` for both training and testing simultaneously! 
See [run\\_timit.sh](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002FAutomatic_Speech_Recognition\u002Fblob\u002Fmaster\u002Fmain\u002Frun_timit.sh) for details.\n\n## Performance\n### PER based dynamic BLSTM on TIMIT database, with casual tuning because time it limited\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fzzw922cn_Automatic_Speech_Recognition_readme_6b9956a35c30.png)\n\n### LibriSpeech recognition result without LM\n**Label**:\n\nit was about noon when captain waverley entered the straggling village or rather hamlet of tully veolan close to which was situated the mansion of the proprietor\n\n**Prediction**:\n\nit was about noon when captain wavraly entered the stragling bilagor of rather hamlent of tulevallon close to which wi situated the mantion of the propriater\n\n\n**Label**:\n\nthe english it is evident had they not been previously assured of receiving the king would never have parted with so considerable a sum and while they weakened themselves by the same measure have strengthened a people with whom they must afterwards have so material an interest to discuss\n\n**Prediction**:\n\nthe onglish it is evident had they not being previously showed of receiving the king would never have parted with so considerable a some an quile they weakene themselves by the same measure haf streigth and de people with whom they must afterwards have so material and interest to discuss\n\n\n**Label**:\n\none who writes of such an era labours under a troublesome disadvantage\n\n**Prediction**:\n\none how rights of such an er a labours onder a troubles hom disadvantage\n\n\n**Label**:\n\nthen they started on again and two hours later came in sight of the house of doctor pipt\n\n**Prediction**:\n\nthen they started on again and two hours laytor came in sight of the house of doctor pipd\n\n\n**Label**:\n\nwhat does he want\n\n**Prediction**:\n\nwhit daes he want\n\n\n**Label**:\n\nthere just in front\n\n**Prediction**:\n\nthere just 
infront\n\n\n**Label**:\n\nunder ordinary circumstances the abalone is tough and unpalatable but after the deft manipulation of herbert they are tender and make a fine dish either fried as chowder or a la newberg\n\n**Prediction**:\n\nunder ordinary circumstancesi the abl ony is tufgh and unpelitable but after the deftominiculation of hurbourt and they are tender and make a fine dish either fride as choder or alanuburg\n\n\n**Label**:\n\nby degrees all his happiness all his brilliancy subsided into regret and uneasiness so that his limbs lost their power his arms hung heavily by his sides and his head drooped as though he was stupefied\n\n**Prediction**:\n\nby degrees all his happiness ill his brilliancy subsited inter regret and aneasiness so that his limbs lost their power his arms hung heavily by his sides and his head druped as though he was stupified\n\n\n**Label**:\n\ni am the one to go after walt if anyone has to i'll go down mister thomas\n\n**Prediction**:\n\ni have the one to go after walt if ety wod hastu i'll go down mister thommas\n\n\n**Label**:\n\ni had to read it over carefully as the text must be absolutely correct\n\n**Prediction**:\n\ni had to readit over carefully as the tex must be absolutely correct\n\n\n**Label**:\n\nwith a shout the boys dashed pell mell to meet the pack train and falling in behind the slow moving burros urged them on with derisive shouts and sundry resounding slaps on the animals flanks\n\n**Prediction**:\n\nwith a shok the boy stash pale mele to meek the pecktrait ane falling in behind the slow lelicg burs ersh tlan with deressive shouts and sudery resounding sleps on the animal slankes\n\n\n**Label**:\n\ni suppose though it's too early for them then came the explosion\n\n**Prediction**:\n\ni suppouse gho waths two early for them then came the explosion\n\n\n## Content\nThis is a powerful library for **automatic speech recognition**, it is implemented in TensorFlow and support training with CPU\u002FGPU. 
This library contains the following models you can choose from to train your own model:\n* Data Pre-processing\n* Acoustic Modeling\n  * RNN\n  * BRNN\n  * LSTM\n  * BLSTM\n  * GRU\n  * BGRU\n  * Dynamic RNN\n  * Deep Residual Network\n  * Seq2Seq with attention decoder\n  * etc.\n* CTC Decoding\n* Evaluation (mapping some similar phonemes)\n* Saving or Restoring Model\n* Mini-batch Training\n* Training with GPU or CPU with TensorFlow\n* Logging of epoch time and error rate to disk\n\n## Implementation Details\n\n### Data preprocessing\n\n#### TIMIT corpus\n\nThe original TIMIT database contains 6300 utterances, but we found that the 'SA' sentences occur many times, which would introduce a harmful bias into our speech recognition system. Therefore, we removed all 'SA' files from the original dataset to obtain a new TIMIT dataset, which contains only 5040 utterances: a standard training set of 3696 utterances and a test set of 1344.\n\nAutomatic speech recognition transcribes a raw audio file into a character sequence; the preprocessing stage converts a raw audio file into feature vectors over several frames. We first split each audio file into 20ms Hamming windows with an overlap of 10ms, and then calculate the 12 mel-frequency cepstral coefficients, appending an energy variable to each frame. This results in a vector of length 13. We then calculate the delta and delta-delta coefficients, attaining a total of 39 coefficients for each frame. 
In other words, each audio file is split into frames using the Hamming window function, and each frame is converted into a feature vector of length 39 (to attain a feature vector of a different length, modify the settings in the file [timit\\_preprocess.py](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002FAutomatic-Speech-Recognition\u002Fblob\u002Fmaster\u002Fsrc\u002Ffeature\u002Ftimit\u002Ftimit_preprocess.py)).\n\nIn folder data\u002Fmfcc, each file is the timeLength\\*39 feature matrix of one audio file; in folder data\u002Flabel, each file is the label vector corresponding to the mfcc file.\n\nIf you want to customize the data preprocessing, you can edit [calcmfcc.py](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002FAutomatic-Speech-Recognition\u002Fblob\u002Fmaster\u002Fsrc\u002Ffeature\u002Fcore\u002Fcalcmfcc.py) or [timit\\_preprocess.py](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002FAutomatic-Speech-Recognition\u002Fblob\u002Fmaster\u002Fsrc\u002Ffeature\u002Ftimit\u002Ftimit_preprocess.py).\n\nThe original TIMIT dataset contains 61 phonemes. We use all 61 phonemes for training and evaluation, but when scoring we mapped the 61 phonemes onto 39 phonemes for better performance. We do this mapping according to the paper [Speaker-independent phone recognition using hidden Markov models](http:\u002F\u002Frepository.cmu.edu\u002Fcgi\u002Fviewcontent.cgi?article=2768&context=compsci). 
The mapping details are as follows:\n\n| Original Phoneme(s) | Mapped Phoneme |\n| :------------------  | :-------------------: |\n| iy | iy |\n| ix, ih | ix |\n| eh | eh |\n| ae | ae |\n| ax, ah, ax-h | ax | \n| uw, ux | uw |\n| uh | uh |\n| ao, aa | ao |\n| ey | ey |\n| ay | ay |\n| oy | oy |\n| aw | aw |\n| ow | ow |\n| er, axr | er |\n| l, el | l |\n| r | r |\n| w | w |\n| y | y |\n| m, em | m |\n| n, en, nx | n |\n| ng, eng | ng |\n| v | v |\n| f | f |\n| dh | dh |\n| th | th |\n| z | z |\n| s | s |\n| zh, sh | zh |\n| jh | jh |\n| ch | ch |\n| b | b |\n| p | p |\n| d | d |\n| dx | dx |\n| t | t |\n| g | g |\n| k | k |\n| hh, hv | hh |\n| bcl, pcl, dcl, tcl, gcl, kcl, q, epi, pau, h# | h# |\n \n\n#### LibriSpeech corpus\n\nLibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech. It can be downloaded from [here](http:\u002F\u002Fwww.openslr.org\u002F12\u002F)\n\nIn order to preprocess LibriSpeech data, download the dataset from the above mentioned link, extract it and run the following:\n\u003Cpre>\ncd feature\u002Flibri\npython libri_preprocess.py -h \nusage: libri_preprocess [-h]\n                        [-n {dev-clean,dev-other,test-clean,test-other,train-clean-100,train-clean-360,train-other-500}]\n                        [-m {mfcc,fbank}] [--featlen FEATLEN] [-s]\n                        [-wl WINLEN] [-ws WINSTEP]\n                        path save\n\nScript to preprocess libri data\n\npositional arguments:\n  path                  Directory of LibriSpeech dataset\n  save                  Directory where preprocessed arrays are to be saved\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -n {dev-clean,dev-other,test-clean,test-other,train-clean-100,train-clean-360,train-other-500}, --name {dev-clean,dev-other,test-clean,test-other,train-clean-100,train-clean-360,train-other-500}\n                        Name of the dataset\n  -m {mfcc,fbank}, --mode {mfcc,fbank}\n                        Mode\n  
--featlen FEATLEN     Features length\n  -s, --seq2seq         set this flag to use seq2seq\n  -wl WINLEN, --winlen WINLEN\n                        specify the window length of feature\n  -ws WINSTEP, --winstep WINSTEP\n                        specify the window step length of feature\n\u003C\u002Fpre>\n\nThe processed data will be saved in the \"save\" path. \n\nTo train the model, run the following:\n\u003Cpre>\npython main\u002Flibri_train.py -h \nusage: libri_train.py [-h] [--task TASK] [--train_dataset TRAIN_DATASET]\n                      [--dev_dataset DEV_DATASET]\n                      [--test_dataset TEST_DATASET] [--mode MODE]\n                      [--keep [KEEP]] [--nokeep] [--level LEVEL]\n                      [--model MODEL] [--rnncell RNNCELL]\n                      [--num_layer NUM_LAYER] [--activation ACTIVATION]\n                      [--optimizer OPTIMIZER] [--batch_size BATCH_SIZE]\n                      [--num_hidden NUM_HIDDEN] [--num_feature NUM_FEATURE]\n                      [--num_classes NUM_CLASSES] [--num_epochs NUM_EPOCHS]\n                      [--lr LR] [--dropout_prob DROPOUT_PROB]\n                      [--grad_clip GRAD_CLIP] [--datadir DATADIR]\n                      [--logdir LOGDIR]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --task TASK           set task name of this program\n  --train_dataset TRAIN_DATASET\n                        set the training dataset\n  --dev_dataset DEV_DATASET\n                        set the development dataset\n  --test_dataset TEST_DATASET\n                        set the test dataset\n  --mode MODE           set whether to train, dev or test\n  --keep [KEEP]         set whether to restore a model, when test mode, keep\n                        should be set to True\n  --nokeep\n  --level LEVEL         set the task level, phn, cha, or seq2seq, seq2seq will\n                        be supported soon\n  --model MODEL         set the model to use, DBiRNN, BiRNN, 
ResNet..\n  --rnncell RNNCELL     set the rnncell to use, rnn, gru, lstm...\n  --num_layer NUM_LAYER\n                        set the layers for rnn\n  --activation ACTIVATION\n                        set the activation to use, sigmoid, tanh, relu, elu...\n  --optimizer OPTIMIZER\n                        set the optimizer to use, sgd, adam...\n  --batch_size BATCH_SIZE\n                        set the batch size\n  --num_hidden NUM_HIDDEN\n                        set the hidden size of rnn cell\n  --num_feature NUM_FEATURE\n                        set the size of input feature\n  --num_classes NUM_CLASSES\n                        set the number of output classes\n  --num_epochs NUM_EPOCHS\n                        set the number of epochs\n  --lr LR               set the learning rate\n  --dropout_prob DROPOUT_PROB\n                        set probability of dropout\n  --grad_clip GRAD_CLIP\n                        set the threshold of gradient clipping, -1 denotes no\n                        clipping\n  --datadir DATADIR     set the data root directory\n  --logdir LOGDIR       set the log directory\n\u003C\u002Fpre>\n\nwhere the \"datadir\" is the \"save\" path used in preprocess stage.\n\n#### Wall Street Journal corpus\n\nTODO\n\n### Core Features\n+ dynamic RNN(GRU, LSTM)\n+ Residual Network(Deep CNN)\n+ CTC Decoding\n+ TIMIT Phoneme Edit Distance(PER)\n\n## Future Work\n- [ ] Release pretrained English ASR model\n- [ ] Add Attention Mechanism\n- [ ] Add Speaker Verification\n- [ ] Add TTS\n\n## License\nMIT\n\n## Contact Us\nIf this program is helpful to you, please give us a **star or fork** to encourage us to keep updating. Thank you! 
Besides, any issues or pull requests are appreciated.\n\nCollaborators: \n\n[zzw922cn](https:\u002F\u002Fgithub.com\u002Fzzw922cn)\n\n[deepxuexi](https:\u002F\u002Fgithub.com\u002Fdeepxuexi)\n\n[hiteshpaul](https:\u002F\u002Fgithub.com\u002Fhiteshpaul)\n\n[xxxxyzt](https:\u002F\u002Fgithub.com\u002Fxxxxyzt)\n","# 自动语音识别\n使用 TensorFlow 实现的端到端自动语音识别系统。\n\n## 最新更新\n- [x] **支持 TensorFlow r1.0** (2017-02-24)\n- [x] **支持动态 RNN 的 dropout** (2017-03-11)\n- [x] **支持在 shell 脚本中运行** (2017-03-11)\n- [x] **支持每隔几个训练周期自动进行评估** (2017-03-11)\n- [x] **修复字符级自动语音识别中的 bug** (2017-03-14)\n- [x] **改进部分函数 API 以提高可重用性** (2017-03-14)\n- [x] **添加数据预处理的归一化缩放功能** (2017-03-15)\n- [x] **为 LibriSpeech 训练添加可重用支持** (2017-03-15)\n- [x] **添加简单的 n-gram 模型，用于随机生成或统计用途** (2017-03-23)\n- [x] **改进一些预处理和训练相关的代码** (2017-03-23)\n- [x] **将 TAB 替换为空格，并添加 nist2wav 转换脚本** (2017-04-20)\n- [x] **添加部分数据准备代码** (2017-05-01)\n- [x] **按照 s5 配方对 WSJ 语料库进行标准预处理** (2017-05-05)\n- [x] **项目重构。更新 train.py 以方便使用** (2017-05-06)\n- [x] **完成 TIMIT、Libri 和 WSJ 的特征模块，支持 LibriSpeech 的训练** (2017-05-14)\n- [x] **移除部分不必要的代码** (2017-07-22)\n- [x] **添加 DeepSpeech2 实现代码** (2017-07-23)\n- [x] **修复部分 bug** (2017-08-06)\n- [x] **添加普通话语音识别支持** (2017-08-06)\n- [x] **添加胶囊网络模型** (2017-12-12)\n- [x] **发布 1.0.0 版本** (2017-12-14)\n- [x] **添加语言建模模块** (2017-12-25)\n- [x] **即将支持 TF1.12** (2019-10-17)\n\n## 推荐\n如果您希望用 TensorFlow 的多线程和 FIFO 队列输入管道来替代 feed dict 操作，可以参考我的仓库 [TensorFlow-Input-Pipeline](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002FTensorFlow-Input-Pipeline)，其中提供了更多示例代码。我的实践证明，FIFO 队列输入管道可以在一定程度上提升训练速度。\n\n如果您想了解语音识别的发展历程，我收集了自 1981 年以来 ASR 领域的重要论文。您可以在我的仓库 [awesome-speech-recognition-papers](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002Fawesome-speech-recognition-papers) 中阅读这些优秀论文列表，所有论文的下载链接均已提供。我会每周更新该仓库，加入新的论文，涵盖语音识别、语音合成和语言建模等领域。希望我们不会错过语音领域任何重要的研究论文。\n\n未来我所有的公共仓库都会持续更新，感谢您的 Star！\n\n## 安装与使用\n目前仅支持 Python 3.5。\n\n该项目依赖于 scikit.audiolab，因此您需要在系统中安装 
[libsndfile](http:\u002F\u002Fwww.mega-nerd.com\u002Flibsndfile\u002F)。\n\n将仓库克隆到您喜欢的目录，并使用以下命令进行安装：\n\u003Cpre>\nsudo pip3 install -r requirements.txt\nsudo python3 setup.py install\n\u003C\u002Fpre>\n\n使用时，只需运行以下命令：\n\u003Cpre>\npython main\u002Ftimit_train.py [-h] [--mode MODE] [--keep [KEEP]] [--nokeep]\n                      [--level LEVEL] [--model MODEL] [--rnncell RNNCELL]\n                      [--num_layer NUM_LAYER] [--activation ACTIVATION]\n                      [--optimizer OPTIMIZER] [--batch_size BATCH_SIZE]\n                      [--num_hidden NUM_HIDDEN] [--num_feature NUM_FEATURE]\n                      [--num_classes NUM_CLASSES] [--num_epochs NUM_EPOCHS]\n                      [--lr LR] [--dropout_prob DROPOUT_PROB]\n                      [--grad_clip GRAD_CLIP] [--datadir DATADIR]\n                      [--logdir LOGDIR]\n\n可选参数：\n  -h, --help            显示帮助信息并退出\n  --mode MODE           设置是训练还是测试模式\n  --keep [KEEP]         设置是否恢复模型，在测试模式下应设置为 True\n  --nokeep\n  --level LEVEL         设置任务级别，phn、cha 或 seq2seq，seq2seq 将很快支持\n  --model MODEL         设置使用的模型，DBiRNN、BiRNN、ResNet 等\n  --rnncell RNNCELL     设置使用的 RNN 单元，RNN、GRU、LSTM 等\n  --num_layer NUM_LAYER 设置 RNN 的层数\n  --activation ACTIVATION 设置激活函数，sigmoid、tanh、ReLU、ELU 等\n  --optimizer OPTIMIZER 设置优化器，SGD、Adam 等\n  --batch_size BATCH_SIZE 设置批次大小\n  --num_hidden NUM_HIDDEN 设置 RNN 单元的隐藏层大小\n  --num_feature NUM_FEATURE 设置输入特征的维度\n  --num_classes NUM_CLASSES 设置输出类别的数量\n  --num_epochs NUM_EPOCHS 设置训练轮数\n  --lr LR               设置学习率\n  --dropout_prob DROPOUT_PROB 设置 dropout 概率\n  --grad_clip GRAD_CLIP 设置梯度裁剪阈值\n  --datadir DATADIR     设置数据根目录\n  --logdir LOGDIR       设置日志目录\n\n\u003C\u002Fpre>\n除了在命令行中配置参数外，您也可以在实际操作中直接在 [timit_train.py](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002FAutomatic_Speech_Recognition\u002Fblob\u002Fmaster\u002Fmain\u002Ftimit_train.py) 中设置上述参数。\n\n此外，您还可以运行 `main\u002Frun.sh` 来同时进行训练和测试！详情请参阅 
[run_timit.sh](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002FAutomatic_Speech_Recognition\u002Fblob\u002Fmaster\u002Fmain\u002Frun_timit.sh)。\n\n## 性能\n### 基于 PER 的动态 BLSTM 在 TIMIT 数据集上的表现，由于时间有限进行了简单调优\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fzzw922cn_Automatic_Speech_Recognition_readme_6b9956a35c30.png)\n\n### LibriSpeech 语音识别结果（无语言模型）\n**标签**：\n\n大约正午时分，韦弗利船长走进了散乱的村庄——更确切地说，是图利·维奥兰的小村落——庄园主的宅邸就坐落在附近。\n\n**预测**：\n\n大约正午时分，韦弗利船长走进了散乱的比拉戈尔村——更确切地说，是图利·维奥兰的小村落——庄园主的宅邸就坐落在附近。\n\n\n**标签**：\n\n很明显，如果英国人先前没有得到保证，他们绝不会拿出如此巨额的资金；而他们却通过同样的举措削弱了自己，同时增强了与他们日后必须进行重要交涉的民族的力量。\n\n**预测**：\n\n很明显，如果英国人先前没有得到保证，他们绝不会拿出如此巨额的资金；而他们却通过同样的举措削弱了自己的力量，同时增强了与他们日后必须进行重要交涉的民族的力量。\n\n\n**标签**：\n\n描写这样一个时代的人，往往处于一种令人困扰的劣势之中。\n\n**预测**：\n\n描写这样一个时代的一个人，处于一种令人困扰的劣势之中。\n\n\n**标签**：\n\n随后他们再次启程，两小时后便看到了皮普特医生的房子。\n\n**预测**：\n\n随后他们再次启程，两小时后便看到了皮普特医生的房子。\n\n\n**标签**：\n\n他想要什么？\n\n**预测**：\n\n他想要什么？\n\n\n**标签**：\n\n就在前面。\n\n**预测**：\n\n就在前面。\n\n\n**标签**：\n\n在一般情况下，鲍鱼又硬又不好吃，但在赫伯特巧妙的处理下，它们变得鲜嫩可口，无论是油炸、做成浓汤，还是以纽伯格式烹调，都是一道美味佳肴。\n\n**预测**：\n\n在一般情况下，鲍鱼又硬又不好吃，但在赫伯特巧妙的处理下，它们变得鲜嫩可口，无论是油炸、做成浓汤，还是以纽伯格式烹调，都是一道美味佳肴。\n\n\n**标签**：\n\n渐渐地，他所有的快乐和光彩都消退为悔恨和不安，以至于他的四肢失去了力气，双臂沉重地垂在身旁，头也低垂下来，仿佛被麻痹了一样。\n\n**预测**：\n\n渐渐地，他所有的快乐和光彩都消退为悔恨和不安，以至于他的四肢失去了力气，双臂沉重地垂在身旁，头也低垂下来，仿佛被麻痹了一样。\n\n\n**标签**：\n\n如果一定要有人去找沃尔特的话，那我就是那个人。托马斯先生，我会去的。\n\n**预测**：\n\n如果一定要有人去找沃尔特的话，那我就是那个人。托马斯先生，我会去的。\n\n\n**标签**：\n\n我不得不仔细地反复阅读它，因为文本必须绝对正确。\n\n**预测**：\n\n我不得不仔细地反复阅读它，因为文本必须绝对正确。\n\n\n**标签**：\n\n伴随着一声呼喊，男孩们一窝蜂地冲上前去迎接驮队，在缓慢前行的骡子后面跟上，用嘲讽的叫喊声和响亮的拍打声催促着这些动物前进。\n\n**预测**：\n\n伴随着一声呼喊，男孩们一窝蜂地冲上前去迎接驮队，在缓慢前行的骡子后面跟上，用嘲讽的叫喊声和响亮的拍打声催促着这些动物前进。\n\n\n**标签**：\n\n我想，虽然对他们来说还太早了，但随后就发生了爆炸。\n\n**预测**：\n\n我想，虽然对他们来说还太早了，但随后就发生了爆炸。\n\n\n## 内容\n这是一个功能强大的**自动语音识别**库，基于 TensorFlow 实现，并支持使用 CPU 或 GPU 进行训练。该库包含以下模型，您可以选择其中任意一种来训练自己的模型：\n* 数据预处理\n* 声学建模\n  * RNN\n  * BRNN\n  * LSTM\n  * BLSTM\n  * GRU\n  * BGRU\n  * 动态 RNN\n  * 深度残差网络\n  * 带注意力机制解码器的 Seq2Seq\n  * 等等\n* CTC 解码\n* 评估（对部分相似音素进行映射）\n* 模型保存与恢复\n* 小批量训练\n* 使用 TensorFlow 在 GPU 或 CPU 
上训练\n* 将每个 epoch 的耗时及错误率记录到磁盘中\n\n## 实现细节\n\n### 数据预处理\n\n#### TIMIT 语料库\n\n原始的 TIMIT 数据库包含 6300 条语音样本，但我们发现其中“SA”类别的音频文件重复出现多次，这会导致我们的语音识别系统产生偏差。因此，我们从原始数据集中移除了所有“SA”类别的文件，得到了一个新的 TIMIT 数据集，其中仅包含 5040 条语音样本，包括 3696 条标准训练集和 1344 条测试集。\n\n自动语音识别会将原始音频文件转录为字符序列；而预处理阶段则是将原始音频文件转换为由若干帧组成的特征向量。我们首先将每段音频分割成 20 毫秒的汉明窗口，重叠 10 毫秒，然后计算 12 个梅尔频率倒谱系数，并在每一帧中附加一个能量变量，从而得到长度为 13 的向量。接着，我们再计算一阶和二阶差分系数，最终每帧共有 39 个系数。换句话说，每段音频都会被汉明窗口函数分割成多个帧，而每个帧则会被提取为一个长度为 39 的特征向量（若需获取不同长度的特征向量，可在 [timit_preprocess.py](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002FAutomatic-Speech-Recognition\u002Fblob\u002Fmaster\u002Fsrc\u002Ffeature\u002Ftimit\u002Ftimit_preprocess.py) 文件中修改相关设置）。\n\n在 data\u002Fmfcc 文件夹中，每个文件都是某段音频的时间长度乘以 39 维的特征矩阵；而在 data\u002Flabel 文件夹中，每个文件则是对应 mfcc 文件的标签向量。\n\n如果您希望自定义数据预处理流程，可以编辑 [calcmfcc.py](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002FAutomatic-Speech-Recognition\u002Fblob\u002Fmaster\u002Fsrc\u002Ffeature\u002Fcore\u002Fcalcmfcc.py) 或 [timit_preprocess.py](https:\u002F\u002Fgithub.com\u002Fzzw922cn\u002FAutomatic-Speech-Recognition\u002Fblob\u002Fmaster\u002Fsrc\u002Ffeature\u002Ftimit\u002Ftimit_preprocess.py) 文件。\n\n原始 TIMIT 数据集包含 61 个音素，我们在训练和评估时使用这 61 个音素，但在评分时，为了提升性能，我们将这 61 个音素映射为 39 个音素。这一映射依据论文《基于隐马尔可夫模型的独立于说话人的音素识别》（http:\u002F\u002Frepository.cmu.edu\u002Fcgi\u002Fviewcontent.cgi?article=2768&context=compsci）进行。映射详情如下：\n\n| 原始音素 | 映射后的音素 |\n| :------------------  | :-------------------: |\n| iy | iy |\n| ix, ih | ix |\n| eh | eh |\n| ae | ae |\n| ax, ah, ax-h | ax | \n| uw, ux | uw |\n| uh | uh |\n| ao, aa | ao |\n| ey | ey |\n| ay | ay |\n| oy | oy |\n| aw | aw |\n| ow | ow |\n| er, axr | er |\n| l, el | l |\n| r | r |\n| w | w |\n| y | y |\n| m, em | m |\n| n, en, nx | n |\n| ng, eng | ng |\n| v | v |\n| f | f |\n| dh | dh |\n| th | th |\n| z | z |\n| s | s |\n| zh, sh | zh |\n| jh | jh |\n| ch | ch |\n| b | b |\n| p | p |\n| d | d |\n| dx | dx |\n| t | t |\n| g | g |\n| k | k |\n| hh, hv | hh |\n| bcl, pcl, dcl, tcl, gcl, kcl, q, epi, pau, h# | h# 
#### LibriSpeech corpus

LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech. It can be downloaded [here](http://www.openslr.org/12/).

To preprocess the LibriSpeech data, first download the dataset from the link above, extract it, and run:
<pre>
cd feature/libri
python libri_preprocess.py -h
usage: libri_preprocess [-h]
                        [-n {dev-clean,dev-other,test-clean,test-other,train-clean-100,train-clean-360,train-other-500}]
                        [-m {mfcc,fbank}] [--featlen FEATLEN] [-s]
                        [-wl WINLEN] [-ws WINSTEP]
                        path save

Script to preprocess libri data

positional arguments:
  path                  directory of the LibriSpeech dataset
  save                  directory where the preprocessed arrays are saved

optional arguments:
  -h, --help            show this help message and exit
  -n {dev-clean,dev-other,test-clean,test-other,train-clean-100,train-clean-360,train-other-500}, --name {dev-clean,dev-other,test-clean,test-other,train-clean-100,train-clean-360,train-other-500}
                        name of the dataset
  -m {mfcc,fbank}, --mode {mfcc,fbank}
                        mode
  --featlen FEATLEN     feature length
  -s, --seq2seq         set this flag to use seq2seq
  -wl WINLEN, --winlen WINLEN
                        window length of the feature
  -ws WINSTEP, --winstep WINSTEP
                        window step of the feature
</pre>

The processed data is saved under the "save" path.

To train a model, run:
<pre>
python main/libri_train.py -h
usage: libri_train.py [-h] [--task TASK] [--train_dataset TRAIN_DATASET]
                      [--dev_dataset DEV_DATASET]
                      [--test_dataset TEST_DATASET] [--mode MODE]
                      [--keep [KEEP]] [--nokeep] [--level LEVEL]
                      [--model MODEL] [--rnncell RNNCELL]
                      [--num_layer NUM_LAYER] [--activation ACTIVATION]
                      [--optimizer OPTIMIZER] [--batch_size BATCH_SIZE]
                      [--num_hidden NUM_HIDDEN] [--num_feature NUM_FEATURE]
                      [--num_classes NUM_CLASSES] [--num_epochs NUM_EPOCHS]
                      [--lr LR] [--dropout_prob DROPOUT_PROB]
                      [--grad_clip GRAD_CLIP] [--datadir DATADIR]
                      [--logdir LOGDIR]

optional arguments:
  -h, --help            show this help message and exit
  --task TASK           name of the task
  --train_dataset TRAIN_DATASET
                        training dataset
  --dev_dataset DEV_DATASET
                        development dataset
  --test_dataset TEST_DATASET
                        test dataset
  --mode MODE           whether to train, dev, or test
  --keep [KEEP]         whether to restore the model; in test mode, keep should be set to True
  --nokeep
  --level LEVEL         task level: phn, cha, or seq2seq (seq2seq will be supported soon)
  --model MODEL         model to use: DBiRNN, BiRNN, ResNet, etc.
  --rnncell RNNCELL     RNN cell to use: rnn, gru, lstm, etc.
  --num_layer NUM_LAYER
                        number of RNN layers
  --activation ACTIVATION
                        activation function: sigmoid, tanh, relu, elu, etc.
  --optimizer OPTIMIZER
                        optimizer: sgd, adam, etc.
  --batch_size BATCH_SIZE
                        batch size
  --num_hidden NUM_HIDDEN
                        hidden size of the RNN cell
  --num_feature NUM_FEATURE
                        size of the input features
  --num_classes NUM_CLASSES
                        number of output classes
  --num_epochs NUM_EPOCHS
                        number of training epochs
  --lr LR               learning rate
  --dropout_prob DROPOUT_PROB
                        dropout probability
  --grad_clip GRAD_CLIP
                        gradient clipping threshold; -1 means no clipping
  --datadir DATADIR     root directory of the data
  --logdir LOGDIR       log directory
</pre>

Here, "datadir" is the "save" path used in the preprocessing stage.

#### Wall Street Journal corpus

To be done.


### Core Features
+ Dynamic RNN (GRU, LSTM)
+ Residual network (deep CNN)
+ CTC decoding
+ TIMIT phoneme edit distance (PER)

## Future Work
- [ ] Release a pretrained English ASR model
- [ ] Add attention mechanism
- [ ] Add speaker verification
- [ ] Add TTS

## License
MIT

## Contact Us
If this program helps you, please **star or fork** the repository to encourage us to keep updating it. Thank you for your support! In addition, any issues or pull requests are greatly appreciated.
Contributors:

[zzw922cn](https://github.com/zzw922cn)

[deepxuexi](https://github.com/deepxuexi)

[hiteshpaul](https://github.com/hiteshpaul)

[xxxxyzt](https://github.com/xxxxyzt)

# Automatic-Speech-Recognition Quick Start Guide

This project is an end-to-end automatic speech recognition (ASR) system implemented in TensorFlow, supporting several acoustic models (RNN, LSTM, BLSTM, DeepSpeech2, etc.) and CTC decoding.

## Environment Setup

Before starting, make sure your development environment meets the following requirements:

*   **Operating system**: Linux / macOS (Windows needs an additional build environment)
*   **Python version**: currently only **Python 3.5** is supported
*   **Core dependencies**:
    *   TensorFlow (versions r1.0 through r1.12 recommended, per the project changelog)
    *   `scikit.audiolab` (requires the [libsndfile](http://www.mega-nerd.com/libsndfile/) system library)
    *   other Python packages (see `requirements.txt`)

> **Tip**: users in mainland China can point pip at the Tsinghua or Aliyun mirror to speed up downloads.

## Installation

1.  **Install the system-level dependency libsndfile**
    *   Ubuntu/Debian:
        ```bash
        sudo apt-get install libsndfile1-dev
        ```
    *   macOS (Homebrew):
        ```bash
        brew install libsndfile
        ```

2.  **Clone the project**
    ```bash
    git clone https://github.com/zzw922cn/Automatic_Speech_Recognition.git
    cd Automatic_Speech_Recognition
    ```

3.  **Install the Python dependencies and build the project**
    A mirror can speed up installation:
    ```bash
    # install the dependencies in requirements.txt
    sudo pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

    # install this project
    sudo python3 setup.py install
    ```

## Basic Usage

The project exposes flexible command-line arguments for training and testing. The examples below use the TIMIT dataset to show the most basic usage.
### 1. Command-Line Mode

You can launch training or testing directly via `python main/timit_train.py`.

**Training example:**
```bash
python main/timit_train.py --mode train --level phn --model DBiRNN --rnncell lstm --num_layer 3 --batch_size 32 --num_epochs 10 --datadir ./data/timit --logdir ./logs
```

**Testing example:**
```bash
python main/timit_train.py --mode test --level phn --model DBiRNN --keep True --datadir ./data/timit --logdir ./logs
```

**Main parameters:**
*   `--mode`: `train` or `test`.
*   `--keep`: must be set to `True` when testing so the saved model is loaded.
*   `--level`: task level, one of `phn` (phoneme), `cha` (character), `seq2seq`.
*   `--model`: model type, e.g. `DBiRNN`, `BiRNN`, `ResNet`, `DeepSpeech2`.
*   `--rnncell`: RNN cell type, e.g. `rnn`, `gru`, `lstm`.
*   `--datadir`: root directory of the data.
*   `--logdir`: directory for logs and saved models.

### 2. Script Mode

The project also ships shell scripts that run training and testing in one go (edit the script to fit your needs first):

```bash
bash main/run.sh
```

See `main/run_timit.sh` for a concrete configuration.

### 3. In-Code Configuration

Instead of command-line arguments, you can edit `main/timit_train.py` directly and hard-code the parameter values, which suits fixed experimental setups.

## Use Case

A research team focused on dialect preservation is building a bilingual Mandarin-English speech transcription system to digitize endangered oral-history recordings.

### Without Automatic_Speech_Recognition
- **No multilingual support**: most traditional open-source systems only support standard English, so the team would have to develop a Mandarin acoustic model from scratch, costing months and struggling with data alignment.
- **Slow architecture iteration**: trying advanced architectures such as DeepSpeech2 or capsule networks would require hand-rewriting low-level TensorFlow code, with very long debugging cycles.
- **Tedious data preprocessing**: without automated preprocessing scripts for standard datasets such as TIMIT and LibriSpeech, cleaning audio and extracting features would consume 80% of development time.
- **Weak training monitoring**: no automatic evaluation every few epochs, so model tuning would depend entirely on manual intervention, making it hard to validate algorithmic improvements quickly.

### With Automatic_Speech_Recognition
- **Bilingual out of the box**: the built-in end-to-end Mandarin and English recognition modules work immediately, cutting project ramp-up from months to days.
- **One-flag model switching**: command-line arguments switch between DeepSpeech2, layer-normalized RNN, or capsule-network models, making architecture comparisons fast.
- **Standardized data pipeline**: the bundled s5 recipes and preprocessing code automatically handle feature extraction and scaling for corpora such as WSJ and LibriSpeech, freeing up significant effort.
- **Automated training loop**: once configured, periodic evaluation and checkpointing run automatically, and an n-gram language model can assist decoding, noticeably improving convergence speed and recognition accuracy.
By providing out-of-the-box multilingual support and a modular architecture, Automatic_Speech_Recognition frees the research team from repetitive infrastructure work, letting it focus on core algorithmic innovation and on mining its dialect data in depth.

## Environment Notes

The project targets the older TensorFlow 1.x line (support is mentioned up to 1.12) and depends on `scikit.audiolab`, so the libsndfile library must be installed on the system first. The code mainly targets the TIMIT, LibriSpeech, and WSJ datasets and includes the corresponding preprocessing scripts. Since the project was last updated in 2019 and relies on an old framework, running it in a modern environment may require a compatible legacy dependency stack (Python 3.5, `tensorflow>=1.0`, `scikit.audiolab`, libsndfile). Training runs on either CPU or GPU; specific GPU models, memory sizes, and CUDA versions are not specified. The repository is MIT-licensed and written almost entirely in Python (98.7%), with small amounts of Shell (1%) and Dockerfile (0.3%).

## FAQ

**Does the project provide pretrained models?**
Not at the moment: the author cannot release them directly because the copyright does not belong to him. He welcomes issues for discussion, and may later publish slightly different pretrained models of his own on GitHub. ([source](https://github.com/zzw922cn/Automatic_Speech_Recognition/issues/24))

**How do I fix "SyntaxError: Missing parentheses in call to print"?**
This is a Python 2 vs. Python 3 syntax difference: in Python 3.x, `print` must be called with parentheses. Add the parentheses to the offending `print` statements, then rerun `python setup.py install`. ([source](https://github.com/zzw922cn/Automatic_Speech_Recognition/issues/56))

**scikits.audiolab only supports Python 2; what can replace it under Python 3?**
audiolab is mainly used to read audio data. Under Python 3, the **librosa** library can serve as an alternative for the same purpose. ([source](https://github.com/zzw922cn/Automatic_Speech_Recognition/issues/45))

**Where can I find a working link for the TIMIT dataset?**
See the following GitHub repository for information and links about TIMIT: https://github.com/philipperemy/timit ([source](https://github.com/zzw922cn/Automatic_Speech_Recognition/issues/36))

**How do I output phonemes instead of English text?**
The model outputs English text by default. To obtain the raw detected phonemes, you may need an additional conversion tool, or you can modify the code to print the underlying recognized phonemes directly. ([source](https://github.com/zzw922cn/Automatic_Speech_Recognition/issues/51))

**How is Chinese speech recognition handled? Are the labels characters or pinyin?**
The mainstream approach uses speech features as input and Chinese characters as output labels, which avoids a pinyin decoding step. Another approach first segments sentences into words and uses words as labels, which is considered more reasonable. ([source](https://github.com/zzw922cn/Automatic_Speech_Recognition/issues/47))

**Are theano, tabulate, and xlwt in requirements.txt required?**
Theano is not required. tabulate and xlwt are required; they are used mainly for data analysis, such as building tables and Excel files. ([source](https://github.com/zzw922cn/Automatic_Speech_Recognition/issues/41))

**What if the `_linear` function cannot be found or fails to import?**
Add the following import to the affected file: `from tensorflow.contrib.rnn.python.ops.core_rnn_cell import _linear`. ([source](https://github.com/zzw922cn/Automatic_Speech_Recognition/issues/84))