[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-buriburisuri--speech-to-text-wavenet":3,"tool-buriburisuri--speech-to-text-wavenet":65},[4,23,32,40,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,2,"2026-04-05T10:45:23",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},4128,"GPT-SoVITS","RVC-Boss\u002FGPT-SoVITS","GPT-SoVITS 是一款强大的开源语音合成与声音克隆工具，旨在让用户仅需极少量的音频数据即可训练出高质量的个性化语音模型。它核心解决了传统语音合成技术依赖海量录音数据、门槛高且成本大的痛点，实现了“零样本”和“少样本”的快速建模：用户只需提供 5 秒参考音频即可即时生成语音，或使用 1 分钟数据进行微调，从而获得高度逼真且相似度极佳的声音效果。\n\n该工具特别适合内容创作者、独立开发者、研究人员以及希望为角色配音的普通用户使用。其内置的友好 WebUI 界面集成了人声伴奏分离、自动数据集切片、中文语音识别及文本标注等辅助功能，极大地降低了数据准备和模型训练的技术门槛，让非专业人士也能轻松上手。\n\n在技术亮点方面，GPT-SoVITS 不仅支持中、英、日、韩、粤语等多语言跨语种合成，还具备卓越的推理速度，在主流显卡上可实现实时甚至超实时的生成效率。无论是需要快速制作视频配音，还是进行多语言语音交互研究，GPT-SoVITS 都能以极低的数据成本提供专业级的语音合成体验。",56375,3,"2026-04-05T22:15:46",[21],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},2863,"TTS","coqui-ai\u002FTTS","🐸TTS 是一款功能强大的深度学习文本转语音（Text-to-Speech）开源库，旨在将文字自然流畅地转化为逼真的人声。它解决了传统语音合成技术中声音机械生硬、多语言支持不足以及定制门槛高等痛点，让高质量的语音生成变得触手可及。\n\n无论是希望快速集成语音功能的开发者，还是致力于探索前沿算法的研究人员，亦或是需要定制专属声音的数据科学家，🐸TTS 都能提供得力支持。它不仅预置了覆盖全球 1100 
多种语言的训练模型，让用户能够即刻上手，还提供了完善的工具链，支持用户利用自有数据训练新模型或对现有模型进行微调，轻松实现特定风格的声音克隆。\n\n在技术亮点方面，🐸TTS 表现卓越。其最新的 ⓍTTSv2 模型支持 16 种语言，并在整体性能上大幅提升，实现了低于 200 毫秒的超低延迟流式输出，极大提升了实时交互体验。此外，它还无缝集成了 🐶Bark、🐢Tortoise 等社区热门模型，并支持调用上千个 Fairseq 模型，展现了极强的兼容性与扩展性。配合丰富的数据集分析与整理工具，🐸TTS 已成为科研与生产环境中备受信赖的语音合成解决方案。",44971,"2026-04-03T14:47:02",[21,20,13],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":29,"last_commit_at":46,"category_tags":47,"status":22},2375,"LocalAI","mudler\u002FLocalAI","LocalAI 是一款开源的本地人工智能引擎，旨在让用户在任意硬件上轻松运行各类 AI 模型，包括大语言模型、图像生成、语音识别及视频处理等。它的核心优势在于彻底打破了高性能计算的门槛，无需昂贵的专用 GPU，仅凭普通 CPU 或常见的消费级显卡（如 NVIDIA、AMD、Intel 及 Apple Silicon）即可部署和运行复杂的 AI 任务。\n\n对于担心数据隐私的用户而言，LocalAI 提供了“隐私优先”的解决方案，确保所有数据处理均在本地基础设施内完成，无需上传至云端。同时，它完美兼容 OpenAI、Anthropic 等主流 API 接口，这意味着开发者可以无缝迁移现有应用，直接利用本地资源替代云服务，既降低了成本又提升了可控性。\n\nLocalAI 内置了超过 35 种后端支持（如 llama.cpp、vLLM、Whisper 等），并集成了自主 AI 代理、工具调用及检索增强生成（RAG）等高级功能，且具备多用户管理与权限控制能力。无论是希望保护敏感数据的企业开发者、进行算法实验的研究人员，还是想要在个人电脑上体验最新 AI 技术的极客玩家，都能通过 LocalAI 获",44782,"2026-04-02T22:14:26",[13,21,19,17,20,14,16],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":29,"last_commit_at":54,"category_tags":55,"status":22},3108,"bark","suno-ai\u002Fbark","Bark 是由 Suno 推出的开源生成式音频模型，能够根据文本提示创造出高度逼真的多语言语音、音乐、背景噪音及简单音效。与传统仅能朗读文字的语音合成工具不同，Bark 基于 Transformer 架构，不仅能模拟说话，还能生成笑声、叹息、哭泣等非语言声音，甚至能处理带有情感色彩和语气停顿的复杂文本，极大地丰富了音频表达的可能性。\n\n它主要解决了传统语音合成声音机械、缺乏情感以及无法生成非语音类音效的痛点，让创作者能通过简单的文字描述获得生动自然的音频素材。无论是需要为视频配音的内容创作者、探索多模态生成的研究人员，还是希望快速原型设计的开发者，都能从中受益。普通用户也可通过集成的演示页面轻松体验其神奇效果。\n\n技术亮点方面，Bark 支持商业使用（MIT 许可），并在近期更新中实现了显著的推理速度提升，同时提供了适配低显存 GPU 的版本，降低了使用门槛。此外，社区还建立了丰富的提示词库，帮助用户更好地驾驭模型生成特定风格的声音。只需几行 Python 代码，即可将创意文本转化为高质量音频，是连接文字与声音世界的强大桥梁。",39067,"2026-04-04T03:33:35",[21],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":22},3788,"airi","moeru-ai\u002Fairi","airi 是一款开源的本地化 AI 伴侣项目，旨在将虚拟角色（如“二次元老婆”或赛博生命）带入用户的现实世界。它的核心目标是复刻并超越知名 AI 主播 Neuro-sama 
的能力，让用户能够拥有完全自主掌控、可私有化部署的智能伙伴。\n\nairi 主要解决了用户对高度定制化、具备情感交互能力且数据隐私安全的 AI 角色的需求。不同于依赖云端服务的通用助手，airi 允许用户在本地运行，不仅保护了对话隐私，还赋予了用户定义角色性格与灵魂的自由。它支持实时语音聊天，甚至能直接参与《我的世界》（Minecraft）和《异星工厂》（Factorio）等游戏，实现了从单纯对话到共同娱乐的跨越。\n\n这款工具非常适合喜爱虚拟角色的普通用户、希望搭建个性化 AI 陪伴的技术爱好者，以及研究多模态交互的开发者。其独特的技术亮点在于跨平台支持（涵盖 Web、macOS 和 Windows）以及强大的游戏交互能力，让 AI 不仅能“说”，还能“玩”。通过容器化的灵魂设计，airi 为每个人创造专属数字生命提供了可能，让虚拟陪伴变得更加真实且触手可及。",37086,1,"2026-04-05T10:54:25",[19,21,17],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":82,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":29,"env_os":98,"env_gpu":99,"env_ram":100,"env_deps":101,"category_tags":109,"github_topics":82,"view_count":110,"oss_zip_url":82,"oss_zip_packed_at":82,"status":22,"created_at":111,"updated_at":112,"faqs":113,"releases":141},636,"buriburisuri\u002Fspeech-to-text-wavenet","speech-to-text-wavenet","Speech-to-Text-WaveNet : End-to-end sentence level English speech recognition based on DeepMind's WaveNet and tensorflow","speech-to-text-wavenet 是基于 DeepMind 经典 WaveNet 架构与 TensorFlow 构建的端到端英文语音识别开源项目。它致力于将语音信号直接转换为文本，弥补了早期 TensorFlow WaveNet 实现仅专注于音频生成而忽略识别任务的不足。\n\n对于深度学习研究人员和对语音识别技术感兴趣的开发者而言，这是一个很好的学习资源。面对原始论文中实现细节模糊的挑战，speech-to-text-wavenet 通过调整模型结构来适应硬件限制，例如在 TitanX GPU 上移除难以运行的均值池化层，并改用 CTC Loss 以适应句子级标签数据。此外，它还整合了 VCTK、LibriSpeech 等多个公开语料库，提供了从预处理到训练的全流程代码参考。\n\n尽管受限于旧版 TensorFlow 环境，不太适合直接用于商业生产，但它为理解 WaveNet 在序列建模中的应用提供了宝贵的实践案例，是学习神经语音识别架构的理想起点。","# Speech-to-Text-WaveNet : End-to-end sentence level English speech recognition using DeepMind's WaveNet\nA tensorflow implementation of speech recognition based on DeepMind's [WaveNet: A Generative Model for Raw 
Audio](https:\u002F\u002Farxiv.org\u002Fabs\u002F1609.03499). (Hereafter the Paper)\n\nAlthough [ibab](https:\u002F\u002Fgithub.com\u002Fibab\u002Ftensorflow-wavenet) and [tomlepaine](https:\u002F\u002Fgithub.com\u002Ftomlepaine\u002Ffast-wavenet) have already implemented WaveNet with tensorflow, they did not implement speech recognition. That's why we decided to implement it ourselves. \n\nSome of DeepMind's recent papers are tricky to reproduce. The Paper also omitted specific details about the implementation, and we had to fill the gaps in our own way.\n\nHere are a few important notes.\n\nFirst, while the Paper used the TIMIT dataset for the speech recognition experiment, we used the free VCTK dataset.\n\nSecond, the Paper added a mean-pooling layer after the dilated convolution layer for down-sampling. We extracted [MFCC](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMel-frequency_cepstrum) from wav files and removed the final mean-pooling layer because the original setting was impossible to run on our TitanX GPU.\n\nThird, since the TIMIT dataset has phoneme labels, the Paper trained the model with two loss terms, phoneme classification and next phoneme prediction. We, instead, used a single CTC loss because VCTK provides sentence-level labels. 
As a result, we used only dilated conv1d layers without any pooling layers.\n\nFinally, due to time constraints, we did not run quantitative analyses such as BLEU score, nor did we add post-processing with a language model.\n\nThe final architecture is shown in the following figure.\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fburiburisuri_speech-to-text-wavenet_readme_f9dde0c68283.png\" width=\"1024\"\u002F>\n\u003C\u002Fp>\n(Some images are cropped from [WaveNet: A Generative Model for Raw Audio](https:\u002F\u002Farxiv.org\u002Fabs\u002F1609.03499) and [Neural Machine Translation in Linear Time](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.10099))  \n\n\n## Version \n\nCurrent Version : __***0.0.0.2***__\n\n## Dependencies ( VERSION MUST BE MATCHED EXACTLY! )\n\n1. [tensorflow](https:\u002F\u002Fwww.tensorflow.org\u002Fversions\u002Fr0.11\u002Fget_started\u002Fos_setup.html#pip-installation) == 1.0.0\n1. [sugartensor](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Fsugartensor) == 1.0.0.2\n1. [pandas](http:\u002F\u002Fpandas.pydata.org\u002Fpandas-docs\u002Fstable\u002Finstall.html) >= 0.19.2\n1. [librosa](https:\u002F\u002Fgithub.com\u002Flibrosa\u002Flibrosa) == 0.5.0\n1. [scikits.audiolab](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fscikits.audiolab)==0.11.0\n\nIf you have problems with the librosa library, try installing ffmpeg with the following command. 
( Ubuntu 14.04 )  \n\u003Cpre>\u003Ccode>\nsudo add-apt-repository ppa:mc3man\u002Ftrusty-media\nsudo apt-get update\nsudo apt-get dist-upgrade -y\nsudo apt-get -y install ffmpeg\n\u003C\u002Fcode>\u003C\u002Fpre>\n\n## Dataset\n\nWe used the [VCTK](http:\u002F\u002Fhomepages.inf.ed.ac.uk\u002Fjyamagis\u002Fpage3\u002Fpage58\u002Fpage58.html), \n[LibriSpeech](http:\u002F\u002Fwww.openslr.org\u002F12\u002F) and [TEDLIUM release 2](http:\u002F\u002Fwww-lium.univ-lemans.fr\u002Fen\u002Fcontent\u002Fted-lium-corpus) corpora.\nThe training set built from these three corpora contains 240,612 sentences. \nThe validation and test sets are built from the LibriSpeech and TEDLIUM corpora only, because the VCTK corpus has no validation or test split. \nAfter downloading each corpus, extract it into the 'asset\u002Fdata\u002FVCTK-Corpus', 'asset\u002Fdata\u002FLibriSpeech' and \n 'asset\u002Fdata\u002FTEDLIUM_release2' directories, respectively. \n \nAudio was augmented using the scheme described in the paper by [Tom Ko et al](http:\u002F\u002Fspeak.clsp.jhu.edu\u002Fuploads\u002Fpublications\u002Fpapers\u002F1050_pdf.pdf). \n(Thanks to @migvel for the pointer)  \n\n## Pre-processing dataset\n\nThe TEDLIUM release 2 dataset provides audio in the SPH format, so it must be converted to a format \nthe librosa library can handle. Run the following command in the 'asset\u002Fdata' directory to convert SPH to WAV format.  \n\u003Cpre>\u003Ccode>\nfind -type f -name '*.sph' | awk '{printf \"sox -t sph %s -b 16 -t wav %s\\n\", $0, $0\".wav\" }' | bash\n\u003C\u002Fcode>\u003C\u002Fpre>\n\nIf `sox` is not installed, install it first.\n\u003Cpre>\u003Ccode>\nsudo apt-get install sox\n\u003C\u002Fcode>\u003C\u002Fpre>\n\nWe found that the main bottleneck during training is disk read time, so we decided to pre-process all audio data into \n  MFCC feature files, which are much smaller. We also highly recommend using an SSD instead of a hard drive.  
\n  Run the following command in the console to pre-process the whole dataset.\n\u003Cpre>\u003Ccode>\npython preprocess.py\n\u003C\u002Fcode>\u003C\u002Fpre>\n \n\n## Training the network\n\nExecute\n\u003Cpre>\u003Ccode>\n# Use all available GPUs\npython train.py\n# Or use only GPUs 0 and 1\nCUDA_VISIBLE_DEVICES=0,1 python train.py\n\u003C\u002Fcode>\u003C\u002Fpre>\nto train the network. The resulting ckpt files and log files are written to the 'asset\u002Ftrain' directory.\nLaunch `tensorboard --logdir asset\u002Ftrain\u002Flog` to monitor the training process.\n\nWe trained this model on 3 Nvidia 1080 Pascal GPUs for 40 hours, up to 50 epochs, and picked the epoch with the \nminimum validation loss. In our case, that was epoch 40.  If you hit an out-of-memory error, \nreduce batch_size in the train.py file from 16 to 4.  \n\nThe CTC losses at each epoch are shown in the following table:\n\n| epoch | train set | valid set | test set | \n| :----: | ----: | ----: | ----: |\n| 20 | 79.541500 | 73.645237 | 83.607269 |\n| 30 | 72.884180 | 69.738348 | 80.145867 |\n| 40 | 69.948266 | 66.834316 | 77.316114 |\n| 50 | 69.127240 | 67.639895 | 77.866674 |\n\n\n## Testing the network\n\nAfter training has finished, you can check the CTC loss on the train, valid, or test set with the following command.\n\u003Cpre>\u003Ccode>\npython test.py --set train|valid|test --frac 1.0(0.01~1.0)\n\u003C\u002Fcode>\u003C\u002Fpre>\nThe `frac` option is useful if you want to evaluate only a fraction of the dataset for a faster check. \n\n## Transforming speech wave file to English text \n \nExecute\n\u003Cpre>\u003Ccode>\npython recognize.py --file \u003Cwave_file path>\n\u003C\u002Fcode>\u003C\u002Fpre>\nto transform a speech wave file into an English sentence. The result is printed to the console. 
\n\nFor example, try the following commands.\n\u003Cpre>\u003Ccode>\npython recognize.py --file asset\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F1089\u002F134686\u002F1089-134686-0000.flac\npython recognize.py --file asset\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F1089\u002F134686\u002F1089-134686-0001.flac\npython recognize.py --file asset\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F1089\u002F134686\u002F1089-134686-0002.flac\npython recognize.py --file asset\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F1089\u002F134686\u002F1089-134686-0003.flac\npython recognize.py --file asset\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F1089\u002F134686\u002F1089-134686-0004.flac\n\u003C\u002Fcode>\u003C\u002Fpre>\n\nThe result will be as follows:\n\u003Cpre>\u003Ccode>\nhe hoped there would be stoo for dinner turnips and charrats and bruzed patatos and fat mutton pieces to be ladled out in th thick peppered flower fatan sauce\nstuffid into you his belly counsiled him\nafter early night fall the yetl lampse woich light hop here and there on the squalled quarter of the browfles\no berty and he god in your mind\nnumbrt tan fresh nalli is waiting on nou cold nit husband\n\u003C\u002Fcode>\u003C\u002Fpre>\n\nThe ground truth is as follows:\n\u003Cpre>\u003Ccode>\nHE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOUR FATTENED SAUCE\nSTUFF IT INTO YOU HIS BELLY COUNSELLED HIM\nAFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS\nHELLO BERTIE ANY GOOD IN YOUR MIND\nNUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND\n\u003C\u002Fcode>\u003C\u002Fpre>\n\nAs mentioned earlier, there is no language model, so the output lacks capitalization and punctuation, and some words are misspelled.\n\n## Pre-trained models\n\nYou can transform a speech wave file to English text with the model pre-trained on the VCTK corpus. 
\nExtract [the following zip file](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F0B3ILZKxzcrUyVWwtT25FemZEZ1k\u002Fview?usp=sharing&resourcekey=0-R4oPytT6GC2AGiIGi8L_ag) to the 'asset\u002Ftrain\u002F' directory.\n\n## Docker support\n\nSee docker [README.md](docker\u002FREADME.md).\n\n## Future work\n\n1. Language Model\n1. Polyglot(Multi-lingual) Model\n\nWe think the CTC beam decoder should be replaced with a practical language model,  \nand a polyglot speech recognition model would be a good candidate for future work.\n\n## Other resources\n\n1. [ibab's WaveNet(speech synthesis) tensorflow implementation](https:\u002F\u002Fgithub.com\u002Fibab\u002Ftensorflow-wavenet)\n1. [tomlepaine's Fast WaveNet(speech synthesis) tensorflow implementation](https:\u002F\u002Fgithub.com\u002Ftomlepaine\u002Ffast-wavenet)\n\n## Namju's other repositories\n\n1. [SugarTensor](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Fsugartensor)\n1. [EBGAN tensorflow implementation](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Febgan)\n1. [Timeseries gan tensorflow implementation](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Ftimeseries_gan)\n1. [Supervised InfoGAN tensorflow implementation](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Fsupervised_infogan)\n1. [AC-GAN tensorflow implementation](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Fac-gan)\n1. [SRGAN tensorflow implementation](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002FSRGAN)\n1. [ByteNet-Fast Neural Machine Translation](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002FByteNet)\n\n## Citation\n\nIf you find this code useful please cite us in your work:\n\n\u003Cpre>\u003Ccode>\nKim and Park. Speech-to-Text-WaveNet. 2016. GitHub repository. 
https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002F.\n\u003C\u002Fcode>\u003C\u002Fpre>\n\n# Authors\n\nNamju Kim (namju.kim@kakaocorp.com) at KakaoBrain Corp.\n\nKyubyong Park (kbpark@jamonglab.com) at KakaoBrain Corp.\n","# Speech-to-Text-WaveNet：使用 DeepMind 的 WaveNet 进行端到端句子级英语语音识别\n基于 DeepMind 的 [WaveNet：原始音频生成模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F1609.03499) 的 TensorFlow 语音识别实现。（以下简称“论文”）\n\n虽然 [ibab](https:\u002F\u002Fgithub.com\u002Fibab\u002Ftensorflow-wavenet) 和 [tomlepaine](https:\u002F\u002Fgithub.com\u002Ftomlepaine\u002Ffast-wavenet) 已经使用 TensorFlow 实现了 WaveNet，但他们没有实现语音识别。这就是为什么我们决定自己实现它。\n\nDeepMind 的一些近期论文难以复现。该论文也省略了关于实现的具体细节，我们必须以自己的方式填补这些空白。\n\n以下是一些重要说明。\n\n首先，虽然论文使用了 TIMIT 数据集进行语音识别实验，但我们使用了免费的 VCTK 数据集。\n\n其次，论文在膨胀卷积层（dilated convolution layer）后添加了一个平均池化层（mean-pooling layer）用于下采样。我们从 wav 文件中提取了 [MFCC](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMel-frequency_cepstrum)（梅尔频率倒谱系数），并移除了最终的平均池化层，因为原始设置在我们的 TitanX GPU 上无法运行。\n\n第三，由于 TIMIT 数据集具有音素标签，论文使用两个损失项训练模型：音素分类和下一个音素预测。相反，我们使用了单一的 CTC 损失（Connectionist Temporal Classification Loss），因为 VCTK 提供句子级标签。因此，我们仅使用了膨胀卷积 1D 层，而没有使用任何池化层。\n\n最后，由于时间限制，我们没有进行诸如 BLEU 分数（BLEU Score）之类的定量分析，也没有结合语言模型进行后处理。\n\n最终架构如下图所示。\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fburiburisuri_speech-to-text-wavenet_readme_f9dde0c68283.png\" width=\"1024\"\u002F>\n\u003C\u002Fp>\n（部分图片裁剪自 [WaveNet：原始音频生成模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F1609.03499) 和 [线性时间的神经机器翻译](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.10099)）  \n\n\n## 版本 \n\n当前版本：__***0.0.0.2***__\n\n## 依赖项（版本号必须完全匹配！）\n\n1. [tensorflow](https:\u002F\u002Fwww.tensorflow.org\u002Fversions\u002Fr0.11\u002Fget_started\u002Fos_setup.html#pip-installation) == 1.0.0\n1. [sugartensor](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Fsugartensor) == 1.0.0.2\n1. [pandas](http:\u002F\u002Fpandas.pydata.org\u002Fpandas-docs\u002Fstable\u002Finstall.html) >= 0.19.2\n1. 
[librosa](https:\u002F\u002Fgithub.com\u002Flibrosa\u002Flibrosa) == 0.5.0\n1. [scikits.audiolab](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fscikits.audiolab)==0.11.0\n\n如果您在使用 librosa 库时遇到问题，请尝试通过以下命令安装 ffmpeg。（Ubuntu 14.04）  \n\u003Cpre>\u003Ccode>\nsudo add-apt-repository ppa:mc3man\u002Ftrusty-media\nsudo apt-get update\nsudo apt-get dist-upgrade -y\nsudo apt-get -y install ffmpeg\n\u003C\u002Fcode>\u003C\u002Fpre>\n\n## 数据集\n\n我们使用了 [VCTK](http:\u002F\u002Fhomepages.inf.ed.ac.uk\u002Fjyamagis\u002Fpage3\u002Fpage58\u002Fpage58.html)、[LibriSpeech](http:\u002F\u002Fwww.openslr.org\u002F12\u002F) 和 [TEDLIUM release 2](http:\u002F\u002Fwww-lium.univ-lemans.fr\u002Fen\u002Fcontent\u002Fted-lium-corpus) 语料库。\n由上述三个语料库组成的训练集中的句子总数为 240,612。\n验证集和测试集仅使用 LibriSpeech 和 TEDLIUM 语料库构建，因为 VCTK 语料库没有验证集和测试集。\n下载每个语料库后，将它们解压到 'asset\u002Fdata\u002FVCTK-Corpus'、'asset\u002Fdata\u002FLibriSpeech' 和 \n 'asset\u002Fdata\u002FTEDLIUM_release2' 目录中。\n \n音频增强方案参考了 [Tom Ko 等人](http:\u002F\u002Fspeak.clsp.jhu.edu\u002Fuploads\u002Fpublications\u002Fpapers\u002F1050_pdf.pdf) 的论文。\n(感谢 @migvel 提供的信息)  \n\n## 数据集预处理\n\nTEDLIUM release 2 数据集提供的是 SPH 格式的音频数据，因此我们需要将其转换为 librosa 库可以处理的格式。在 'asset\u002Fdata' 目录下运行以下命令将 SPH 转换为 WAV 格式。  \n\u003Cpre>\u003Ccode>\nfind -type f -name '*.sph' | awk '{printf \"sox -t sph %s -b 16 -t wav %s\\n\", $0, $0\".wav\" }' | bash\n\u003C\u002Fcode>\u003C\u002Fpre>\n\n如果您尚未安装 `sox`，请先安装它。\n\u003Cpre>\u003Ccode>\nsudo apt-get install sox\n\u003C\u002Fcode>\u003C\u002Fpre>\n\n我们发现训练时的主要瓶颈是磁盘读取时间，因此我们决定将整个音频数据预处理为小得多的 MFCC 特征文件。我们强烈建议使用 SSD 而不是机械硬盘。  \n在控制台运行以下命令以预处理整个数据集。\n\u003Cpre>\u003Ccode>\npython preprocess.py\n\u003C\u002Fcode>\u003C\u002Fpre>\n \n\n## 训练网络\n\n执行以下命令来训练网络。\n\u003Cpre>\u003Ccode>\npython train.py ( \u003C== 使用所有可用的 GPU )\nor\nCUDA_VISIBLE_DEVICES=0,1 python train.py ( \u003C== 仅使用 GPU 0, 1 )\n\u003C\u002Fcode>\u003C\u002Fpre>\n您可以在 'asset\u002Ftrain' 目录中看到结果 ckpt 文件和日志文件。\n启动 `tensorboard --logdir asset\u002Ftrain\u002Flog` 
以监控训练过程。\n\n我们在 3 块 Nvidia 1080 Pascal GPU 上训练此模型，耗时 40 小时，达到 50 个 epoch，并选择了验证损失最小的那个 epoch。在我们的案例中，它是第 40 个 epoch。如果遇到内存不足错误，请将 train.py 文件中的 batch_size 从 16 减少到 4。  \n\n每个 epoch 的 CTC 损失如下表所示：\n\n| epoch | train set | valid set | test set | \n| :----: | ----: | ----: | ----: |\n| 20 | 79.541500 | 73.645237 | 83.607269 |\n| 30 | 72.884180 | 69.738348 | 80.145867 |\n| 40 | 69.948266 | 66.834316 | 77.316114 |\n| 50 | 69.127240 | 67.639895 | 77.866674 |\n\n\n## 测试网络\n\n训练完成后，您可以通过以下命令检查验证集或测试集的 CTC 损失。\n\u003Cpre>\u003Ccode>\npython test.py --set train|valid|test --frac 1.0(0.01~1.0)\n\u003C\u002Fcode>\u003C\u002Fpre>\n如果您只想测试数据集的一部分以进行快速评估，`frac` 选项将很有用。\n\n## 将语音波形文件转换为英文文本\n \n执行\n\u003Cpre>\u003Ccode>\npython recognize.py --file \u003Cwave_file path>\n\u003C\u002Fcode>\u003C\u002Fpre>\n将语音波形文件转换为英文句子。结果将打印在控制台上。 \n\n例如，尝试以下命令。\n\u003Cpre>\u003Ccode>\npython recognize.py --file asset\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F1089\u002F134686\u002F1089-134686-0000.flac\npython recognize.py --file asset\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F1089\u002F134686\u002F1089-134686-0001.flac\npython recognize.py --file asset\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F1089\u002F134686\u002F1089-134686-0002.flac\npython recognize.py --file asset\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F1089\u002F134686\u002F1089-134686-0003.flac\npython recognize.py --file asset\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F1089\u002F134686\u002F1089-134686-0004.flac\n\u003C\u002Fcode>\u003C\u002Fpre>\n\n结果如下：\n\u003Cpre>\u003Ccode>\nhe hoped there would be stoo for dinner turnips and charrats and bruzed patatos and fat mutton pieces to be ladled out in th thick peppered flower fatan sauce\nstuffid into you his belly counsiled him\nafter early night fall the yetl lampse woich light hop here and there on the squalled quarter of the browfles\no berty and he god in your mind\nnumbrt tan fresh nalli is waiting on nou cold nit 
husband\n\u003C\u002Fcode>\u003C\u002Fpre>\n\n真实标签 (Ground Truth) 如下：\n\u003Cpre>\u003Ccode>\nHE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOUR FATTENED SAUCE\nSTUFF IT INTO YOU HIS BELLY COUNSELLED HIM\nAFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS\nHELLO BERTIE ANY GOOD IN YOUR MIND\nNUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND\n\u003C\u002Fcode>\u003C\u002Fpre>\n\n如前所述，没有语言模型 (Language Model)，因此存在大写字母、标点和单词拼写错误的情况。\n\n## 预训练模型 (Pre-trained Models)\n\n您可以使用 VCTK 语料库 (Corpus) 上的预训练模型将语音波形文件转换为英文文本。 \n提取 [以下 zip 文件](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F0B3ILZKxzcrUyVWwtT25FemZEZ1k\u002Fview?usp=sharing&resourcekey=0-R4oPytT6GC2AGiIGi8L_ag) 到 'asset\u002Ftrain\u002F' 目录中。\n\n## Docker 支持\n\n请查看 docker [README.md](docker\u002FREADME.md)。\n\n## 未来工作\n\n1. 语言模型 (Language Model)\n1. 多语言（多语种）模型 (Polyglot\u002FMulti-lingual Model)\n\n我们认为应该用实用的语言模型替换 CTC 束解码器 (CTC Beam Decoder)，而多语言语音识别模型将是未来工作的良好候选方案。\n\n## 其他资源\n\n1. [ibab 的 WaveNet(语音合成) TensorFlow 实现](https:\u002F\u002Fgithub.com\u002Fibab\u002Ftensorflow-wavenet)\n1. [tomlepaine 的快速 WaveNet(语音合成) TensorFlow 实现](https:\u002F\u002Fgithub.com\u002Fibab\u002Ftensorflow-wavenet)\n\n## Namju 的其他仓库\n\n1. [SugarTensor](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Fsugartensor)\n1. [EBGAN TensorFlow 实现](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Febgan)\n1. [时间序列 GAN TensorFlow 实现](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Ftimeseries_gan)\n1. [监督 InfoGAN TensorFlow 实现](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Fsupervised_infogan)\n1. [AC-GAN TensorFlow 实现](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Fac-gan)\n1. [SRGAN TensorFlow 实现](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002FSRGAN)\n1. 
[ByteNet-快速神经机器翻译](https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002FByteNet)\n\n## 引用\n\n如果您发现此代码有用，请在您的工作中引用我们：\n\n\u003Cpre>\u003Ccode>\nKim and Park. Speech-to-Text-WaveNet. 2016. GitHub repository. https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002F.\n\u003C\u002Fcode>\u003C\u002Fpre>\n\n# 作者\n\nNamju Kim (namju.kim@kakaocorp.com) at KakaoBrain Corp.\n\nKyubyong Park (kbpark@jamonglab.com) at KakaoBrain Corp.","# Speech-to-Text-WaveNet 快速上手指南\n\n**Speech-to-Text-WaveNet** 是一个基于 DeepMind WaveNet 架构的端到端英语语音识别开源项目（TensorFlow 实现）。本项目旨在复现原始论文，并针对实际硬件环境进行了调整。\n\n> **注意**：该项目依赖较旧的 TensorFlow 版本 (1.0.0)，建议在兼容的旧版系统或 Docker 环境中运行。\n\n## 环境准备\n\n### 系统要求\n*   **操作系统**：Linux (推荐 Ubuntu 14.04+)\n*   **硬件**：NVIDIA GPU (如 TitanX, GTX 1080 Pascal 等)，建议使用 SSD 硬盘以加速数据读取。\n*   **CUDA\u002FcuDNN**：需配置与 TensorFlow 1.0.0 兼容的版本（通常对应 CUDA 8.0 \u002F cuDNN 5.1）。\n\n### 前置依赖\n请确保以下库的版本严格匹配（Exact Match）：\n\n| 库名称 | 必需版本 |\n| :--- | :--- |\n| tensorflow | == 1.0.0 |\n| sugartensor | == 1.0.0.2 |\n| pandas | >= 0.19.2 |\n| librosa | == 0.5.0 |\n| scikits.audiolab | == 0.11.0 |\n\n此外，还需安装系统级音频处理工具：\n```bash\nsudo apt-get install sox ffmpeg\n```\n\n## 安装步骤\n\n### 1. 克隆代码\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Fspeech-to-text-wavenet.git\ncd speech-to-text-wavenet\n```\n\n### 2. 安装 Python 依赖\n推荐使用国内镜像源加速下载：\n```bash\npip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple tensorflow==1.0.0 sugartensor==1.0.0.2 pandas librosa==0.5.0 scikits.audiolab==0.11.0\n```\n\n### 3. 准备数据集\n项目支持 VCTK、LibriSpeech 和 TEDLIUM release 2 语料库。下载后解压至对应目录：\n```bash\nasset\u002Fdata\u002FVCTK-Corpus\nasset\u002Fdata\u002FLibriSpeech\nasset\u002Fdata\u002FTEDLIUM_release2\n```\n*(注：VCTK 无验证集和测试集，验证\u002F测试集仅由 LibriSpeech 和 TEDLIUM 构建)*\n\n### 4. 
数据预处理\n若数据集包含 `.sph` 格式（如 TEDLIUM），需先转换为 wav：\n```bash\nfind -type f -name '*.sph' | awk '{printf \"sox -t sph %s -b 16 -t wav %s\\n\", $0, $0\".wav\" }' | bash\n```\n随后生成 MFCC 特征文件（推荐）：\n```bash\npython preprocess.py\n```\n\n## 基本使用\n\n### 训练模型\n使用可用 GPU 进行训练：\n```bash\npython train.py \n```\n或使用指定 GPU (例如 0, 1)：\n```bash\nCUDA_VISIBLE_DEVICES=0,1 python train.py \n```\n监控训练过程：\n```bash\ntensorboard --logdir asset\u002Ftrain\u002Flog\n```\n*提示：如遇内存不足，请在 `train.py` 中将 `batch_size` 从 16 调低至 4。*\n\n### 语音转文字 (推理)\n使用训练好的模型将音频文件转换为英文文本：\n```bash\npython recognize.py --file \u003Cwave_file path>\n```\n**示例：**\n```bash\npython recognize.py --file asset\u002Fdata\u002FLibriSpeech\u002Ftest-clean\u002F1089\u002F134686\u002F1089-134686-0000.flac\n```\n\n### 使用预训练模型\n若无需自行训练，可下载官方在 VCTK 语料上训练的预训练模型，解压至 `asset\u002Ftrain\u002F` 目录即可直接使用 `recognize.py`。","某高校研究团队需要整理大量英文学术讲座录音，以便建立可搜索的文本知识库并辅助后续文献分析。\n\n### 没有 speech-to-text-wavenet 时\n- 人工听写效率极低，处理一小时音频需耗费数天时间，人力成本过高\n- 依赖云端 API 导致原始录音数据存在隐私泄露风险，且按分钟计费累积成本高昂\n- 传统开源方案多基于旧架构，对长句子的上下文理解能力不足，断句错误频发\n- 缺乏本地化部署能力，在网络不稳定或离线环境下无法完成批量转换任务\n\n### 使用 speech-to-text-wavenet 后\n- 基于 DeepMind WaveNet 架构实现端到端句子级识别，显著提升了复杂语境下的转录准确率\n- 支持本地 TensorFlow 环境运行，完全掌握数据主权，无需将敏感录音上传至第三方服务器\n- 利用 CTC 损失函数优化训练流程，有效解决了连续语音流中的分词与对齐难题\n- 结合 VCTK 等公开数据集微调模型参数，能更好地适配特定口音、语速及背景噪声\n\n通过本地化部署高精度语音识别模型，团队实现了安全、低成本且高效的英文音频转文本自动化流程。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fburiburisuri_speech-to-text-wavenet_cff58d54.png","buriburisuri","Namju Kim","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fburiburisuri_552ed34a.jpg","CEO @quarkonix.com\r\nformer CIC leader @krustuniverse.com\r\nformer research head @kakaobrain.com","QUARKONIX PTE.,LTD.","Samsung-dong, Seoul, Republic of 
Korea.",null,"http:\u002F\u002Fchromatic.finance","https:\u002F\u002Fgithub.com\u002Fburiburisuri",[86,90],{"name":87,"color":88,"percentage":89},"Python","#3572A5",94.5,{"name":91,"color":92,"percentage":93},"Dockerfile","#384d54",5.5,4012,791,"2026-04-01T06:50:07","Apache-2.0","Linux","需要 NVIDIA GPU (如 GTX 1080)，显存建议 8GB+，CUDA 版本需兼容 TensorFlow 1.0.0 (约 CUDA 8.0)","未说明",{"notes":102,"python":100,"dependencies":103},"需通过 apt-get 安装系统级工具 ffmpeg 和 sox；训练前需运行 preprocess.py 预处理数据为 MFCC 格式；强烈建议使用 SSD 硬盘；显存不足时需减小 train.py 中的 batch_size；模型基于旧版 TensorFlow 1.0.0，无语言模型后处理，识别结果可能存在拼写错误。",[104,105,106,107,108],"tensorflow==1.0.0","sugartensor==1.0.0.2","pandas>=0.19.2","librosa==0.5.0","scikits.audiolab==0.11.0",[21],5,"2026-03-27T02:49:30.150509","2026-04-06T08:46:50.425229",[114,119,123,127,132,136],{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},2613,"运行脚本时报错 ImportError: No module named 'sg_util' 或 Tkinter 怎么办？","这通常是 Python 环境或 librosa 库的配置问题。维护者建议检查环境配置。如果是 CentOS 系统，请执行命令 `sudo yum install tk-devel tkinter`；如果是 Arch Linux，请执行 `sudo pacman -S tk`。也可参考 StackOverflow 关于 tkinter 配置的讨论。","https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Fspeech-to-text-wavenet\u002Fissues\u002F5",{"id":120,"question_zh":121,"answer_zh":122,"source_url":118},2614,"在 CentOS 系统上遇到 sg_util 导入错误的具体解决方案是什么？","根据社区反馈，CentOS 用户需要安装 tk 开发包。请在终端执行以下命令解决依赖缺失问题：`sudo yum install tk-devel tkinter`。",{"id":124,"question_zh":125,"answer_zh":126,"source_url":118},2615,"在 Arch Linux 系统上遇到 sg_util 导入错误的具体解决方案是什么？","Arch Linux 用户需要通过包管理器安装 tk 库。请在终端执行以下命令：`sudo pacman -S tk`。",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},2616,"安装完成后找不到 assets\u002Fdata 目录，语料应该放在哪里？","请将 VCTK-corpus 文件夹内的所有文件和子文件夹直接放入项目的 `assets\u002Fdata` 目录中。不要只放压缩包，需解压后的内容。","https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Fspeech-to-text-wavenet\u002Fissues\u002F33",{"id":133,"question_zh":134,"answer_zh":135,"source_url":131},2617,"使用预训练模型时提示 IOError: File asset\u002Fdata\u002Fspeaker-info.txt does not 
exist 如何解决？","该项目主要使用 VCTK-corpus，没有其他额外的预训练模型。请确保已完整下载并放置了 VCTK 语料数据到 `assets\u002Fdata` 目录，该文件应包含在语料包内。",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},2618,"训练时出现 No valid path found 警告且 Loss 变为 inf 是什么原因？","这可能与模型结构修改有关。默认模型未使用 RNN 层，如果自行修改了 `model.py` 添加了 RNN 层可能导致此问题。建议尝试使用原始的 `model.py` 配置，并确保 WAV 文件为 PCM 编码格式。","https:\u002F\u002Fgithub.com\u002Fburiburisuri\u002Fspeech-to-text-wavenet\u002Fissues\u002F74",[]]