[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-sooftware--kospeech":3,"tool-sooftware--kospeech":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":82,"owner_twitter":80,"owner_website":83,"owner_url":84,"languages":85,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":10,"env_os":98,"env_gpu":99,"env_ram":98,"env_deps":100,"category_tags":114,"github_topics":115,"view_count":23,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":129,"updated_at":130,"faqs":131,"releases":166},2868,"sooftware\u002Fkospeech","kospeech","Open-Source Toolkit for End-to-End Korean Automatic Speech Recognition leveraging PyTorch and Hydra.","KoSpeech 是一个专为韩语语音识别打造的开源工具包，基于 PyTorch 深度学习框架构建，旨在提供端到端的自动语音识别（ASR）解决方案。在 KoSpeech 出现之前，主流的开源语音识别工具多专注于英语等非韩语环境，导致韩语研究者缺乏统一的预处理方法和基准模型进行性能对比。即便有了如 KsponSpeech 这样的大规模韩语语料库，业界也长期缺少标准化的研究基线。KoSpeech 填补了这一空白，它不仅提供了针对韩语数据的标准化预处理流程，还集成了 Deep Speech 2、LAS、Transformer、Jasper 及 Conformer 等多种经典与前沿的声学模型，成为韩语语音识别研究的重要参考指南。\n\n该项目特别适合人工智能研究人员、算法工程师以及希望深入探索韩语语音技术的学生使用。其技术亮点在于高度的模块化与可扩展性，并引入了 Hydra 框架来优雅地管理复杂的应用配置，让模型训练与实验复现更加便捷。需要注意的是，目前该仓库已归档，作者建议有新项目需求的用户转向其继任者 OpenSpeech，或尝试 Pororo ASR 与 Whisper 进行快速测试，但 Ko","KoSpeech 是一个专为韩语语音识别打造的开源工具包，基于 PyTorch 深度学习框架构建，旨在提供端到端的自动语音识别（ASR）解决方案。在 KoSpeech 出现之前，主流的开源语音识别工具多专注于英语等非韩语环境，导致韩语研究者缺乏统一的预处理方法和基准模型进行性能对比。即便有了如 KsponSpeech 这样的大规模韩语语料库，业界也长期缺少标准化的研究基线。KoSpeech 填补了这一空白，它不仅提供了针对韩语数据的标准化预处理流程，还集成了 Deep Speech 2、LAS、Transformer、Jasper 及 Conformer 等多种经典与前沿的声学模型，成为韩语语音识别研究的重要参考指南。\n\n该项目特别适合人工智能研究人员、算法工程师以及希望深入探索韩语语音技术的学生使用。其技术亮点在于高度的模块化与可扩展性，并引入了 Hydra 框架来优雅地管理复杂的应用配置，让模型训练与实验复现更加便捷。需要注意的是，目前该仓库已归档，作者建议有新项目需求的用户转向其继任者 OpenSpeech，或尝试 Pororo ASR 与 Whisper 进行快速测试，但 KoSpeech 留下的代码架构与论文成果依然是理解韩语语音识别发展的宝贵资源。","\u003Cp  align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsooftware_kospeech_readme_78d88f772a1a.png\" height=100>\n\n\n\u003Cdiv align=\"center\">\n\n**An Apache 2.0 ASR research library, built on PyTorch, for developing end-to-end speech recognition models.**\n\n  \n\u003C\u002Fdiv>\n  \n---\n  \n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech#introduction\">Introduction\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech#introduction\">Roadmap\u003C\u002Fa> •\n  \u003Ca href=\"sooftware.github.io\u002Fkospeech\u002F\">Docs\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Fwww.codefactor.io\u002Frepository\u002Fgithub\u002Fsooftware\u002Fkospeech\">Codefactor\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Fblob\u002Fmain\u002FLICENSE\">License\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Fgitter.im\u002FKorean-Speech-Recognition\u002Fcommunity\">Gitter\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS2665963821000026\">Paper\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\u003C\u002Fdiv>\n     \n#### This repository archived. If the reason why you found this repo is below, I will recommend a different repository for each reason.\n\n- I want to train my own voice recognition model or study internal code! **→** [OpenSpeech](https:\u002F\u002Fgithub.com\u002Fopenspeech-team\u002Fopenspeech)\n- I want to test the trained Korean speech recognition model right away! **→** [Pororo ASR](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fpororo) or [Whisper](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper)\n   \n### What's New\n- May 2021: Fix LayerNorm Error, Subword Error\n- Febuary 2021: Update Documentation\n- Febuary 2021: Add RNN-Transducer model\n- January 2021: Release v1.3\n- January 2021: Add Conformer model\n- January 2021: Add Jasper model\n- January 2021: Add Joint CTC-Attention Transformer model\n- January 2021: Add Speech Transformer model\n- January 2021: Apply [Hydra: framework for elegantly configuring complex applications](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhydra)\n  \n### Note\n  \n- Not long ago, I modified a lot of the code, but I was personally busy, so I couldn't test all the cases. If there is an error, please feel free to give me a feedback.\n- Subword and Grapheme unit currently not tested.\n  \n### ***[KoSpeech:  Open-Source Toolkit for End-to-End Korean Speech Recognition \\[Paper\\]](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS2665963821000026)***\n  \n***KoSpeech***, an open-source software, is modular and extensible end-to-end Korean automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch. Several automatic speech recognition open-source toolkits have been released, but all of them deal with non-Korean languages, such as English (e.g. ESPnet, Espresso). Although AI Hub opened 1,000 hours of Korean speech corpus known as KsponSpeech, there is no established preprocessing method and baseline model to compare model performances. Therefore, we propose preprocessing methods for KsponSpeech corpus and a several models (Deep Speech 2, LAS, Transformer, Jasper, Conformer). By KoSpeech, we hope this could be a guideline for those who research Korean speech recognition.  \n  \n### Supported Models\n  \n|Acoustic Model|Notes|Citation|  \n|--------------|------|--------:|  \n|Deep Speech 2|2D-invariant convolution & RNN & CTC|[Dario Amodei et al., 2015](https:\u002F\u002Farxiv.org\u002Fabs\u002F1512.02595)|   \n|Listen Attend Spell (LAS)|Attention based RNN sequence to sequence|[William Chan et al., 2016](https:\u002F\u002Farxiv.org\u002Fabs\u002F1508.01211)|  \n|Joint CTC-Attention LAS|Joint CTC-Attention LAS|[Suyoun Kim et al., 2017](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1609.06773.pdf)|  \n|RNN-Transducer|RNN Transducer|[Ales Graves. 2012](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1211.3711.pdf)|  \n|Speech Transformer|Convolutional extractor & transformer|[Linhao Dong et al., 2018](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F8462506)|  \n|Jasper|Fully convolutional & dense residual connection & CTC|[Jason Li et al., 2019](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1904.03288.pdf)|  \n|Conformer|Convolution-augmented-Transformer|[Anmol Gulati et al., 2020](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.08100)|  \n  \n- **Note**  \nIt is based on the above papers, but there may be other parts of the model implementation.\n  \n  \n## Introduction\n  \nEnd-to-end (E2E) automatic speech recognition (ASR) is an emerging paradigm in the field of neural network-based speech recognition that offers multiple benefits. Traditional “hybrid” ASR systems, which are comprised of an acoustic model, language model, and pronunciation model, require separate training of these components, each of which can be complex.   \n  \nFor example, training of an acoustic model is a multi-stage process of model training and time alignment between the speech acoustic feature sequence and output label sequence. In contrast, E2E ASR is a single integrated approach with a much simpler training pipeline with models that operate at low audio frame rates. This reduces the training time, decoding time, and allows joint optimization with downstream processing such as natural language understanding.  \n  \n## Roadmap\n  \nSo far, serveral models are implemented: *Deep Speech 2, Listen Attend and Spell (LAS), RNN-Transducer, Speech Transformer, Jasper, Conformer*.\n  \n- *Deep Speech 2*  \n  \nDeep Speech 2 showed faster and more accurate performance on ASR tasks with Connectionist Temporal Classification (CTC) loss. This model has been highlighted for significantly increasing performance compared to the previous end- to-end models.\n\n  \n- *Listen, Attend and Spell (LAS)*\n   \nWe follow the architecture previously proposed in the \"Listen, Attend and Spell\", but some modifications were added to improve performance. We provide four different attention mechanisms, `scaled dot-product attention`, `additive attention`, `location aware attention`, `multi-head attention`. Attention mechanisms much affect the performance of models. \n  \n- *RNN-Transducer*\n  \nRNN-Transducer are a form of sequence-to-sequence models that do not employ attention mechanisms. Unlike most sequence-to-sequence models, which typically need to process the entire input sequence (the waveform in our case) to produce an output (the sentence), the RNN-T continuously processes input samples and streams output symbols, a property that is welcome for speech dictation. In our implementation, the output symbols are the characters of the alphabet.\n  \n- *Speech Transformer*  \n  \nTransformer is a powerful architecture in the Natural Language Processing (NLP) field. This architecture also showed good performance at ASR tasks. In addition, as the research of this model continues in the natural language processing field, this model has high potential for further development.\n  \n- *Joint CTC-Attention*\n  \nWith the proposed architecture to take advantage of both the CTC-based model and the attention-based model. It is a structure that makes it robust by adding CTC to the encoder. Joint CTC-Attention can be trained in combination with LAS and Speech Transformer.  \n  \n- *Jasper*  \n  \nJasper (Just Another SPEech Recognizer) is a end-to-end convolutional neural acoustic model. Jasper showed powerful performance with only CNN → BatchNorm → ReLU → Dropout block and residential connection.  \n  \n- *Conformer*\n  \nConformer combine convolution neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way. Conformer significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies.\n  \n## Installation\nThis project recommends Python 3.7 or higher.   \nWe recommend creating a new virtual environment for this project (using virtual env or conda).  \n\n### Prerequisites\n  \n* Numpy: `pip install numpy` (Refer [here](https:\u002F\u002Fgithub.com\u002Fnumpy\u002Fnumpy) for problem installing Numpy).\n* Pytorch: Refer to [PyTorch website](http:\u002F\u002Fpytorch.org\u002F) to install the version w.r.t. your environment.   \n* Pandas: `pip install pandas` (Refer [here](https:\u002F\u002Fgithub.com\u002Fpandas-dev\u002Fpandas) for problem installing Pandas)  \n* Matplotlib: `pip install matplotlib` (Refer [here](https:\u002F\u002Fgithub.com\u002Fmatplotlib\u002Fmatplotlib) for problem installing Matplotlib)\n* librosa: `conda install -c conda-forge librosa` (Refer [here](https:\u002F\u002Fgithub.com\u002Flibrosa\u002Flibrosa) for problem installing librosa)\n* torchaudio: `pip install torchaudio==0.6.0` (Refer [here](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch) for problem installing torchaudio)\n* tqdm: `pip install tqdm` (Refer [here](https:\u002F\u002Fgithub.com\u002Ftqdm\u002Ftqdm) for problem installing tqdm)\n* sentencepiece: `pip install sentencepiece` (Refer [here](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece) for problem installing sentencepiece)\n* warp-rnnt: `pip install warp_rnnt` (Refer [here](https:\u002F\u002Fgithub.com\u002F1ytic\u002Fwarp-rnnt)) for problem installing warp-rnnt)\n* hydra: `pip install hydra-core --upgrade` (Refer [here](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhydra) for problem installing hydra)\n  \n### Install from source\nCurrently we only support installation from source code using setuptools. Checkout the source code and run the   \nfollowing commands:  \n```\npip install -e .\n```\n  \n## Get Started\n  \nWe use [Hydra](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhydra) to control all the training configurations. If you are not familiar with Hydra we recommend visiting the [Hydra website](https:\u002F\u002Fhydra.cc\u002F). Generally, Hydra is an open-source framework that simplifies the development of research applications by providing the ability to create a hierarchical configuration dynamically.\n  \n### Preparing KsponSpeech Dataset (LibriSpeech also supports)\n  \nDownload from [here](https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech#pre-processed-transcripts) or refer to the following to preprocess.\n  \n- KsponSpeech : [Check this page](https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Ftree\u002Fmaster\u002Fdataset\u002Fkspon)\n- LibriSpeech : [Check this page](https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Ftree\u002Fmaster\u002Fdataset\u002Flibri)\n  \n### Training KsponSpeech Dataset\n  \nYou can choose from several models and training options. There are many other training options, so look carefully and execute the following command:  \n  \n- **Deep Speech 2** Training\n```\npython .\u002Fbin\u002Fmain.py model=ds2 train=ds2_train train.dataset_path=$DATASET_PATH\n```\n  \n- **Listen, Attend and Spell** Training\n```\npython .\u002Fbin\u002Fmain.py model=las train=las_train train.dataset_path=$DATASET_PATH\n```\n  \n- **Joint CTC-Attention Listen, Attend and Spell** Training\n```\npython .\u002Fbin\u002Fmain.py model=joint-ctc-attention-las train=las_train train.dataset_path=$DATASET_PATH\n```\n  \n- **RNN Transducer** Training\n```\npython .\u002Fbin\u002Fmain.py model=rnnt train=rnnt_train train.dataset_path=$DATASET_PATH\n```\n  \n- **Speech Transformer** Training\n```\npython .\u002Fbin\u002Fmain.py model=transformer train=transformer_train train.dataset_path=$DATASET_PATH\n```\n  \n- **Joint CTC-Attention Speech Transformer** Training\n```\npython .\u002Fbin\u002Fmain.py model=joint-ctc-attention-transformer train=transformer_train train.dataset_path=$DATASET_PATH\n```\n  \n- **Jasper** Training\n```\npython .\u002Fbin\u002Fmain.py model=jasper train=jasper_train train.dataset_path=$DATASET_PATH\n```\n  \n- **Conformer** Training\n```\npython .\u002Fbin\u002Fmain.py model=conformer-large train=conformer_large_train train.dataset_path=$DATASET_PATH\n```\nYou can train with `conformer-medium`, `conformer-small` model.\n  \n  \n### Evaluate for KsponSpeech\n```\npython .\u002Fbin\u002Feval.py eval.dataset_path=$DATASET_PATH eval.transcripts_path=$TRANSCRIPTS_PATH eval.model_path=$MODEL_PATH\n```\n  \nNow you have a model which you can use to predict on new data. We do this by running `greedy search` or `beam search`.  \n  \n### Inference One Audio with Pre-train Models\n\n* Command\n```\n$ python3 .\u002Fbin\u002Finference.py --model_path $MODEL_PATH --audio_path $AUDIO_PATH --device $DEVICE\n```\n* Output\n```\n음성인식 결과 문장이 나옵니다\n```  \nYou can get a quick look of pre-trained model's inference, with a audio.  \n  \n### Checkpoints   \nCheckpoints are organized by experiments and timestamps as shown in the following file structure.  \n```\noutputs\n+-- YYYY_mm_dd\n|  +-- HH_MM_SS\n   |  +-- trainer_states.pt\n   |  +-- model.pt\n```\nYou can resume and load from checkpoints.\n  \n## Troubleshoots and Contributing\nIf you have any questions, bug reports, and feature requests, please [open an issue](https:\u002F\u002Fgithub.com\u002Fsooftware\u002FEnd-to-end-Speech-Recognition\u002Fissues) on Github.   \nFor live discussions, please go to our [gitter](https:\u002F\u002Fgitter.im\u002FKorean-Speech-Recognition\u002Fcommunity) or Contacts sh951011@gmail.com please.\n  \nWe appreciate any kind of feedback or contribution.  Feel free to proceed with small issues like bug fixes, documentation improvement.  For major contributions and new features, please discuss with the collaborators in corresponding issues.  \n  \n### Code Style\nWe follow [PEP-8](https:\u002F\u002Fwww.python.org\u002Fdev\u002Fpeps\u002Fpep-0008\u002F) for code style. Especially the style of docstrings is important to generate documentation.  \n    \n### Paper References\n  \n*Ilya Sutskever et al. [Sequence to Sequence Learning with Neural Networks](https:\u002F\u002Farxiv.org\u002Fabs\u002F1409.3215) arXiv: 1409.3215*  \n  \n*Dzmitry Bahdanau et al. [Neural Machine Translation by Jointly Learning to Align and Translate](https:\u002F\u002Farxiv.org\u002Fabs\u002F1409.0473) arXiv: 1409.0473*   \n  \n*Jan Chorowski et al. [Attention Based Models for Speech Recognition](https:\u002F\u002Farxiv.org\u002Fabs\u002F1506.07503) arXiv: 1506.07503*    \n  \n*Wiliam Chan et al. [Listen, Attend and Spell](https:\u002F\u002Farxiv.org\u002Fabs\u002F1508.01211) arXiv: 1508.01211*   \n   \n*Dario Amodei et al. [Deep Speech2: End-to-End Speech Recognition in English and Mandarin](https:\u002F\u002Farxiv.org\u002Fabs\u002F1512.02595) arXiv: 1512.02595*   \n   \n*Takaaki Hori et al. [Advances in Joint CTC-Attention based E2E Automatic Speech Recognition with a Deep CNN Encoder and RNN-LM](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.02737) arXiv: 1706.02737*   \n  \n*Ashish Vaswani et al. [Attention Is All You Need](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.03762) arXiv: 1706.03762*     \n  \n*Chung-Cheng Chiu et al. [State-of-the-art Speech Recognition with Sequence-to-Sequence Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.01769) arXiv: 1712.01769*   \n  \n*Anjuli Kannan et al. [An Analysis Of Incorporating An External LM Into A Sequence-to-Sequence Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.01996) arXiv: 1712.01996*  \n  \n*Daniel S. Park et al. [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.08779) arXiv: 1904.08779*     \n    \n*Rafael Muller et al. [When Does Label Smoothing Help?](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.02629) arXiv: 1906.02629*   \n  \n*Daniel S. Park et al. [SpecAugment on large scale datasets](https:\u002F\u002Farxiv.org\u002Fabs\u002F1912.05533) arXiv: 1912.05533* \n    \n*Jung-Woo Ha et al. [ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.09367) arXiv: 2004.09367*  \n  \n*Jason Li et al. [Jasper: An End-to-End Convolutional Neural Acoustic Model](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1904.03288.pdf) arXiv: 1902.03288* \n    \n*Anmol Gulati et al. [Conformer: Convolution-augmented Transformer for Speech Recognition](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.08100) arXiv: 2005.08100*\n    \n### Github References\n  \n*[IBM\u002FPytorch-seq2seq](https:\u002F\u002Fgithub.com\u002FIBM\u002Fpytorch-seq2seq)*  \n  \n*[SeanNaren\u002Fdeepspeech.pytorch](https:\u002F\u002Fgithub.com\u002FSeanNaren\u002Fdeepspeech.pytorch)*  \n  \n*[kaituoxu\u002FSpeech-Transformer](https:\u002F\u002Fgithub.com\u002Fkaituoxu\u002FSpeech-Transformer)*  \n  \n*[OpenNMT\u002FOpenNMT-py](https:\u002F\u002Fgithub.com\u002FOpenNMT\u002FOpenNMT-py)*  \n  \n*[clovaai\u002FClovaCall](https:\u002F\u002Fgithub.com\u002Fclovaai\u002FClovaCall)*  \n  \n*[LiyuanLucasLiu\u002FRAdam](https:\u002F\u002Fgithub.com\u002FLiyuanLucasLiu\u002FRAdam)*\n  \n*[NVIDIA\u002FDeepLearningExample](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FDeepLearningExamples)*\n  \n*[espnet\u002Fespnet](https:\u002F\u002Fgithub.com\u002Fespnet\u002Fespnet)*\n   \n### License\nThis project is licensed under the Apache-2.0 LICENSE - see the [LICENSE.md](https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Fblob\u002Fmaster\u002FLICENSE) file for details\n  \n## Citation\n  \nA [paper](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS2665963821000026) on KoSpeech is available. If you use the system for academic work, please cite:\n  \n```\n@ARTICLE{2021-kospeech,\n  author    = {Kim, Soohwan and Bae, Seyoung and Won, Cheolhwang},\n  title     = {KoSpeech: Open-Source Toolkit for End-to-End Korean Speech Recognition},\n  url       = {https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS2665963821000026},\n  month     = {February},\n  year      = {2021},\n  publisher = {ELSEVIER},\n  journal   = {SIMPAC},\n  pages     = {Volume 7, 100054}\n}\n```\nA [technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.03092) on KoSpeech in available.  \n  \n```\n@TECHREPORT{2020-kospeech,\n  author    = {Kim, Soohwan and Bae, Seyoung and Won, Cheolhwang},\n  title     = {KoSpeech: Open-Source Toolkit for End-to-End Korean Speech Recognition},\n  month     = {September},\n  year      = {2020},\n  url       = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.03092},\n  journal   = {ArXiv e-prints},\n  eprint    = {2009.03092}\n}\n```\n","\u003Cp align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsooftware_kospeech_readme_78d88f772a1a.png\" height=100>\n\n\n\u003Cdiv align=\"center\">\n\n**一个基于 PyTorch 的 Apache 2.0 许可的自动语音识别研究库，用于开发端到端语音识别模型。**\n\n  \n\u003C\u002Fdiv>\n  \n---\n  \n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech#introduction\">简介\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FsoFTWARE\u002Fkospeech#introduction\">路线图\u003C\u002Fa> •\n  \u003Ca href=\"soFTWARE.github.io\u002Fkospeech\u002F\">文档\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Fwww.codefactor.io\u002Frepository\u002Fgithub\u002FsoFTWARE\u002Fkospeech\">Codefactor\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FsoFTWARE\u002Fkospeech\u002Fblob\u002Fmain\u002FLICENSE\">许可证\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Fgitter.im\u002FKorean-Speech-Recognition\u002Fcommunity\">Gitter\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS2665963821000026\">论文\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\u003C\u002Fdiv>\n     \n#### 此仓库已归档。如果您找到本仓库的原因属于以下情况，我们将为您推荐不同的仓库：\n\n- 我想训练自己的语音识别模型或研究内部代码！ **→** [OpenSpeech](https:\u002F\u002Fgithub.com\u002Fopenspeech-team\u002Fopenspeech)\n- 我想立即测试已训练好的韩语语音识别模型！ **→** [Pororo ASR](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fpororo) 或 [Whisper](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper)\n   \n### 最新动态\n- 2021年5月：修复 LayerNorm 错误、子词错误\n- 2021年2月：更新文档\n- 2021年2月：新增 RNN-Transducer 模型\n- 2021年1月：发布 v1.3\n- 2021年1月：新增 Conformer 模型\n- 2021年1月：新增 Jasper 模型\n- 2021年1月：新增联合 CTC-注意力 Transformer 模型\n- 2021年1月：新增 Speech Transformer 模型\n- 2021年1月：应用 [Hydra：用于优雅配置复杂应用的框架](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhydra)\n  \n### 注意事项\n  \n- 近期我对代码进行了大量修改，但由于个人事务繁忙，未能对所有情况进行充分测试。如果发现任何问题，请随时向我反馈。\n- 子词和字素单元目前尚未经过全面测试。\n  \n### ***[KoSpeech：面向端到端韩语语音识别的开源工具包 \\[论文\\]](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS2665963821000026)***\n  \n***KoSpeech*** 是一款开源软件，基于深度学习库 PyTorch 构建，是一个模块化且可扩展的端到端韩语自动语音识别（ASR）工具包。目前已发布多款自动语音识别开源工具包，但它们大多针对非韩语语言，例如英语（如 ESPnet、Espresso）。尽管 AI Hub 开放了名为 KsponSpeech 的 1,000 小时韩语语音语料库，但至今仍缺乏标准化的预处理方法和基准模型来比较不同模型的性能。因此，我们提出了针对 KsponSpeech 语料库的预处理方法，并实现了多种模型（Deep Speech 2、LAS、Transformer、Jasper、Conformer）。通过 KoSpeech，我们希望为从事韩语语音识别研究的人员提供参考指南。  \n  \n### 支持的模型\n  \n|声学模型|备注|引用|  \n|--------------|------|--------:|  \n|Deep Speech 2|二维不变卷积 & RNN & CTC|[Dario Amodei 等，2015](https:\u002F\u002Farxiv.org\u002Fabs\u002F1512.02595)|   \n|听-注意-拼写（LAS）|基于注意力机制的 RNN 序列到序列模型|[William Chan 等，2016](https:\u002F\u002Farxiv.org\u002Fabs\u002F1508.01211)|  \n|联合 CTC-注意力 LAS|联合 CTC-注意力 LAS|[Suyoun Kim 等，2017](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1609.06773.pdf)|  \n|RNN-Transducer|RNN Transducer|[Ales Graves，2012](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1211.3711.pdf)|  \n|Speech Transformer|卷积特征提取器 & Transformer|[Linhao Dong 等，2018](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F8462506)|  \n|Jasper|全卷积网络 & 密集残差连接 & CTC|[Jason Li 等，2019](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1904.03288.pdf)|  \n|Conformer|卷积增强型 Transformer|[Anmol Gulati 等，2020](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.08100)|  \n  \n- **注**  \n这些模型主要基于上述文献，但在具体实现中可能还包含其他部分。\n  \n  \n## 简介\n  \n端到端（E2E）自动语音识别（ASR）是基于神经网络的语音识别领域中一种新兴范式，具有诸多优势。传统的“混合”ASR 系统由声学模型、语言模型和发音模型组成，需要分别训练这些组件，而每个组件的训练过程都可能相当复杂。   \n  \n例如，声学模型的训练通常涉及多个阶段，包括模型训练以及语音声学特征序列与输出标签序列之间的对齐。相比之下，E2E ASR 采用单一的集成方法，其训练流程更为简单，模型以较低的音频帧率运行。这不仅缩短了训练和解码时间，还允许与下游处理任务（如自然语言理解）进行联合优化。\n\n## 路线图\n  \n截至目前，已实现多个模型：*Deep Speech 2、听-看-拼写（LAS）、RNN-转换器、语音Transformer、Jasper、Conformer*。\n  \n- *Deep Speech 2*  \n  \nDeep Speech 2 在使用连接时序分类（CTC）损失的 ASR 任务中表现出更快且更准确的性能。该模型因相较于之前的端到端模型显著提升性能而备受关注。\n\n  \n- *听-看-拼写（LAS）*\n   \n我们沿用了“听-看-拼写”中提出的架构，但进行了一些改进以提升性能。我们提供了四种不同的注意力机制：`缩放点积注意力`、`加性注意力`、`位置感知注意力`、`多头注意力`。注意力机制对模型性能影响很大。\n  \n- *RNN-转换器*\n  \nRNN-转换器是一种无需使用注意力机制的序列到序列模型。与大多数通常需要处理整个输入序列（在我们的案例中为波形）才能生成输出（即句子）的序列到序列模型不同，RNN-T 可以持续处理输入样本并流式输出符号，这一特性非常适合语音转文字应用。在我们的实现中，输出符号是字母表中的字符。\n  \n- *语音Transformer*  \n  \nTransformer 是自然语言处理（NLP）领域中一种强大的架构。该架构在 ASR 任务中也表现出良好的性能。此外，随着该模型在自然语言处理领域的研究不断深入，其具有很高的进一步开发潜力。\n  \n- *联合 CTC-注意力*\n  \n通过所提出的架构，同时利用基于 CTC 的模型和基于注意力的模型的优势。这是一种通过在编码器中加入 CTC 来增强模型鲁棒性的结构。联合 CTC-注意力模型可以与 LAS 和语音 Transformer 结合训练。\n  \n- *Jasper*  \n  \nJasper（Just Another SPEech Recognizer）是一种端到端卷积神经网络声学模型。Jasper 仅使用 CNN → BatchNorm → ReLU → Dropout 块以及残差连接，便展现出强大的性能。\n  \n- *Conformer*\n  \nConformer 将卷积神经网络与 Transformer 相结合，以参数高效的方式建模音频序列的局部和全局依赖关系。Conformer 显著优于先前的基于 Transformer 和 CNN 的模型，达到了最先进的准确率。\n  \n## 安装\n本项目推荐使用 Python 3.7 或更高版本。  \n我们建议为此项目创建一个新的虚拟环境（使用 virtual env 或 conda）。\n\n### 先决条件\n  \n* Numpy: `pip install numpy`（安装 Numpy 时遇到问题可参考 [这里](https:\u002F\u002Fgithub.com\u002Fnumpy\u002Fnumpy)）。\n* Pytorch: 请参考 [PyTorch 官网](http:\u002F\u002Fpytorch.org\u002F)，根据您的环境安装相应版本。   \n* Pandas: `pip install pandas`（安装 Pandas 时遇到问题可参考 [这里](https:\u002F\u002Fgithub.com\u002Fpandas-dev\u002Fpandas)）  \n* Matplotlib: `pip install matplotlib`（安装 Matplotlib 时遇到问题可参考 [这里](https:\u002F\u002Fgithub.com\u002Fmatplotlib\u002Fmatplotlib)）  \n* librosa: `conda install -c conda-forge librosa`（安装 librosa 时遇到问题可参考 [这里](https:\u002F\u002Fgithub.com\u002Flibrosa\u002Flibrosa)）  \n* torchaudio: `pip install torchaudio==0.6.0`（安装 torchaudio 时遇到问题可参考 [这里](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch)）  \n* tqdm: `pip install tqdm`（安装 tqdm 时遇到问题可参考 [这里](https:\u002F\u002Fgithub.com\u002Ftqdm\u002Ftqdm)）  \n* sentencepiece: `pip install sentencepiece`（安装 sentencepiece 时遇到问题可参考 [这里](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece)）  \n* warp-rnnt: `pip install warp_rnnt`（安装 warp-rnnt 时遇到问题可参考 [这里](https:\u002F\u002Fgithub.com\u002F1ytic\u002Fwarp-rnnt)）  \n* hydra: `pip install hydra-core --upgrade`（安装 hydra 时遇到问题可参考 [这里](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhydra)）  \n \n### 从源码安装\n目前我们仅支持使用 setuptools 从源代码安装。克隆源代码并运行以下命令：\n```\npip install -e .\n```\n  \n## 开始使用\n  \n我们使用 [Hydra](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhydra) 来管理所有的训练配置。如果您不熟悉 Hydra，建议访问 [Hydra 官网](https:\u002F\u002Fhydra.cc\u002F)。一般来说，Hydra 是一个开源框架，它通过提供动态创建分层配置的能力，简化了科研应用程序的开发。\n  \n### 准备 KsponSpeech 数据集（LibriSpeech 也可支持）\n  \n可以从 [这里](https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech#pre-processed-transcripts) 下载，或参考以下内容进行预处理。\n  \n- KsponSpeech : [查看此页面](https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Ftree\u002Fmaster\u002Fdataset\u002Fkspon)\n- LibriSpeech : [查看此页面](https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Ftree\u002Fmaster\u002Fdataset\u002Flibri)\n  \n### 训练 KsponSpeech 数据集\n  \n您可以选择多种模型和训练选项。还有许多其他训练选项，请仔细查看并执行以下命令：  \n  \n- **Deep Speech 2** 训练\n```\npython .\u002Fbin\u002Fmain.py model=ds2 train=ds2_train train.dataset_path=$DATASET_PATH\n```\n  \n- **听-看-拼写** 训练\n```\npython .\u002Fbin\u002Fmain.py model=las train=las_train train.dataset_path=$DATASET_PATH\n```\n  \n- **联合 CTC-注意力 听-看-拼写** 训练\n```\npython .\u002Fbin\u002Fmain.py model=joint-ctc-attention-las train=las_train train.dataset_path=$DATASET_PATH\n```\n  \n- **RNN 转换器** 训练\n```\npython .\u002Fbin\u002Fmain.py model=rnnt train=rnnt_train train.dataset_path=$DATASET_PATH\n```\n  \n- **语音 Transformer** 训练\n```\npython .\u002Fbin\u002Fmain.py model=transformer train=transformer_train train.dataset_path=$DATASET_PATH\n```\n  \n- **联合 CTC-注意力 语音 Transformer** 训练\n```\npython .\u002Fbin\u002Fmain.py model=joint-ctc-attention-transformer train=transformer_train train.dataset_path=$DATASET_PATH\n```\n  \n- **Jasper** 训练\n```\npython .\u002Fbin\u002Fmain.py model=jasper train=jasper_train train.dataset_path=$DATASET_PATH\n```\n  \n- **Conformer** 训练\n```\npython .\u002Fbin\u002Fmain.py model=conformer-large train=conformer_large_train train.dataset_path=$DATASET_PATH\n```\n您也可以使用 `conformer-medium` 或 `conformer-small` 模型进行训练。\n  \n  \n### 对 KsponSpeech 进行评估\n```\npython .\u002Fbin\u002Feval.py eval.dataset_path=$DATASET_PATH eval.transcripts_path=$TRANSCRIPTS_PATH eval.model_path=$MODEL_PATH\n```\n  \n现在您已经拥有一款可用于对新数据进行预测的模型。我们可以通过运行 `贪婪搜索` 或 `束搜索` 来实现这一点。\n  \n### 使用预训练模型对单个音频文件进行推理\n\n* 命令\n```\n$ python3 .\u002Fbin\u002Finference.py --model_path $MODEL_PATH --audio_path $AUDIO_PATH --device $DEVICE\n```\n* 输出\n```\n语音识别结果句子会显示出来\n```  \n您可以通过一段音频快速查看预训练模型的推理效果。\n\n### 检查点   \n检查点按实验和时间戳组织，文件结构如下所示。  \n```\noutputs\n+-- YYYY_mm_dd\n|  +-- HH_MM_SS\n   |  +-- trainer_states.pt\n   |  +-- model.pt\n```\n您可以从检查点恢复并加载模型。\n\n## 故障排除与贡献\n如果您有任何问题、错误报告或功能请求，请在 GitHub 上 [提交一个问题](https:\u002F\u002Fgithub.com\u002Fsooftware\u002FEnd-to-end-Speech-Recognition\u002Fissues)。  \n如需实时讨论，请访问我们的 [Gitter](https:\u002F\u002Fgitter.im\u002FKorean-Speech-Recognition\u002Fcommunity)，或联系 sh951011@gmail.com。\n\n我们欢迎任何形式的反馈和贡献。您可以从小的改进入手，例如修复 bug 或完善文档。对于重大贡献或新功能，请先在相关 issue 中与合作者讨论。\n\n### 代码风格\n我们遵循 [PEP-8](https:\u002F\u002Fwww.python.org\u002Fdev\u002Fpeps\u002Fpep-0008\u002F) 代码风格规范。尤其是 docstring 的风格对生成文档非常重要。\n\n### 论文参考\n\n* Ilya Sutskever 等人，《使用神经网络进行序列到序列学习》(arXiv: 1409.3215)  \n\n* Dzmitry Bahdanau 等人，《通过联合学习对齐与翻译实现神经机器翻译》(arXiv: 1409.0473)  \n\n* Jan Chorowski 等人，《基于注意力机制的语音识别模型》(arXiv: 1506.07503)  \n\n* Wiliam Chan 等人，《听、注意与拼写》(arXiv: 1508.01211)  \n\n* Dario Amodei 等人，《Deep Speech2：英语和普通话的端到端语音识别》(arXiv: 1512.02595)  \n\n* Takaaki Hori 等人，《结合深度 CNN 编码器和 RNN-LM 的 CTC-注意力联合 E2E 自动语音识别进展》(arXiv: 1706.02737)  \n\n* Ashish Vaswani 等人，《注意力就是一切》(arXiv: 1706.03762)  \n\n* Chung-Cheng Chiu 等人，《基于序列到序列模型的最先进语音识别》(arXiv: 1712.01769)  \n\n* Anjuli Kannan 等人，《将外部语言模型融入序列到序列模型的分析》(arXiv: 1712.01996)  \n\n* Daniel S. Park 等人，《SpecAugment：一种简单的自动语音识别数据增强方法》(arXiv: 1904.08779)  \n\n* Rafael Muller 等人，《标签平滑何时有效？》(arXiv: 1906.02629)  \n\n* Daniel S. Park 等人，《SpecAugment 在大规模数据集上的应用》(arXiv: 1912.05533)  \n\n* Jung-Woo Ha 等人，《ClovaCall：面向呼叫中心自动语音识别的韩语目标导向对话语音语料库》(arXiv: 2004.09367)  \n\n* Jason Li 等人，《Jasper：一种端到端卷积神经网络声学模型》(arXiv: 1902.03288)  \n\n* Anmol Gulati 等人，《Conformer：用于语音识别的卷积增强 Transformer》(arXiv: 2005.08100)\n\n### GitHub 参考\n\n* [IBM\u002FPytorch-seq2seq](https:\u002F\u002Fgithub.com\u002FIBM\u002Fpytorch-seq2seq)  \n\n* [SeanNaren\u002Fdeepspeech.pytorch](https:\u002F\u002Fgithub.com\u002FSeanNaren\u002Fdeepspeech.pytorch)  \n\n* [kaituoxu\u002FSpeech-Transformer](https:\u002F\u002Fgithub.com\u002Fkaituoxu\u002FSpeech-Transformer)  \n\n* [OpenNMT\u002FOpenNMT-py](https:\u002F\u002Fgithub.com\u002FOpenNMT\u002FOpenNMT-py)  \n\n* [clovaai\u002FClovaCall](https:\u002F\u002Fgithub.com\u002Fclovaai\u002FClovaCall)  \n\n* [LiyuanLucasLiu\u002FRAdam](https:\u002F\u002Fgithub.com\u002FLiyuanLucasLiu\u002FRAdam)\n\n* [NVIDIA\u002FDeepLearningExample](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FDeepLearningExamples)\n\n* [espnet\u002Fespnet](https:\u002F\u002Fgithub.com\u002Fespnet\u002Fespnet)\n\n### 许可证\n本项目采用 Apache-2.0 许可证授权——详情请参阅 [LICENSE.md](https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Fblob\u002Fmaster\u002FLICENSE) 文件。\n\n## 引用\n关于 KoSpeech 的一篇论文已在 ScienceDirect 上发表。如果您在学术研究中使用该系统，请引用以下内容：\n\n```\n@ARTICLE{2021-kospeech,\n  author    = {Kim, Soohwan and Bae, Seyoung and Won, Cheolhwang},\n  title     = {KoSpeech：用于端到端韩语语音识别的开源工具包},\n  url       = {https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS2665963821000026},\n  month     = {二月},\n  year      = {2021},\n  publisher = {ELSEVIER},\n  journal   = {SIMPAC},\n  pages     = {第7卷，100054页}\n}\n```\n\n此外，还有一篇关于 KoSpeech 的技术报告已发布于 arXiv：\n\n```\n@TECHREPORT{2020-kospeech,\n  author    = {Kim, Soohwan and Bae, Seyoung and Won, Cheolhwang},\n  title     = {KoSpeech：用于端到端韩语语音识别的开源工具包},\n  month     = {九月},\n  year      = {2020},\n  url       = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.03092},\n  journal   = {ArXiv e-prints},\n  eprint    = {2009.03092}\n}\n```","# KoSpeech 快速上手指南\n\nKoSpeech 是一个基于 PyTorch 构建的开源端到端韩语语音识别（ASR）研究库。虽然该项目仓库已归档，但其核心代码和预训练模型仍具有参考价值。若需继续开发或训练新模型，官方建议迁移至 [OpenSpeech](https:\u002F\u002Fgithub.com\u002Fopenspeech-team\u002Fopenspeech)；若仅需测试韩语识别效果，推荐使用 [Pororo ASR](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fpororo) 或 [Whisper](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper)。\n\n以下指南基于原仓库内容整理，适用于想要复现经典模型或学习内部代码的开发者。\n\n## 环境准备\n\n本项目推荐 **Python 3.7** 或更高版本。强烈建议使用虚拟环境（如 `venv` 或 `conda`）以避免依赖冲突。\n\n### 系统要求\n- 操作系统：Linux \u002F macOS (Windows 需自行配置编译环境)\n- GPU：支持 CUDA 的 NVIDIA 显卡（可选，但推荐用于加速训练）\n\n### 前置依赖安装\n请依次安装以下基础依赖。国内用户可使用清华源或阿里源加速 `pip` 安装。\n\n```bash\n# 基础科学计算库\npip install numpy pandas matplotlib tqdm -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# PyTorch 及相关音频库 (请根据官网选择对应 CUDA 版本)\n# 示例：pip install torch torchvision torchaudio==0.6.0 -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\npip install torchaudio==0.6.0\n\n# 音频处理与分词\nconda install -c conda-forge librosa\npip install sentencepiece -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# RNN-T 专用加速库\npip install warp_rnnt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# 配置管理框架 Hydra\npip install hydra-core --upgrade -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n> **注意**：`librosa` 在部分 Linux 环境下通过 `pip` 安装可能报错，优先推荐使用 `conda` 安装。\n\n## 安装步骤\n\n克隆源代码并进入目录，使用 `setuptools` 进行本地开发模式安装：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech.git\ncd kospeech\npip install -e .\n```\n\n## 基本使用\n\nKoSpeech 使用 [Hydra](https:\u002F\u002Fhydra.cc\u002F) 框架管理所有训练配置。使用前请确保已准备好数据集（如 KsponSpeech 或 LibriSpeech）。\n\n### 1. 数据准备\n请下载并预处理数据集。具体脚本和说明请参考仓库内的 `dataset\u002Fkspon` 或 `dataset\u002Flibri` 目录。\n假设数据集已处理完毕，路径为 `\u002Fpath\u002Fto\u002Fdataset`。\n\n### 2. 模型训练\n你可以选择不同的声学模型进行训练。以下以 **Deep Speech 2** 和 **Conformer** 为例：\n\n**训练 Deep Speech 2 模型：**\n```bash\npython .\u002Fbin\u002Fmain.py model=ds2 train=ds2_train train.dataset_path=\u002Fpath\u002Fto\u002Fdataset\n```\n\n**训练 Conformer 模型 (Large 版本)：**\n```bash\npython .\u002Fbin\u002Fmain.py model=conformer-large train=conformer_large_train train.dataset_path=\u002Fpath\u002Fto\u002Fdataset\n```\n*注：也可选用 `conformer-medium` 或 `conformer-small`。*\n\n其他支持模型命令参考：\n- LAS: `python .\u002Fbin\u002Fmain.py model=las train=las_train train.dataset_path=...`\n- Jasper: `python .\u002Fbin\u002Fmain.py model=jasper train=jasper_train train.dataset_path=...`\n- RNN-T: `python .\u002Fbin\u002Fmain.py model=rnnt train=rnnt_train train.dataset_path=...`\n\n### 3. 模型评估\n使用测试集评估训练好的模型：\n\n```bash\npython .\u002Fbin\u002Feval.py eval.dataset_path=\u002Fpath\u002Fto\u002Fdataset eval.transcripts_path=\u002Fpath\u002Fto\u002Ftranscripts eval.model_path=\u002Fpath\u002Fto\u002Fmodel.pt\n```\n\n### 4. 单条音频推理\n使用预训练模型对单个音频文件进行识别：\n\n```bash\npython3 .\u002Fbin\u002Finference.py --model_path \u002Fpath\u002Fto\u002Fmodel.pt --audio_path \u002Fpath\u002Fto\u002Faudio.wav --device cuda\n```\n\n**输出示例：**\n```text\n음성인식 결과 문장이 나옵니다\n```\n\n### 检查点说明\n训练产生的模型权重和状态将按时间戳保存在 `outputs\u002F` 目录下：\n```text\noutputs\n+-- YYYY_mm_dd\n|  +-- HH_MM_SS\n   |  +-- trainer_states.pt  (用于恢复训练)\n   |  +-- model.pt           (用于推理\u002F评估)\n```","韩国某初创团队正致力于开发一款针对老年用户的韩语语音健康记录应用，需要将医生的口述诊疗内容实时转化为结构化文本。\n\n### 没有 kospeech 时\n- **缺乏韩语专用基线**：市面上主流开源工具（如 ESPnet）主要针对英语优化，直接用于韩语时识别率极低，且无现成的韩语预处理方案。\n- **数据清洗成本高**：面对韩国 AI Hub 公开的 KsponSpeech 千小时语料，团队需从零编写复杂的清洗和特征提取代码，耗时数周。\n- **模型复现困难**：想要尝试业界领先的 Conformer 或 Jasper 架构，必须逐行研读论文并手动搭建网络，调试周期漫长且容易出错。\n- **实验配置混乱**：不同模型的超参数管理依赖硬编码，每次切换实验都要修改大量文件，难以进行系统性的性能对比。\n\n### 使用 kospeech 后\n- **开箱即用的韩语支持**：kospeech 提供了专为 KsponSpeech 设计的标准化预处理流程和基线模型，团队当天即可启动训练。\n- **主流架构一键调用**：内置了 Deep Speech 2、Transformer、Conformer 等多种经过验证的端到端模型，无需重复造轮子，直接微调即可适配医疗场景。\n- **高效灵活的实验管理**：基于 Hydra 框架，通过简单的配置文件即可灵活调整模型结构和训练参数，快速完成多模型效果比对。\n- **研发门槛显著降低**：模块化的 PyTorch 代码结构让团队成员能轻松理解内部逻辑，将原本一个月的数据准备与模型搭建期缩短至三天。\n\nkospeech 填补了韩语端到端语音识别开源生态的空白，让开发者能从繁琐的基础设施构建中解放出来，专注于垂直领域的业务落地。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsooftware_kospeech_c57ef836.png","sooftware","Soohwan Kim","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fsooftware_50359f13.jpg","Just vibing. ✌️",null,"\bSeoul, Republic of Korea","sh951011@gmail.com","sooftware.io","https:\u002F\u002Fgithub.com\u002Fsooftware",[86,90],{"name":87,"color":88,"percentage":89},"Python","#3572A5",99.5,{"name":91,"color":92,"percentage":93},"Shell","#89e051",0.5,638,192,"2026-03-18T04:20:15","Apache-2.0","未说明","需要 NVIDIA GPU (因依赖 warp-rnnt 和 PyTorch)，具体型号和显存未说明，CUDA 版本需与安装的 PyTorch 版本匹配",{"notes":101,"python":102,"dependencies":103},"该项目已归档，作者建议使用 OpenSpeech 进行新模型训练或使用 Pororo\u002FWhisper 进行测试。强烈建议使用 virtualenv 或 conda 创建独立虚拟环境。配置管理使用 Hydra 框架。支持 KsponSpeech 和 LibriSpeech 数据集。","3.7+",[104,105,106,107,108,109,110,111,112,113],"numpy","pytorch","pandas","matplotlib","librosa","torchaudio==0.6.0","tqdm","sentencepiece","warp-rnnt","hydra-core",[26,13,55],[116,117,118,119,120,121,105,122,123,124,125,126,127,128],"speech-recognition","asr","korean-speech","end-to-end","las-models","ksponspeech","seq2seq","e2e-asr","las","transformer","attention-is-all-you-need","jasper","conformer","2026-03-27T02:49:30.150509","2026-04-06T06:44:05.149638",[132,137,142,147,152,157,162],{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},13251,"运行 main.py 训练时提示找不到音频文件路径，应该如何配置？","需要在 `\u002Fkospeech\u002Futils.py` 文件中配置音频文件的路径。具体来说，应使用预处理过程中生成的 `aihub_labels.csv` 文件来设置路径，并确认该文件中包含正确的文件总数（例如 2337 个文件）。仅仅在 `run_seq2seq.sh` 中设置转录文件路径是不够的。","https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Fissues\u002F50",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},13250,"训练集 CER 和 Loss 下降，但验证集指标停滞不前或无法提升，可能是什么原因？","这通常不是简单的 teacher_forcing 问题，而可能是隐藏状态（hidden state）传递过程中出现了错误。此外，如果只训练了 1 个 epoch，CER 在 15% 左右属于正常现象，通常需要训练 20-30 个 epoch 才能看到最终性能。请检查代码中是否存在 token 编号混乱的情况，并确保数据预处理正确。","https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Fissues\u002F7",{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},13252,"遇到 TypeError: forward() missing required positional arguments 'inputs' and 'targets' 错误如何解决？","这是因为调用模型时参数传递方式不匹配。请显式地指定参数名称，将代码修改为：`y_hat, logit = model(inputs=feats, targets=scripts, teacher_forcing_ratio=0.0, use_beam_search=False)`。如果问题依旧，可能是本地模型代码版本过旧，建议重新下载 `models` 文件夹覆盖本地文件。","https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Fissues\u002F6",{"id":148,"question_zh":149,"answer_zh":150,"source_url":151},13253,"使用 resume=True 继续训练时，为什么从第 2 个 epoch 开始 CER 值不再下降？","这种情况通常是因为之前的预处理步骤没有正确执行导致的。如果数据预处理不当，即使恢复训练，模型也无法有效学习。请确保在重新开始训练前，数据已经经过了完整且正确的预处理流程。","https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Fissues\u002F165",{"id":153,"question_zh":154,"answer_zh":155,"source_url":156},13254,"训练开始后卡在 epoch 0，没有任何进展或报错，可能是什么原因？","这可能是由于特征提取（如 MFCC 或 Fbank）过程中出现了潜在错误，或者音频文件本身存在问题。如果日志显示参数加载正常但无后续输出，建议检查音频数据集（如 KsponSpeech）的文件完整性，并确认特征提取器（如 kaldi）是否正常工作。维护者曾指出某些特定特征提取方法可能存在 bug。","https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Fissues\u002F58",{"id":158,"question_zh":159,"answer_zh":160,"source_url":161},13255,"训练过程中出现 CUDA device-side assert triggered 错误怎么办？","该错误通常发生在评估（evaluate）阶段或束搜索（beam search）部分，往往是由于代码中的索引越界或张量维度不匹配引起的。维护者已修复了 evaluate() 部分的相关错误，并正在修复 beam search 部分。建议拉取最新的代码更新，特别是 `evaluator.py` 和解码相关的模块，以解决此问题。","https:\u002F\u002Fgithub.com\u002Fsooftware\u002Fkospeech\u002Fissues\u002F2",{"id":163,"question_zh":164,"answer_zh":165,"source_url":141},13256,"Token 编号顺序（如 PAD_TOKEN, SOS_TOKEN, Space Bar 的顺序）对识别准确率有影响吗？","是的，Token 编号顺序对结果有显著影响。有用户反馈，将空格符（Space Bar）设置为 0 号 Token（而不是传统的 PAD_TOKEN）后，准确率得到了正常体现。虽然具体原理可能与模型内部实现有关，但建议严格遵循项目默认的标签集顺序进行编码，不要随意更改特殊 Token 的索引。",[167,172,177,182],{"id":168,"version":169,"summary_zh":170,"released_at":171},71940,"v1.3","### 新增功能\n\n- 添加 [Jasper 模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1904.03288.pdf)\n- 添加 [Conformer 模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.08100)\n- 添加 Transformer 学习率调度器\n","2021-01-25T20:44:50",{"id":173,"version":174,"summary_zh":175,"released_at":176},71941,"v1.2","### 新增内容\n- 2021年1月：通过联合CTC-注意力机制的Transformer模型\n- 2021年1月：通过语音Transformer模型\n- 2021年1月：应用Hydra框架：用于优雅地配置复杂应用程序的框架","2021-01-05T06:47:55",{"id":178,"version":179,"summary_zh":180,"released_at":181},71942,"v1.1","- 添加联合CTC-注意力机制架构（目前不支持多GPU训练）\n- 添加Deep Speech 2架构","2020-12-14T16:37:07",{"id":183,"version":184,"summary_zh":185,"released_at":186},71943,"v1.0","科斯普里奇 v1.0","2020-12-02T06:16:59"]