[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-facebookresearch--AudioDec":3,"tool-facebookresearch--AudioDec":65},[4,23,32,40,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,2,"2026-04-05T10:45:23",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},4128,"GPT-SoVITS","RVC-Boss\u002FGPT-SoVITS","GPT-SoVITS 是一款强大的开源语音合成与声音克隆工具，旨在让用户仅需极少量的音频数据即可训练出高质量的个性化语音模型。它核心解决了传统语音合成技术依赖海量录音数据、门槛高且成本大的痛点，实现了“零样本”和“少样本”的快速建模：用户只需提供 5 秒参考音频即可即时生成语音，或使用 1 分钟数据进行微调，从而获得高度逼真且相似度极佳的声音效果。\n\n该工具特别适合内容创作者、独立开发者、研究人员以及希望为角色配音的普通用户使用。其内置的友好 WebUI 界面集成了人声伴奏分离、自动数据集切片、中文语音识别及文本标注等辅助功能，极大地降低了数据准备和模型训练的技术门槛，让非专业人士也能轻松上手。\n\n在技术亮点方面，GPT-SoVITS 不仅支持中、英、日、韩、粤语等多语言跨语种合成，还具备卓越的推理速度，在主流显卡上可实现实时甚至超实时的生成效率。无论是需要快速制作视频配音，还是进行多语言语音交互研究，GPT-SoVITS 都能以极低的数据成本提供专业级的语音合成体验。",56375,3,"2026-04-05T22:15:46",[21],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},2863,"TTS","coqui-ai\u002FTTS","🐸TTS 是一款功能强大的深度学习文本转语音（Text-to-Speech）开源库，旨在将文字自然流畅地转化为逼真的人声。它解决了传统语音合成技术中声音机械生硬、多语言支持不足以及定制门槛高等痛点，让高质量的语音生成变得触手可及。\n\n无论是希望快速集成语音功能的开发者，还是致力于探索前沿算法的研究人员，亦或是需要定制专属声音的数据科学家，🐸TTS 都能提供得力支持。它不仅预置了覆盖全球 1100 
多种语言的训练模型，让用户能够即刻上手，还提供了完善的工具链，支持用户利用自有数据训练新模型或对现有模型进行微调，轻松实现特定风格的声音克隆。\n\n在技术亮点方面，🐸TTS 表现卓越。其最新的 ⓍTTSv2 模型支持 16 种语言，并在整体性能上大幅提升，实现了低于 200 毫秒的超低延迟流式输出，极大提升了实时交互体验。此外，它还无缝集成了 🐶Bark、🐢Tortoise 等社区热门模型，并支持调用上千个 Fairseq 模型，展现了极强的兼容性与扩展性。配合丰富的数据集分析与整理工具，🐸TTS 已成为科研与生产环境中备受信赖的语音合成解决方案。",44971,"2026-04-03T14:47:02",[21,20,13],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":29,"last_commit_at":46,"category_tags":47,"status":22},2375,"LocalAI","mudler\u002FLocalAI","LocalAI 是一款开源的本地人工智能引擎，旨在让用户在任意硬件上轻松运行各类 AI 模型，包括大语言模型、图像生成、语音识别及视频处理等。它的核心优势在于彻底打破了高性能计算的门槛，无需昂贵的专用 GPU，仅凭普通 CPU 或常见的消费级显卡（如 NVIDIA、AMD、Intel 及 Apple Silicon）即可部署和运行复杂的 AI 任务。\n\n对于担心数据隐私的用户而言，LocalAI 提供了“隐私优先”的解决方案，确保所有数据处理均在本地基础设施内完成，无需上传至云端。同时，它完美兼容 OpenAI、Anthropic 等主流 API 接口，这意味着开发者可以无缝迁移现有应用，直接利用本地资源替代云服务，既降低了成本又提升了可控性。\n\nLocalAI 内置了超过 35 种后端支持（如 llama.cpp、vLLM、Whisper 等），并集成了自主 AI 代理、工具调用及检索增强生成（RAG）等高级功能，且具备多用户管理与权限控制能力。无论是希望保护敏感数据的企业开发者、进行算法实验的研究人员，还是想要在个人电脑上体验最新 AI 技术的极客玩家，都能通过 LocalAI 获",44782,"2026-04-02T22:14:26",[13,21,19,17,20,14,16],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":29,"last_commit_at":54,"category_tags":55,"status":22},3108,"bark","suno-ai\u002Fbark","Bark 是由 Suno 推出的开源生成式音频模型，能够根据文本提示创造出高度逼真的多语言语音、音乐、背景噪音及简单音效。与传统仅能朗读文字的语音合成工具不同，Bark 基于 Transformer 架构，不仅能模拟说话，还能生成笑声、叹息、哭泣等非语言声音，甚至能处理带有情感色彩和语气停顿的复杂文本，极大地丰富了音频表达的可能性。\n\n它主要解决了传统语音合成声音机械、缺乏情感以及无法生成非语音类音效的痛点，让创作者能通过简单的文字描述获得生动自然的音频素材。无论是需要为视频配音的内容创作者、探索多模态生成的研究人员，还是希望快速原型设计的开发者，都能从中受益。普通用户也可通过集成的演示页面轻松体验其神奇效果。\n\n技术亮点方面，Bark 支持商业使用（MIT 许可），并在近期更新中实现了显著的推理速度提升，同时提供了适配低显存 GPU 的版本，降低了使用门槛。此外，社区还建立了丰富的提示词库，帮助用户更好地驾驭模型生成特定风格的声音。只需几行 Python 代码，即可将创意文本转化为高质量音频，是连接文字与声音世界的强大桥梁。",39067,"2026-04-04T03:33:35",[21],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":22},3788,"airi","moeru-ai\u002Fairi","airi 是一款开源的本地化 AI 伴侣项目，旨在将虚拟角色（如“二次元老婆”或赛博生命）带入用户的现实世界。它的核心目标是复刻并超越知名 AI 主播 Neuro-sama 
的能力，让用户能够拥有完全自主掌控、可私有化部署的智能伙伴。\n\nairi 主要解决了用户对高度定制化、具备情感交互能力且数据隐私安全的 AI 角色的需求。不同于依赖云端服务的通用助手，airi 允许用户在本地运行，不仅保护了对话隐私，还赋予了用户定义角色性格与灵魂的自由。它支持实时语音聊天，甚至能直接参与《我的世界》（Minecraft）和《异星工厂》（Factorio）等游戏，实现了从单纯对话到共同娱乐的跨越。\n\n这款工具非常适合喜爱虚拟角色的普通用户、希望搭建个性化 AI 陪伴的技术爱好者，以及研究多模态交互的开发者。其独特的技术亮点在于跨平台支持（涵盖 Web、macOS 和 Windows）以及强大的游戏交互能力，让 AI 不仅能“说”，还能“玩”。通过容器化的灵魂设计，airi 为每个人创造专属数字生命提供了可能，让虚拟陪伴变得更加真实且触手可及。",37086,1,"2026-04-05T10:54:25",[19,21,17],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":83,"stars":92,"forks":93,"last_commit_at":94,"license":95,"difficulty_score":29,"env_os":96,"env_gpu":97,"env_ram":98,"env_deps":99,"category_tags":106,"github_topics":80,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":22,"created_at":107,"updated_at":108,"faqs":109,"releases":150},4360,"facebookresearch\u002FAudioDec","AudioDec","An Open-source Streaming High-fidelity Neural Audio Codec","AudioDec 是一款开源的流式高保真神经音频编解码器，专为实时语音通信场景设计。它致力于解决传统音频技术在压缩率、延迟和音质之间难以兼顾的痛点：既能将 48kHz 的单声道语音压缩至仅需 12.8 kbps 的极低码率，又能保证重建后的声音高度自然清晰。\n\n该工具的核心优势在于其卓越的实时性能。在 GPU 环境下，其解码延迟低至约 6 毫秒，即使在普通 CPU 上也仅需约 10 毫秒，几乎实现了无感知的即时通信体验。此外，AudioDec 采用了高效的两阶段训练范式，开发者利用预训练模型，仅需数小时即可针对新应用场景完成编码器训练，大幅降低了研发门槛。项目提供了完整的训练、统计提取及实时流式传输的代码模板，并支持自编码器与声码器结合等多种模式。\n\nAudioDec 非常适合从事语音通信、在线会议系统开发的工程师，以及研究神经音频压缩算法的科研人员。对于希望在实际产品中部署高质量、低带宽语音功能的团队来说，这是一个成熟且易于集成的技术选择。","\u003Ca rel=\"license\" href=\"http:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc\u002F4.0\u002F\">\u003Cimg alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_AudioDec_readme_1f8182c1eb00.png\" \u002F>\u003C\u002Fa> This work is licensed under a \u003Ca 
rel=\"license\" href=\"http:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc\u002F4.0\u002F\">Creative Commons Attribution-NonCommercial 4.0 International License\u003C\u002Fa>.\n\n# AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec\n  \n### Highlights\n- Streamable high-fidelity audio codec for **48 kHz** mono speech with **12.8 kbps** bitrate.\n- Very low decoding latency on **GPU (~6 ms)** and **CPU (~10 ms)** with 4 threads.\n- Efficient two-stage training (with the pre-trained models, training an encoder for new applications takes only a few hours)\n\n### Abstract\nA good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e. the bitrate that is required to transmit the signal should be as low as possible; (2) latency, i.e. encoding and decoding the signal needs to be fast enough to enable communication without or with only minimal noticeable delay; and (3) reconstruction quality of the signal. In this work, we propose an open-source, streamable, and real-time neural audio codec that achieves strong performance along all three axes: it can reconstruct highly natural sounding 48 kHz speech signals while operating at only 12 kbps and running with less than 6 ms (GPU)\u002F10 ms (CPU) latency. An efficient training paradigm is also demonstrated for developing such neural audio codecs for real-world scenarios. [[paper](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10096509)] [[demo](https:\u002F\u002Fbigpon.github.io\u002FAudioDec_demo\u002F)]\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_AudioDec_readme_3b7f7f0b36c3.png\"\u002F>\n\u003C\u002Fp>\n\n### Two modes of AudioDec\n1. AutoEncoder (symmetric AudioDec, **symAD**)  \n 1-1. Train an AutoEncoder-based codec model from scratch with only metric loss(es) for the first 200k iterations.  \n 1-2. 
Fix the encoder, projector, quantizer, and codebook, and train the decoder with the discriminators for the following 500k iterations.  \n2. AutoEncoder + Vocoder (**AD v0,1,2**) (recommended!)  \n 2-1. Extract the stats (global mean and variance) of the codes extracted by the trained Encoder.  \n 2-2. Train the vocoder with the trained Encoder and stats for 500k iterations.\n\n\n## News\n- **2025\u002F03\u002F03**: The following work, [FlowDec](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FFlowDec), has been released.\n- **2024\u002F01\u002F03**: Update pre-trained models ([issue9](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Fissues\u002F9) and [issue11](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Fissues\u002F11))\n- **2023\u002F05\u002F17**: Upload the demo sounds on the [demo page](https:\u002F\u002Fbigpon.github.io\u002FAudioDec_demo\u002F)\n- **2023\u002F05\u002F13**: 1st version is released\n\n## Requirements\nThis repository is tested on Ubuntu 20.04 using a V100 and the following settings.\n- Python 3.8+\n- Cuda 11.0+\n- PyTorch 1.10+\n\n## Folder architecture\n- **bin**:\nThe folder of training, stats extraction, testing, and streaming templates.\n- **config**:\nThe folder of config files (.yaml).\n- **dataloader**:\nThe source codes of data loading.\n- **exp**:\nThe folder for saving models.\n- **layers**:\nThe source codes of basic neural layers.\n- **losses**:\nThe source codes of losses.\n- **models**:\nThe source codes of models.\n- **slurmlogs**:\nThe folder for saving slurm logs.\n- **stats**:\nThe folder for saving stats.\n- **trainer**:\nThe source codes of trainers.\n- **utils**:\nThe source codes of utils for the demo.\n\n## Run real-time streaming encoding\u002Fdecoding demo\n1. 
Please download the whole [exp](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Freleases\u002Fdownload\u002Fpretrain_models_v02\u002Fexp.zip) folder and put it in the AudioDec project directory.  \n2. Get the list of all I\u002FO devices\n```bash\n$ python -m sounddevice\n```\n3. Run the demo  \n```bash\n# The LibriTTS model is recommended for arbitrary microphones because it is more robust to microphone channel mismatches.\n# Set up the I\u002FO devices according to the list of I\u002FO devices\n\n# w\u002F GPU\n$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model libritts_v1\n\n# w\u002F CPU\n$ python demoStream.py --tx_cuda -1 --rx_cuda -1 --input_device 1 --output_device 4 --model libritts_sym\n\n# The input and output audio will be dumped into input.wav and output.wav\n```\n\n## Run codec demo with files\n1. Please download the whole [exp](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Freleases\u002Fdownload\u002Fpretrain_models_v02\u002Fexp.zip) folder and put it in the AudioDec project directory.  \n2. Run the demo  \n```bash\n## VCTK 48000Hz models\n$ python demoFile.py --model vctk_v1 -i xxx.wav -o ooo.wav\n\n## LibriTTS 24000Hz model\n$ python demoFile.py --model libritts_v1 -i xxx.wav -o ooo.wav\n```\n\n\n## Training and testing the whole AudioDec pipeline\n1. Prepare the training\u002Fvalidation\u002Ftest utterances and put them in three different folders  \n   e.g., **corpus\u002Ftrain**, **corpus\u002Fdev**, and **corpus\u002Ftest**\n2. Modify the paths (e.g., \u002Fmnt\u002Fhome\u002Fxxx\u002Fdatasets) in  \n   **submit_codec_vctk.sh**  \n   **config\u002Fautoencoder\u002FsymAD_vctk_48000_hop300.yaml**  \n   **config\u002Fstatistic\u002FsymAD_vctk_48000_hop300_clean.yaml**  \n   **config\u002Fvocoder\u002FAudioDec_v1_symAD_vctk_48000_hop300_clean.yaml**\n3. 
Assign the corresponding `analyzer` and `stats` in \n   **config\u002Fstatistic\u002FsymAD_vctk_48000_hop300_clean.yaml**  \n   **config\u002Fvocoder\u002FAudioDec_v1_symAD_vctk_48000_hop300_clean.yaml**\n4. Follow the usage instructions in **submit_codec_vctk.sh** to run the training and testing\n```bash\n# stage 0: training autoencoder from scratch\n# stage 1: extracting statistics\n# stage 2: training vocoder from scratch\n# stage 3: testing (symAE)\n# stage 4: testing (AE + Vocoder)\n\n# Run stages 0-4\n$ bash submit_codec_vctk.sh --start 0 --stop 4 \\\n--autoencoder \"autoencoder\u002FsymAD_vctk_48000_hop300\" \\\n--statistic \"statistic\u002FsymAD_vctk_48000_hop300_clean\" \\\n--vocoder \"vocoder\u002FAudioDec_v1_symAD_vctk_48000_hop300_clean\"  \n```\n\n\n## Training and testing only the AutoEncoder\n1. Prepare the training\u002Fvalidation\u002Ftest utterances and modify the paths \n2. Follow the usage instructions in **submit_autoencoder.sh** to run the training and testing\n```bash\n# Train AutoEncoder from scratch\n$ bash submit_autoencoder.sh --stage 0 \\\n--tag_name \"autoencoder\u002FsymAD_vctk_48000_hop300\"\n\n# Resume AutoEncoder from previous iterations\n$ bash submit_autoencoder.sh --stage 1 \\\n--tag_name \"autoencoder\u002FsymAD_vctk_48000_hop300\" \\\n--resumepoint 200000\n\n# Test AutoEncoder\n$ bash submit_autoencoder.sh --stage 2 \\\n--tag_name \"autoencoder\u002FsymAD_vctk_48000_hop300\" \\\n--subset \"clean_test\"\n```\n\n## Pre-trained Models\nAll pre-trained models can be accessed via [exp](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Freleases\u002Fdownload\u002Fpretrain_models_v02\u002Fexp.zip) (only the generators are provided).\n\n| AutoEncoder | Corpus | Fs | Bitrate | Path |  \n|---  |---  |---  |---  |---  |\n| symAD | VCTK | 48 kHz | 24 kbps | `exp\u002Fautoencoder\u002FsymAD_c16_vctk_48000_hop320`  |\n| symAAD | VCTK | 48 kHz | 12.8 kbps  | `exp\u002Fautoencoder\u002FsymAAD_vctk_48000_hop300`  |\n| symAD | VCTK | 48 kHz 
| 12.8 kbps | `exp\u002Fautoencoder\u002FsymAD_vctk_48000_hop300`  |\n| symAD_univ | VCTK | 48 kHz | 12.8 kbps  | `exp\u002Fautoencoder\u002FsymADuniv_vctk_48000_hop300`  |\n| symAD | LibriTTS | 24 kHz | 6.4 kbps  | `exp\u002Fautoencoder\u002FsymAD_libritts_24000_hop300`  |\n\n\n\n| Vocoder | Corpus | Fs | Path |\n|---  |---  |---  |---  |\n| AD v0 | VCTK | 48 kHz | `exp\u002Fvocoder\u002FAudioDec_v0_symAD_vctk_48000_hop300_clean` |\n| AD v1 | VCTK | 48 kHz | `exp\u002Fvocoder\u002FAudioDec_v1_symAD_vctk_48000_hop300_clean` |\n| AD v2 | VCTK | 48 kHz | `exp\u002Fvocoder\u002FAudioDec_v2_symAD_vctk_48000_hop300_clean` |\n| AD_univ | VCTK | 48 kHz | `exp\u002Fvocoder\u002FAudioDec_v3_symADuniv_vctk_48000_hop300_clean` |\n| AD v1 | LibriTTS | 24 kHz | `exp\u002Fvocoder\u002FAudioDec_v1_symAD_libritts_24000_hop300_clean` |\n\n\n\n\n## Bonus Track: Denoising\n1. Denoising requires only updating the encoder with noisy-clean pairs (the decoder\u002Fvocoder stays the same).\n2. Prepare the noisy-clean corpus and follow the usage instructions in **submit_denoise.sh** to run the training and testing\n```bash\n# Update the Encoder for denoising\n$ bash submit_denoise.sh --stage 0 \\\n--tag_name \"denoise\u002FsymAD_vctk_48000_hop300\"\n\n# Denoise\n$ bash submit_denoise.sh --stage 2 \\\n--encoder \"denoise\u002FsymAD_vctk_48000_hop300\" \\\n--decoder \"vocoder\u002FAudioDec_v1_symAD_vctk_48000_hop300_clean\" \\\n--encoder_checkpoint 200000 \\\n--decoder_checkpoint 500000 \\\n--subset \"noisy_test\"\n\n# Stream demo w\u002F GPU\n$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model vctk_denoise\n\n# Codec demo w\u002F files\n$ python demoFile.py -i xxx.wav -o ooo.wav --model vctk_denoise\n\n```\n\n\n## Citation\n\nIf you find the code helpful, please cite the following article.\n\n```\n@INPROCEEDINGS{10096509,\n  author={Wu, Yi-Chiao and Gebru, Israel D. 
and Marković, Dejan and Richard, Alexander},\n  booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, \n  title={{A}udio{D}ec: An Open-Source Streaming High-Fidelity Neural Audio Codec}, \n  year={2023},\n  doi={10.1109\u002FICASSP49357.2023.10096509}}\n```\n\n## References\nThe AudioDec repository is developed based on the following repositories.\n\n- [kan-bayashi\u002FParallelWaveGAN](https:\u002F\u002Fgithub.com\u002Fkan-bayashi\u002FParallelWaveGAN)\n- [r9y9\u002Fwavenet_vocoder](https:\u002F\u002Fgithub.com\u002Fr9y9\u002Fwavenet_vocoder)\n- [jik876\u002Fhifi-gan](https:\u002F\u002Fgithub.com\u002Fjik876\u002Fhifi-gan)\n- [lucidrains\u002Fvector-quantize-pytorch](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fvector-quantize-pytorch)\n- [chomeyama\u002FSiFiGAN](https:\u002F\u002Fgithub.com\u002Fchomeyama\u002FSiFiGAN)\n\n\n## License\nThe majority of \"AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec\" is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: https:\u002F\u002Fgithub.com\u002Fkan-bayashi\u002FParallelWaveGAN, https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fvector-quantize-pytorch, https:\u002F\u002Fgithub.com\u002Fjik876\u002Fhifi-gan, https:\u002F\u002Fgithub.com\u002Fr9y9\u002Fwavenet_vocoder, and https:\u002F\u002Fgithub.com\u002Fchomeyama\u002FSiFiGAN are licensed under the MIT license.\n\n## FAQ\n1. **Have you compared AudioDec with Encodec?**  \n Please refer to the [discussion](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Fissues\u002F1).\n2. **Have you compared AudioDec with other non-neural-network codecs such as Opus?**  \nSince this paper focuses on providing a well-developed streamable neural codec implementation with an efficient training paradigm and modularized architecture, we only compared AudioDec with SoundStream.\n3. 
**Can you also release the pre-trained discriminators?**  \nFor many applications such as denoising, updating only the encoder achieves almost the same performance as updating the whole model. For applications involving decoder updating such as binaural rendering, it might be better to design specific discriminators for that application. Therefore, we release only the generators.\n4. **Can AudioDec encode\u002Fdecode multi-channel signals?**  \nYes, you can train a MIMO model by changing the input_channels and output_channels in the config. One lesson I learned in training a MIMO model is that although the generator is MIMO, reshaping the generator output signal to mono for the following discriminator will markedly improve the MIMO audio quality.\n\n\n\n\n\n","\u003Ca rel=\"license\" href=\"http:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc\u002F4.0\u002F\">\u003Cimg alt=\"知识共享许可协议\" style=\"border-width:0\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_AudioDec_readme_1f8182c1eb00.png\" \u002F>\u003C\u002Fa> 本作品采用知识共享署名-非商业性使用4.0国际许可协议授权。\n\n# AudioDec：一款开源的流式高保真神经音频编解码器\n\n### 亮点\n- 针对**48 kHz**单声道语音的可流式高保真音频编解码器，比特率为**12.8 kbps**。\n- 在**GPU（约6 ms）**和**CPU（约10 ms）**上，使用4个线程时解码延迟极低。\n- 高效的两阶段训练流程（借助预训练模型，为新应用训练编码器仅需数小时）。\n\n### 摘要\n适用于电信等实时应用的良好音频编解码器应具备三个关键特性：(1) 压缩率，即传输信号所需的比特率应尽可能低；(2) 延迟，即编码和解码速度需足够快，以实现无延迟或仅有极小延迟的通信；(3) 信号重建质量。在本工作中，我们提出了一种开源、可流式且支持实时处理的神经音频编解码器，该编解码器在这三个方面均表现出色：它能够在仅12 kbps的比特率下重建高度自然的48 kHz语音信号，并且具有低于6 ms（GPU）\u002F10 ms（CPU）的延迟。此外，我们还展示了一种高效的训练范式，可用于开发面向实际场景的此类神经音频编解码器。[[论文](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10096509)] [[演示](https:\u002F\u002Fbigpon.github.io\u002FAudioDec_demo\u002F)]\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_AudioDec_readme_3b7f7f0b36c3.png\"\u002F>\n\u003C\u002Fp>\n\n### AudioDec 的两种模式\n1. 自编码器（对称AudioDec，**symAD**）\n   1-1. 使用纯度量损失，在前20万次迭代中从头开始训练自编码器模型。\n   1-2. 
固定编码器、投影器、量化器和码本，在接下来的50万次迭代中使用判别器训练解码器。\n2. 自编码器 + 声码器（**AD v0,1,2**，推荐！）\n   2-1. 提取由已训练编码器提取的码的统计信息（全局均值和方差）。\n   2-2. 使用已训练的编码器和这些统计信息，训练声码器50万次迭代。\n\n## 最新消息\n- **2025年3月3日**：后续工作[FlowDec](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FFlowDec)已发布。\n- **2024年1月3日**：更新了预训练模型（[issue9](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Fissues\u002F9) 和 [issue11](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Fissues\u002F11)）。\n- **2023年5月17日**：将演示音频上传至[演示页面](https:\u002F\u002Fbigpon.github.io\u002FAudioDec_demo\u002F)。\n- **2023年5月13日**：发布了1.0版本。\n\n## 系统要求\n本仓库已在Ubuntu 20.04系统上，使用V100显卡及以下配置进行测试：\n- Python 3.8+\n- CUDA 11.0+\n- PyTorch 1.10+\n\n## 文件夹结构\n- **bin**:\n  训练、统计信息提取、测试和流媒体模板所在的文件夹。\n- **config**:\n  配置文件（.yaml）存放的文件夹。\n- **dataloader**:\n  数据加载的源代码。\n- **exp**:\n  用于保存模型的文件夹。\n- **layers**:\n  基础神经网络层的源代码。\n- **losses**:\n  损失函数的源代码。\n- **models**:\n  模型的源代码。\n- **slurmlogs**:\n  用于保存Slurm作业日志的文件夹。\n- **stats**:\n  用于保存统计信息的文件夹。\n- **trainer**:\n  训练器的源代码。\n- **utils**:\n  演示相关工具的源代码。\n\n## 运行实时流式编解码演示\n1. 请下载整个[exp](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Freleases\u002Fdownload\u002Fpretrain_models_v02\u002Fexp.zip)文件夹，并将其放置在AudioDec项目目录中。\n2. 获取所有I\u002FO设备列表\n```bash\n$ python -m sounddevice\n```\n3. 运行演示\n```bash\n# 推荐使用LibriTTS模型，因为它对麦克风通道不匹配具有较强的鲁棒性。\n# 根据I\u002FO设备列表设置输入输出设备\n\n# 使用GPU\n$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model libritts_v1\n\n# 使用CPU\n$ python demoStream.py --tx_cuda -1 --rx_cuda -1 --input_device 1 --output_device 4 --model libritts_sym\n\n# 输入和输出音频将分别保存为input.wav和output.wav\n```\n\n## 使用文件运行编解码演示\n1. 请下载整个[exp](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Freleases\u002Fdownload\u002Fpretrain_models_v02\u002Fexp.zip)文件夹，并将其放置在AudioDec项目目录中。\n2. 
运行演示\n```bash\n## VCTK 48000Hz模型\n$ python demoFile.py --model vctk_v1 -i xxx.wav -o ooo.wav\n\n## LibriTTS 24000Hz模型\n$ python demoFile.py --model libritts_v1 -i xxx.wav -o ooo.wav\n```\n\n\n## 训练和测试完整的AudioDec流水线\n1. 准备训练\u002F验证\u002F测试语料，并将其分别放入三个不同的文件夹中，\n   例如：**corpus\u002Ftrain**、**corpus\u002Fdev**和**corpus\u002Ftest**。\n2. 修改路径（如：\u002Fmnt\u002Fhome\u002Fxxx\u002Fdatasets）至以下文件中：\n   **submit_codec_vctk.sh**\n   **config\u002Fautoencoder\u002FsymAD_vctk_48000_hop300.yaml**\n   **config\u002Fstatistic\u002FsymAD_vctk_48000_hop300_clean.yaml**\n   **config\u002Fvocoder\u002FAudioDec_v1_symAD_vctk_48000_hop300_clean.yaml**\n3. 在以下配置文件中指定相应的`analyzer`和`stats`：\n   **config\u002Fstatistic\u002FsymAD_vctk_48000_hop300_clean.yaml**\n   **config\u002Fvocoder\u002FAudioDec_v1_symAD_vctk_48000_hop300_clean.yaml**\n4. 按照**submit_codec_vctk.sh**中的使用说明，执行训练和测试。\n```bash\n# 阶段0：从零开始训练自编码器\n# 阶段1：提取统计信息\n# 阶段2：从零开始训练声码器\n# 阶段3：测试（对称自编码器）\n# 阶段4：测试（自编码器+声码器）\n\n# 执行阶段0至4\n$ bash submit_codec.sh --start 0 --stop 4 \\\n--autoencoder \"autoencoder\u002FsymAD_vctk_48000_hop300\" \\\n--statistic \"stati\u002FsymAD_vctk_48000_hop300_clean\" \\\n--vocoder \"vocoder\u002FAudioDec_v1_symAD_vctk_48000_hop300_clean\"\n```\n\n\n## 仅训练和测试自编码器\n1. 准备训练\u002F验证\u002F测试语料，并修改路径。\n2. 
按照**submit_autoencoder.sh**中的使用说明，执行训练和测试。\n```bash\n# 从零开始训练自编码器\n$ bash submit_autoencoder.sh --stage 0 \\\n--tag_name \"autoencoder\u002FsymAD_vctk_48000_hop300\"\n\n# 继续之前迭代的自编码器训练\n$ bash submit_autoencoder.sh --stage 1 \\\n--tag_name \"autoencoder\u002FsymAD_vctk_48000_hop300\" \\\n--resumepoint 200000\n\n# 测试自编码器\n$ bash submit_autoencoder.sh --stage 2 \\\n--tag_name \"autoencoder\u002FsymAD_vctk_48000_hop300\"\n--subset \"clean_test\"\n```\n\n## 预训练模型\n所有预训练模型均可通过 [exp](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Freleases\u002Fdownload\u002Fpretrain_models_v02\u002Fexp.zip) 获取（仅提供生成器）。\n\n| 自编码器 | 语料库 | 采样率 | 码率 | 路径 |\n|---  |---  |---  |---  |---  |\n| symAD | VCTK | 48 kHz | 24 kbps | `exp\u002Fautoencoder\u002FsymAD_c16_vctk_48000_hop320`  |\n| symAAD | VCTK | 48 kHz | 12.8 kbps  | `exp\u002Fautoencoder\u002FsymAAD_vctk_48000_hop300`  |\n| symAD | VCTK | 48 kHz | 12.8 kbps | `exp\u002Fautoencoder\u002FsymAD_vctk_48000_hop300`  |\n| symAD_univ | VCTK | 48 kHz | 12.8 kbps  | `exp\u002Fautoencoder\u002FsymADuniv_vctk_48000_hop300`  |\n| symAD | LibriTTS | 24 kHz | 6.4 kbps  | `exp\u002Fautoencoder\u002FsymAD_libritts_24000_hop300`  |\n\n\n\n| 声码器 | 语料库 | 采样率 | 路径 |\n|---  |---  |---  |---  |\n| AD v0 | VCTK | 48 kHz | `exp\u002Fvocoder\u002FAudioDec_v0_symAD_vctk_48000_hop300_clean` |\n| AD v1 | VCTK | 48 kHz | `exp\u002Fvocoder\u002FAudioDec_v1_symAD_vctk_48000_hop300_clean` |\n| AD v2 | VCTK | 48 kHz | `exp\u002Fvocoder\u002FAudioDec_v2_symAD_vctk_48000_hop300_clean` |\n| AD_univ | VCTK | 48 kHz | `exp\u002Fvocoder\u002FAudioDec_v3_symADuniv_vctk_48000_hop300_clean` |\n| AD v1 | LibriTTS | 24 kHz | `exp\u002Fvocoder\u002FAudioDec_v1_symAD_libritts_24000_hop300_clean` |\n\n\n\n\n## 附录：降噪\n1. 只需使用带噪-干净语音对更新编码器（保持解码器\u002F声码器不变），即可轻松实现降噪。\n2. 
准备带噪-干净语音语料，并按照 **submit_denoise.sh** 中的使用说明运行训练和测试。\n```bash\n# 更新用于降噪的编码器\n$ bash submit_denoise.sh --stage 0 \\\n--tag_name \"denoise\u002FsymAD_vctk_48000_hop300\"\n\n# 降噪\n$ bash submit_denoise.sh --stage 2 \\\n--encoder \"denoise\u002FsymAD_vctk_48000_hop300\"\n--decoder \"vocoder\u002FAudioDec_v1_symAD_vctk_48000_hop300_clean\"\n--encoder_checkpoint 200000\n--decoder_checkpoint 500000\n--subset \"noisy_test\"\n\n# 使用 GPU 的流式演示\n$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model vctk_denoise\n\n# 使用文件的编解码演示\n$ python demoFile.py -i xxx.wav -o ooo.wav --model vctk_denoise\n\n```\n\n\n## 引用\n如果您觉得本代码有所帮助，请引用以下论文。\n\n```\n@INPROCEEDINGS{10096509,\n  author={Wu, Yi-Chiao and Gebru, Israel D. and Marković, Dejan and Richard, Alexander},\n  booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, \n  title={{A}udio{D}ec: An Open-Source Streaming High-Fidelity Neural Audio Codec}, \n  year={2023},\n  doi={10.1109\u002FICASSP49357.2023.10096509}}\n```\n\n## 参考文献\nAudioDec 仓库基于以下项目开发：\n\n- [kan-bayashi\u002FParallelWaveGAN](https:\u002F\u002Fgithub.com\u002Fkan-bayashi\u002FParallelWaveGAN)\n- [r9y9\u002Fwavenet_vocoder](https:\u002F\u002Fgithub.com\u002Fr9y9\u002Fwavenet_vocoder)\n- [jik876\u002Fhifi-gan](https:\u002F\u002Fgithub.com\u002Fjik876\u002Fhifi-gan)\n- [lucidrains\u002Fvector-quantize-pytorch](https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fvector-quantize-pytorch)\n- [chomeyama\u002FSiFiGAN](https:\u002F\u002Fgithub.com\u002Fchomeyama\u002FSiFiGAN)\n\n\n## 许可证\n“AudioDec: 一个开源的高保真流式神经音频编解码器”的大部分内容采用 CC-BY-NC 许可证，但项目的部分内容则遵循单独的许可条款：https:\u002F\u002Fgithub.com\u002Fkan-bayashi\u002FParallelWaveGAN、https:\u002F\u002Fgithub.com\u002Flucidrains\u002Fvector-quantize-pytorch、https:\u002F\u002Fgithub.com\u002Fjik876\u002Fhifi-gan、https:\u002F\u002Fgithub.com\u002Fr9y9\u002Fwavenet_vocoder 以及 https:\u002F\u002Fgithub.com\u002Fchomeyama\u002FSiFiGAN 均采用 MIT 许可证。\n\n## 
常见问题解答\n1. **您是否将 AudioDec 与 Encodec 进行过比较？**  \n   请参阅 [讨论](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Fissues\u002F1)。\n2. **您是否将 AudioDec 与 Opus 等非神经网络编解码器进行过比较？**  \n   由于本文重点在于提供一种具有高效训练范式和模块化架构的成熟流式神经编解码器实现，我们仅将 AudioDec 与 SoundStream 进行了比较。\n3. **是否也会发布预训练的判别器？**  \n   对于许多应用（如降噪），仅更新编码器即可达到与更新整个模型几乎相同的效果。而对于涉及解码器更新的应用（如双耳渲染），可能更适合为该特定应用设计专用判别器。因此，我们仅发布生成器。\n4. **AudioDec 是否可以编解码多声道信号？**  \n   是的，您可以通过修改配置中的 input_channels 和 output_channels 来训练 MIMO 模型。我在训练 MIMO 模型时学到的一点经验是：尽管生成器本身是 MIMO 的，但在后续判别器处理前将生成器输出信号重塑为单声道，能够显著提升 MIMO 音质。","# AudioDec 快速上手指南\n\nAudioDec 是一个开源的流式高保真神经音频编解码器，专为实时应用（如电信）设计。它支持 **48 kHz** 单声道语音，比特率低至 **12.8 kbps**，并在 GPU 上实现约 **6 ms**、CPU 上约 **10 ms** 的极低解码延迟。\n\n## 环境准备\n\n本项目已在 **Ubuntu 20.04** 系统上使用 V100 GPU 测试通过。请确保满足以下前置依赖：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+)\n*   **Python**: 3.8 或更高版本\n*   **CUDA**: 11.0 或更高版本\n*   **PyTorch**: 1.10 或更高版本\n\n> **提示**：国内用户建议使用清华源或阿里源加速 PyTorch 及相关依赖的安装。\n\n## 安装步骤\n\n1.  **克隆仓库**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec.git\n    cd AudioDec\n    ```\n\n2.  **安装 Python 依赖**\n    虽然原文未提供 `requirements.txt`，但根据项目结构，通常需要安装基础音频处理和深度学习库。请确保已安装 PyTorch，然后安装其他必要包（如 `sounddevice`, `librosa`, `pyyaml` 等）：\n    ```bash\n    pip install sounddevice librosa pyyaml numpy scipy\n    # 如果尚未安装 PyTorch，请参考 pytorch.org 使用国内镜像安装\n    # 例如：pip install torch torchvision torchaudio --index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n3.  
**下载预训练模型**\n    运行演示或推理前，必须下载预训练模型文件夹 (`exp`) 并解压到项目根目录。\n    \n    *   **下载地址**: [exp.zip](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Freleases\u002Fdownload\u002Fpretrain_models_v02\u002Fexp.zip)\n    *   **操作**: 下载后解压，确保 `exp` 文件夹位于 `AudioDec` 项目目录下。\n    \n    ```bash\n    # 示例：使用 wget 下载（需自行替换为实际下载链接或手动下载后上传）\n    # wget https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Freleases\u002Fdownload\u002Fpretrain_models_v02\u002Fexp.zip\n    # unzip exp.zip\n    ```\n\n## 基本使用\n\n### 1. 文件编解码演示（最简单用法）\n\n对现有的音频文件进行编码和解码。\n\n*   **VCTK 数据集模型 (48kHz)**:\n    ```bash\n    python demoFile.py --model vctk_v1 -i input.wav -o output.wav\n    ```\n\n*   **LibriTTS 数据集模型 (24kHz)**:\n    ```bash\n    python demoFile.py --model libritts_v1 -i input.wav -o output.wav\n    ```\n    *注：请将 `input.wav` 替换为你的输入文件路径，`output.wav` 为输出文件路径。*\n\n### 2. 实时流式编解码演示\n\n通过麦克风输入并实时播放解码后的音频。\n\n1.  **查看音频设备列表**：\n    ```bash\n    python -m sounddevice\n    ```\n    记下输入设备 (input_device) 和输出设备 (output_device) 的 ID。\n\n2.  **运行流式演示**：\n\n    *   **使用 GPU (推荐 LibriTTS 模型，对麦克风通道不匹配更具鲁棒性)**：\n        ```bash\n        python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model libritts_v1\n        ```\n        *(请根据上一步查到的设备 ID 修改 `--input_device` 和 `--output_device`)*\n\n    *   **使用 CPU**:\n        ```bash\n        python demoStream.py --tx_cuda -1 --rx_cuda -1 --input_device 1 --output_device 4 --model libritts_sym\n        ```\n\n    *运行后，程序将捕获麦克风输入并实时从扬声器播放重建的音频，同时保存 `input.wav` 和 `output.wav` 到当前目录。*\n\n### 3. 
进阶：降噪功能 (Bonus)\n\nAudioDec 支持通过仅更新编码器来实现降噪。\n\n*   **文件降噪处理**：\n    ```bash\n    python demoFile.py -i noisy_input.wav -o denoised_output.wav --model vctk_denoise\n    ```\n\n*   **实时流式降噪**：\n    ```bash\n    python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model vctk_denoise\n    ```","某远程医疗平台正在构建一套支持高清听诊音实时传输的在线问诊系统，要求医生能清晰分辨患者心肺杂音细节。\n\n### 没有 AudioDec 时\n- **音质严重受损**：传统编码为节省带宽将采样率降至 8kHz 或 16kHz，导致高频病理杂音丢失，医生难以准确诊断。\n- **网络带宽压力大**：传输未压缩或低效压缩的 48kHz 原始音频需占用极高带宽，在弱网环境下极易卡顿或中断。\n- **沟通延迟明显**：现有编解码方案在普通服务器上处理延迟往往超过 50ms，造成医患对话出现可感知的“抢话”现象。\n- **部署成本高昂**：若要兼顾音质与低延迟，通常需依赖昂贵的专用硬件加速卡，增加了中小医院的接入门槛。\n\n### 使用 AudioDec 后\n- **高保真还原病灶音**：AudioDec 能在 12.8 kbps 极低码率下完整保留 48kHz 采样率的音频细节，让细微的心肺杂音清晰可辨。\n- **弱网环境流畅传输**：凭借极高的压缩效率，仅需极小带宽即可传输高清音频，显著提升了移动网络下的通话稳定性。\n- **实现近乎零延迟互动**：利用其 GPU 约 6ms、CPU 约 10ms 的超低解码延迟，确保了医患交流如面对面般自然顺畅。\n- **通用硬件轻松落地**：无需专用硬件，仅在常规云服务器甚至四核 CPU 上即可运行，大幅降低了系统的部署与运维成本。\n\nAudioDec 通过突破性的神经编解码技术，成功在极低带宽下实现了医疗级高清音频的实时无损传输，让远程听诊真正具备临床价值。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_AudioDec_0c147d67.png","facebookresearch","Meta Research","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Ffacebookresearch_449342bd.png","",null,"https:\u002F\u002Fopensource.fb.com","https:\u002F\u002Fgithub.com\u002Ffacebookresearch",[84,88],{"name":85,"color":86,"percentage":87},"Python","#3572A5",89.1,{"name":89,"color":90,"percentage":91},"Shell","#89e051",10.9,501,30,"2026-04-01T06:52:18","NOASSERTION","Linux","非必需（支持 CPU 运行），如需 GPU 加速需 NVIDIA 显卡（测试环境为 V100），CUDA 11.0+","未说明",{"notes":100,"python":101,"dependencies":102},"该项目已在 Ubuntu 20.04 和 V100 GPU 上完成测试。支持纯 CPU 模式运行（解码延迟约 10ms），GPU 模式下延迟更低（约 6ms）。运行实时流式演示或文件处理前，需下载预训练模型文件夹（exp）。许可证为 CC-BY-NC（知识共享署名 - 
非商业性使用），仅限非商业用途。","3.8+",[103,104,105],"torch>=1.10","sounddevice","pyyaml",[21],"2026-03-27T02:49:30.150509","2026-04-06T18:52:34.955413",[110,115,119,124,128,132,137,142,146],{"id":111,"question_zh":112,"answer_zh":113,"source_url":114},19830,"如何执行去噪（denoising）任务？提交脚本是 submit_denoise.sh 还是 submit_autoencoder.sh？","请按照 README 中 Bonus Track 部分的说明，准备含噪 - 纯净语料库，并遵循 `submit_denoise.sh` 中的使用说明来运行训练和测试。虽然代码示例中可能提到 `submit_autoencoder.sh`，但去噪任务应参考 `submit_denoise.sh` 的流程。对于数据长度不匹配的问题，可以通过预处理纯净语音使其长度符合规则（例如是 hop_size 的倍数），然后生成 Type I（混合噪声后经过第一阶段 AudioDec 处理）和 Type II（仅经过第一阶段 AudioDec 处理）的含噪数据。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Fissues\u002F35",{"id":116,"question_zh":117,"answer_zh":118,"source_url":114},19831,"AudioDec 输出音频长度与输入不一致的原因是什么？如何解决？","长度不匹配是由 CNN 的填充（padding）引起的。在训练过程中，您可以控制采样长度。只要输入长度是 AudioDec hop_size 的倍数（例如 3x4x5x5=300），输出长度就会与输入长度完全匹配。例如，如果纯净数据长度为 [1, 36600]（300 的倍数），则可以得到长度完全一致的输出。建议在预处理阶段将音频裁剪或填充为 hop_size 的倍数。",{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},19832,"从头开始训练时，困惑度（perplexity）和 VQ loss 立即上升是否正常？","如果在第二阶段训练中出现 mode collapse（模式崩溃），vq_loss 和 mel_loss 可能会显著升高。如果模型训练正常，这两个损失值通常只会轻微增加（约 1-2）。建议在第二阶段训练期间密切监控 mel_loss 和 vq_loss。此外，在稳定的 GAN 模型中，real_loss 通常与 fake_loss 相似。如果损失持续飙升，可能需要检查学习率、数据预处理或增加第一阶段的迭代次数。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Fissues\u002F8",{"id":125,"question_zh":126,"answer_zh":127,"source_url":123},19833,"针对不同数据集（如 VCTK 和 LibriTTS），推荐的训练迭代次数和最终 mel_loss 是多少？","最终的 mel_loss 因数据集而异：VCTK 约为 17，LibriTTS 约为 22。训练时间取决于迭代次数和 batch_size。使用 A100 GPU，batch_size 为 0.2 秒时，仅指标训练需 200k 次迭代，加上对抗训练共需 500k 次迭代，大约需要 2 天。经验表明，第一阶段的迭代次数应根据训练数据量调整（VCTK 设为 200k，LibriTTS 设为 500k），而第二阶段的迭代次数对不同数据集可以保持一致。",{"id":129,"question_zh":130,"answer_zh":131,"source_url":123},19834,"如何配置以跳过第一阶段训练，直接进行第二阶段（+HiFiGAN 判别器）训练？","如果您想跳过第一阶段直接进入第二阶段，可以在配置文件（.yaml）中将判别器的启动步数（start_steps）设置为 0。具体配置如下：\nstart_steps:\n    generator: 0\n    discriminator: 0\ntrain_max_steps: 
500000\n注意：虽然论文显示第一阶段提升较小，但跳过它可能会导致模型稳定性下降，需谨慎操作。",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},19835,"生成器损失（generator loss）一直增加是否正常？生成器和判别器的正确损失范围是多少？","在 GAN 训练中，生成器损失波动甚至暂时上升可能是正常的，特别是在训练初期或遇到模式崩溃时。关键在于监控 mel_loss 和 vq_loss：如果模型状态良好，这些损失应仅轻微上升。对于判别器，real_loss 和 fake_loss 在稳定状态下应数值相近。没有绝对的“正确”损失值，需结合具体数据集观察趋势。如果发现损失异常飙升，建议检查判别器类型组合（如使用 MPDs + multi-resolution STFTDs 通常效果较好）。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Fissues\u002F19",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},19836,"复现论文结果时，测试指标（如 DNSMOS）与论文不符，可能的原因是什么？","测试结果差异可能源于评估代码或数据集版本的不同。作者使用的是微软 DNS-Challenge 仓库中的 DNSMOS 模型进行评估（链接：https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDNS-Challenge\u002Ftree\u002Fmaster\u002FDNSMOS），并使用 `clean_test_wav` 数据集。如果您的得分较低（如 3.4-3.5），请确认是否使用了相同的评估脚本、模型版本以及数据预处理流程。此外，基线模型的选择（如 vctk_v1）也会影响结果。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FAudioDec\u002Fissues\u002F12",{"id":143,"question_zh":144,"answer_zh":145,"source_url":123},19837,"Codebook 使用率低导致质量下降，有什么解决办法？","减少 codebook 数量会导致音质显著下降，因此不建议这样做。更好的方法是采用先进的技术来提高 codebook 的使用率。虽然本仓库未集成此类技术，但您可以参考其他流行的神经编解码器（neural codec）仓库，寻找如 codebook 负载均衡、温度调节等改进技巧来优化训练效果。",{"id":147,"question_zh":148,"answer_zh":149,"source_url":136},19838,"推荐使用的判别器组合有哪些？多分辨率 STFT 判别器是指什么？","内部实验表明，使用的判别器类型越多，性能通常越好。但在实际应用中，使用 MPDs（多周期判别器）+ multi-resolution STFTDs（多分辨率短时傅里叶变换判别器）通常已经足够好，这也与其他相关工作的结果一致。多分辨率 STFT 判别器类似于 UnivNet 提出的 MSRD（multi-resolution spectrogram discriminator），旨在捕捉不同分辨率下的频谱特征以提升生成质量。",[151],{"id":152,"version":153,"summary_zh":154,"released_at":155},117882,"pretrain_models_v02","整个“exp”文件夹包含所有预训练模型。\n文件夹结构如下：\n- exp\n  - autoencoder\n    - symAD_libritts_24000_hop300\n    - symAD_vctk_48000_hop300\n    - symADuniv_vctk_48000_hop300\n    - symAAD_vctk_48000_hop300\n    - symAD_c16_vctk_48000_hop320\n  - denoise\n    - symAD_vctk_48000_hop300\n  - vocoder\n    - AudioDec_v0_symAD_vctk_48000_hop300_clean\n    - AudioDec_v1_symAD_libritts_24000_hop300_clean\n 
   - AudioDec_v1_symAD_vctk_48000_hop300_clean\n    - AudioDec_v2_symAD_vctk_48000_hop300_clean\n    - AudioDec_v3_symADuniv_vctk_48000_hop300_clean","2024-01-03T21:14:14"]