[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-NVIDIA--BigVGAN":3,"tool-NVIDIA--BigVGAN":64},[4,23,32,40,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,2,"2026-04-10T11:13:16",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},4128,"GPT-SoVITS","RVC-Boss\u002FGPT-SoVITS","GPT-SoVITS 是一款强大的开源语音合成与声音克隆工具，旨在让用户仅需极少量的音频数据即可训练出高质量的个性化语音模型。它核心解决了传统语音合成技术依赖海量录音数据、门槛高且成本大的痛点，实现了“零样本”和“少样本”的快速建模：用户只需提供 5 秒参考音频即可即时生成语音，或使用 1 分钟数据进行微调，从而获得高度逼真且相似度极佳的声音效果。\n\n该工具特别适合内容创作者、独立开发者、研究人员以及希望为角色配音的普通用户使用。其内置的友好 WebUI 界面集成了人声伴奏分离、自动数据集切片、中文语音识别及文本标注等辅助功能，极大地降低了数据准备和模型训练的技术门槛，让非专业人士也能轻松上手。\n\n在技术亮点方面，GPT-SoVITS 不仅支持中、英、日、韩、粤语等多语言跨语种合成，还具备卓越的推理速度，在主流显卡上可实现实时甚至超实时的生成效率。无论是需要快速制作视频配音，还是进行多语言语音交互研究，GPT-SoVITS 都能以极低的数据成本提供专业级的语音合成体验。",56375,3,"2026-04-05T22:15:46",[21],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},2863,"TTS","coqui-ai\u002FTTS","🐸TTS 是一款功能强大的深度学习文本转语音（Text-to-Speech）开源库，旨在将文字自然流畅地转化为逼真的人声。它解决了传统语音合成技术中声音机械生硬、多语言支持不足以及定制门槛高等痛点，让高质量的语音生成变得触手可及。\n\n无论是希望快速集成语音功能的开发者，还是致力于探索前沿算法的研究人员，亦或是需要定制专属声音的数据科学家，🐸TTS 都能提供得力支持。它不仅预置了覆盖全球 1100 多种语言的训练模型，让用户能够即刻上手，还提供了完善的工具链，支持用户利用自有数据训练新模型或对现有模型进行微调，轻松实现特定风格的声音克隆。\n\n在技术亮点方面，🐸TTS 表现卓越。其最新的 ⓍTTSv2 模型支持 16 种语言，并在整体性能上大幅提升，实现了低于 200 毫秒的超低延迟流式输出，极大提升了实时交互体验。此外，它还无缝集成了 🐶Bark、🐢Tortoise 等社区热门模型，并支持调用上千个 Fairseq 模型，展现了极强的兼容性与扩展性。配合丰富的数据集分析与整理工具，🐸TTS 已成为科研与生产环境中备受信赖的语音合成解决方案。",44971,"2026-04-03T14:47:02",[21,20,13],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":29,"last_commit_at":46,"category_tags":47,"status":22},2375,"LocalAI","mudler\u002FLocalAI","LocalAI 是一款开源的本地人工智能引擎，旨在让用户在任意硬件上轻松运行各类 AI 模型，包括大语言模型、图像生成、语音识别及视频处理等。它的核心优势在于彻底打破了高性能计算的门槛，无需昂贵的专用 GPU，仅凭普通 CPU 或常见的消费级显卡（如 NVIDIA、AMD、Intel 及 Apple Silicon）即可部署和运行复杂的 AI 任务。\n\n对于担心数据隐私的用户而言，LocalAI 提供了“隐私优先”的解决方案，确保所有数据处理均在本地基础设施内完成，无需上传至云端。同时，它完美兼容 OpenAI、Anthropic 等主流 API 接口，这意味着开发者可以无缝迁移现有应用，直接利用本地资源替代云服务，既降低了成本又提升了可控性。\n\nLocalAI 内置了超过 35 种后端支持（如 llama.cpp、vLLM、Whisper 等），并集成了自主 AI 代理、工具调用及检索增强生成（RAG）等高级功能，且具备多用户管理与权限控制能力。无论是希望保护敏感数据的企业开发者、进行算法实验的研究人员，还是想要在个人电脑上体验最新 AI 技术的极客玩家，都能通过 LocalAI 获",44782,"2026-04-02T22:14:26",[13,21,19,17,20,14,16],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":29,"last_commit_at":54,"category_tags":55,"status":22},3108,"bark","suno-ai\u002Fbark","Bark 是由 Suno 推出的开源生成式音频模型，能够根据文本提示创造出高度逼真的多语言语音、音乐、背景噪音及简单音效。与传统仅能朗读文字的语音合成工具不同，Bark 基于 Transformer 架构，不仅能模拟说话，还能生成笑声、叹息、哭泣等非语言声音，甚至能处理带有情感色彩和语气停顿的复杂文本，极大地丰富了音频表达的可能性。\n\n它主要解决了传统语音合成声音机械、缺乏情感以及无法生成非语音类音效的痛点，让创作者能通过简单的文字描述获得生动自然的音频素材。无论是需要为视频配音的内容创作者、探索多模态生成的研究人员，还是希望快速原型设计的开发者，都能从中受益。普通用户也可通过集成的演示页面轻松体验其神奇效果。\n\n技术亮点方面，Bark 支持商业使用（MIT 许可），并在近期更新中实现了显著的推理速度提升，同时提供了适配低显存 GPU 
的版本，降低了使用门槛。此外，社区还建立了丰富的提示词库，帮助用户更好地驾驭模型生成特定风格的声音。只需几行 Python 代码，即可将创意文本转化为高质量音频，是连接文字与声音世界的强大桥梁。",39067,"2026-04-04T03:33:35",[21],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":29,"last_commit_at":62,"category_tags":63,"status":22},5908,"ChatTTS","2noise\u002FChatTTS","ChatTTS 是一款专为日常对话场景打造的生成式语音模型，特别适用于大语言模型助手等交互式应用。它主要解决了传统文本转语音（TTS）技术在对话中缺乏自然感、情感表达单一以及难以处理停顿、笑声等细微语气的问题，让机器生成的语音听起来更像真人在聊天。\n\n这款工具非常适合开发者、研究人员以及希望为应用增添自然语音交互功能的设计师使用。普通用户也可以通过社区开发的衍生产品体验其能力。ChatTTS 的核心亮点在于其对对话任务的深度优化：它不仅支持中英文双语，还能精准控制韵律细节，自动生成自然的 laughter（笑声）、pauses（停顿）和 interjections（插入语），从而实现多说话人的互动对话效果。在韵律表现上，ChatTTS 超越了大多数开源 TTS 模型。目前开源版本基于 4 万小时数据预训练而成，虽主要用于学术研究与教育目的，但已展现出强大的潜力，并支持流式音频生成与零样本推理，为后续的多情绪控制等进阶功能奠定了基础。",39042,"2026-04-09T11:54:03",[19,17,20,21],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":99,"forks":100,"last_commit_at":101,"license":102,"difficulty_score":29,"env_os":103,"env_gpu":104,"env_ram":105,"env_deps":106,"category_tags":116,"github_topics":117,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":22,"created_at":124,"updated_at":125,"faqs":126,"releases":127},8182,"NVIDIA\u002FBigVGAN","BigVGAN","Official PyTorch implementation of BigVGAN (ICLR 2023)","BigVGAN 是一款由英伟达推出的通用神经声码器，旨在将梅尔频谱图等声学特征高保真地还原为自然流畅的音频波形。它主要解决了传统声码器在合成语音时容易出现的机械感强、细节丢失以及在跨语言或复杂声音场景下表现不佳的问题，能够高质量地处理多语言语音、环境音效及乐器声音。\n\n这款工具非常适合从事语音合成（TTS）、歌声合成研究的科研人员，以及需要部署高质量音频生成模型的开发者。得益于其大规模训练策略，BigVGAN 具备极强的泛化能力。其技术亮点在于最新的 v2 版本引入了自定义的融合 CUDA 内核，将上采样与激活操作合并，使得在单张 A100 GPU 上的推理速度提升了 1.5 至 3 倍。此外，模型采用了多尺度子带 CQT 判别器和多尺度梅尔频谱损失函数，并支持高达 44kHz 的采样率。BigVGAN 已集成至 Hugging Face Hub，提供预训练权重和交互式演示，方便用户快速上手体验或进行二次开发。","## BigVGAN: A Universal Neural Vocoder with Large-Scale Training\n\n#### Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon\n\n[[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.04658) - [[Code]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN) - [[Showcase]](https:\u002F\u002Fbigvgan-demo.github.io\u002F) - [[Project Page]](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fadlr\u002Fprojects\u002Fbigvgan\u002F) - [[Weights]](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fnvidia\u002Fbigvgan-66959df3d97fd7d98d97dc9a) - [[Demo]](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fnvidia\u002FBigVGAN)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fbigvgan-a-universal-neural-vocoder-with-large\u002Fspeech-synthesis-on-libritts)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fspeech-synthesis-on-libritts?p=bigvgan-a-universal-neural-vocoder-with-large)\n\n\u003Ccenter>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA_BigVGAN_readme_abd7f47167ef.png\" width=\"800\">\u003C\u002Fcenter>\n\n## News\n- **Sep 2024 (v2.4):**\n  - We have updated the pretrained checkpoints trained for 5M steps. 
This is the final release of the BigVGAN-v2 checkpoints.\n\n- **Jul 2024 (v2.3):**\n  - General refactor and code improvements for improved readability.\n  - Fully fused CUDA kernel of anti-aliased activation (upsampling + activation + downsampling) with inference speed benchmark.\n\n- **Jul 2024 (v2.2):** The repository now includes an interactive local demo using gradio.\n\n- **Jul 2024 (v2.1):** BigVGAN is now integrated with 🤗 Hugging Face Hub with easy access to inference using pretrained checkpoints. We also provide an interactive demo on Hugging Face Spaces.\n\n- **Jul 2024 (v2):** We release BigVGAN-v2 along with pretrained checkpoints. Below are the highlights:\n  - Custom CUDA kernel for inference: we provide a fused upsampling + activation kernel written in CUDA for accelerated inference speed. Our test shows 1.5 - 3x faster speed on a single A100 GPU.\n  - Improved discriminator and loss: BigVGAN-v2 is trained using a [multi-scale sub-band CQT discriminator](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.14957) and a [multi-scale mel spectrogram loss](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.06546).\n  - Larger training data: BigVGAN-v2 is trained using datasets containing diverse audio types, including speech in multiple languages, environmental sounds, and instruments.\n  - We provide pretrained checkpoints of BigVGAN-v2 using diverse audio configurations, supporting up to 44 kHz sampling rate and 512x upsampling ratio.\n\n## Installation\n\nThe codebase has been tested on Python `3.10` and PyTorch `2.3.1` conda packages with either `pytorch-cuda=12.1` or `pytorch-cuda=11.8`. Below is an example command to create the conda environment:\n\n```shell\nconda create -n bigvgan python=3.10 pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia\nconda activate bigvgan\n```\n\nClone the repository and install dependencies:\n\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\ncd BigVGAN\npip install -r requirements.txt\n```\n\n## Inference Quickstart using 🤗 Hugging Face Hub\n\nThe example below shows how to use BigVGAN: load the pretrained BigVGAN generator from Hugging Face Hub, compute a mel spectrogram from an input waveform, and generate a synthesized waveform using the mel spectrogram as the model's input.\n\n```python\ndevice = 'cuda'\n\nimport torch\nimport bigvgan\nimport librosa\nfrom meldataset import get_mel_spectrogram\n\n# instantiate the model. You can optionally set use_cuda_kernel=True for faster inference.\nmodel = bigvgan.BigVGAN.from_pretrained('nvidia\u002Fbigvgan_v2_24khz_100band_256x', use_cuda_kernel=False)\n\n# remove weight norm in the model and set to eval mode\nmodel.remove_weight_norm()\nmodel = model.eval().to(device)\n\n# load wav file and compute mel spectrogram\nwav_path = '\u002Fpath\u002Fto\u002Fyour\u002Faudio.wav'\nwav, sr = librosa.load(wav_path, sr=model.h.sampling_rate, mono=True) # wav is np.ndarray with shape [T_time] and values in [-1, 1]\nwav = torch.FloatTensor(wav).unsqueeze(0) # wav is FloatTensor with shape [B(1), T_time]\n\n# compute mel spectrogram from the ground truth audio\nmel = get_mel_spectrogram(wav, model.h).to(device) # mel is FloatTensor with shape [B(1), C_mel, T_frame]\n\n# generate waveform from mel\nwith torch.inference_mode():\n    wav_gen = model(mel) # wav_gen is FloatTensor with shape [B(1), 1, T_time] and values in [-1, 1]\nwav_gen_float = wav_gen.squeeze(0).cpu() # wav_gen_float is FloatTensor with shape [1, T_time]\n\n# you can convert the generated waveform to 16 bit linear PCM\nwav_gen_int16 = (wav_gen_float * 32767.0).numpy().astype('int16') # wav_gen_int16 is np.ndarray with shape [1, T_time] and int16 dtype\n```
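\n\nTo write the result to disk, here is a minimal follow-up sketch (illustrative rather than part of the upstream example; it assumes `scipy` is installed, and it reuses `wav_gen_int16` and `model` from the snippet above; `output.wav` is a placeholder path):\n\n```python\n# save the generated int16 waveform as a mono WAV file (assumes scipy is available)\nfrom scipy.io import wavfile\n\n# wav_gen_int16 has shape [1, T_time]; index out the single channel to get a 1-D array\nwavfile.write('output.wav', model.h.sampling_rate, wav_gen_int16[0])\n```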
\n\n## Local gradio demo \u003Ca href='https:\u002F\u002Fgithub.com\u002Fgradio-app\u002Fgradio'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgradio-app\u002Fgradio'>\u003C\u002Fa>\n\nYou can run a local gradio demo using the command below:\n\n```shell\npip install -r demo\u002Frequirements.txt\npython demo\u002Fapp.py\n```\n\n## Training\n\nCreate symbolic links to the root of the dataset. The codebase uses file lists with paths relative to the dataset root. Below are example commands for the LibriTTS dataset:\n\n```shell\ncd filelists\u002FLibriTTS && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Ftrain-clean-100 train-clean-100 && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Ftrain-clean-360 train-clean-360 && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Ftrain-other-500 train-other-500 && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Fdev-clean dev-clean && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Fdev-other dev-other && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Ftest-clean test-clean && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Ftest-other test-other && \\\ncd ..\u002F..\n```\n\nTrain the BigVGAN model. Below is an example command for training BigVGAN-v2 using the LibriTTS dataset at 24kHz with a full 100-band mel spectrogram as input:\n\n```shell\npython train.py \\\n--config configs\u002Fbigvgan_v2_24khz_100band_256x.json \\\n--input_wavs_dir filelists\u002FLibriTTS \\\n--input_training_file filelists\u002FLibriTTS\u002Ftrain-full.txt \\\n--input_validation_file filelists\u002FLibriTTS\u002Fval-full.txt \\\n--list_input_unseen_wavs_dir filelists\u002FLibriTTS filelists\u002FLibriTTS \\\n--list_input_unseen_validation_file filelists\u002FLibriTTS\u002Fdev-clean.txt filelists\u002FLibriTTS\u002Fdev-other.txt \\\n--checkpoint_path exp\u002Fbigvgan_v2_24khz_100band_256x\n```\n\n## Synthesis\n\nSynthesize from a trained BigVGAN model. 
Below is an example command for generating audio from the model.\nIt computes mel spectrograms using wav files from `--input_wavs_dir` and saves the generated audio to `--output_dir`.\n\n```shell\npython inference.py \\\n--checkpoint_file \u002Fpath\u002Fto\u002Fyour\u002Fbigvgan_v2_24khz_100band_256x\u002Fbigvgan_generator.pt \\\n--input_wavs_dir \u002Fpath\u002Fto\u002Fyour\u002Finput_wav \\\n--output_dir \u002Fpath\u002Fto\u002Fyour\u002Foutput_wav\n```\n\n`inference_e2e.py` supports synthesis directly from mel spectrograms saved in `.npy` format, with shapes `[1, channel, frame]` or `[channel, frame]`.\nIt loads mel spectrograms from `--input_mels_dir` and saves the generated audio to `--output_dir`.\n\nMake sure that the STFT hyperparameters used to compute the mel spectrograms match those of the model, which are defined in the corresponding model's `config.json`.\n\n```shell\npython inference_e2e.py \\\n--checkpoint_file \u002Fpath\u002Fto\u002Fyour\u002Fbigvgan_v2_24khz_100band_256x\u002Fbigvgan_generator.pt \\\n--input_mels_dir \u002Fpath\u002Fto\u002Fyour\u002Finput_mel \\\n--output_dir \u002Fpath\u002Fto\u002Fyour\u002Foutput_wav\n```
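\n\nFor illustration, a minimal sketch of preparing such a `.npy` input (it reuses the `get_mel_spectrogram` helper and a pretrained model's config `h` from the quickstart above, which guarantees matching STFT hyperparameters; all paths are placeholders):\n\n```python\nimport numpy as np\nimport torch\nimport librosa\nimport bigvgan\nfrom meldataset import get_mel_spectrogram\n\n# reuse a pretrained model's config h so the STFT hyperparameters match the checkpoint\nmodel = bigvgan.BigVGAN.from_pretrained('nvidia\u002Fbigvgan_v2_24khz_100band_256x', use_cuda_kernel=False)\n\nwav, sr = librosa.load('\u002Fpath\u002Fto\u002Fyour\u002Faudio.wav', sr=model.h.sampling_rate, mono=True)\nwav = torch.FloatTensor(wav).unsqueeze(0) # [1, T_time]\nmel = get_mel_spectrogram(wav, model.h) # [1, C_mel, T_frame]\n\n# inference_e2e.py accepts shapes [1, channel, frame] or [channel, frame]\nnp.save('\u002Fpath\u002Fto\u002Fyour\u002Finput_mel\u002Fsample.npy', mel.squeeze(0).cpu().numpy())\n```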
\n\n## Using Custom CUDA Kernel for Synthesis\n\nYou can apply the fast CUDA inference kernel by passing the parameter `use_cuda_kernel` when instantiating BigVGAN:\n\n```python\ngenerator = BigVGAN(h, use_cuda_kernel=True)\n```\n\nYou can also pass `--use_cuda_kernel` to `inference.py` and `inference_e2e.py` to enable this feature.\n\nWhen applied for the first time, it builds the kernel using `nvcc` and `ninja`. If the build succeeds, the kernel is saved to `alias_free_activation\u002Fcuda\u002Fbuild` and the model automatically loads it. The codebase has been tested using CUDA `12.1`.\n\nPlease make sure that both are installed on your system and that the `nvcc` version matches the CUDA version your PyTorch build is using.\n\nWe recommend running `test_cuda_vs_torch_model.py` first to build and check the correctness of the CUDA kernel. See the example command below and its output, which should return `[Success] test CUDA fused vs. plain torch BigVGAN inference`:\n\n```shell\npython tests\u002Ftest_cuda_vs_torch_model.py \\\n--checkpoint_file \u002Fpath\u002Fto\u002Fyour\u002Fbigvgan_generator.pt\n```\n\n```shell\nloading plain Pytorch BigVGAN\n...\nloading CUDA kernel BigVGAN with auto-build\nDetected CUDA files, patching ldflags\nEmitting ninja build file \u002Fpath\u002Fto\u002Fyour\u002FBigVGAN\u002Falias_free_activation\u002Fcuda\u002Fbuild\u002Fbuild.ninja..\nBuilding extension module anti_alias_activation_cuda...\n...\nLoading extension module anti_alias_activation_cuda...\n...\nLoading '\u002Fpath\u002Fto\u002Fyour\u002Fbigvgan_generator.pt'\n...\n[Success] test CUDA fused vs. plain torch BigVGAN inference\n > mean_difference=0.0007238413265440613\n...\n```\n\nIf you see `[Fail] test CUDA fused vs. plain torch BigVGAN inference`, it means that the CUDA kernel inference is incorrect. Please check whether the `nvcc` installed on your system is compatible with your PyTorch version.\n\n## Pretrained Models\n\nWe provide the [pretrained models on Hugging Face Collections](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fnvidia\u002Fbigvgan-66959df3d97fd7d98d97dc9a).\nYou can download the generator weights (named `bigvgan_generator.pt`) and the discriminator\u002Foptimizer states (named `bigvgan_discriminator_optimizer.pt`) from the listed model repositories.\n\n| Model Name                                                                                               | Sampling Rate | Mel band | fmax  | Upsampling Ratio | Params | Dataset                    | Steps | Fine-Tuned |\n|:--------------------------------------------------------------------------------------------------------:|:-------------:|:--------:|:-----:|:----------------:|:------:|:--------------------------:|:-----:|:----------:|\n| [bigvgan_v2_44khz_128band_512x](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_v2_44khz_128band_512x)             | 44 kHz        | 128      | 22050 | 512              | 122M   | Large-scale Compilation    | 5M    | No         |\n| [bigvgan_v2_44khz_128band_256x](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_v2_44khz_128band_256x)             | 44 kHz        | 128      | 22050 | 256              | 112M   | Large-scale Compilation    | 5M    | No         |\n| [bigvgan_v2_24khz_100band_256x](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_v2_24khz_100band_256x)             | 24 kHz        | 100      | 12000 | 256              | 112M   | Large-scale Compilation    | 5M    | No         |\n| [bigvgan_v2_22khz_80band_256x](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_v2_22khz_80band_256x)               | 22 kHz        | 80       | 11025 | 256              | 112M   | Large-scale Compilation    | 5M    | No         |\n| [bigvgan_v2_22khz_80band_fmax8k_256x](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_v2_22khz_80band_fmax8k_256x) | 22 kHz        | 80       | 8000  | 256              | 112M   | Large-scale Compilation    | 5M    | No         |\n| [bigvgan_24khz_100band](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_24khz_100band)                             | 24 kHz        | 100      | 12000 | 256              | 112M   | LibriTTS                   | 5M    | No         |\n| [bigvgan_base_24khz_100band](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_base_24khz_100band)                   | 24 kHz        | 100      | 12000 | 256              | 14M    | LibriTTS                   | 5M    | No         |\n| [bigvgan_22khz_80band](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_22khz_80band)                               | 22 kHz        | 80       | 8000  | 256              | 112M   | LibriTTS + VCTK + LJSpeech | 5M    | No         |\n| [bigvgan_base_22khz_80band](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_base_22khz_80band)                     | 22 kHz        | 80       | 8000  | 256              | 14M    | LibriTTS + VCTK + LJSpeech | 5M    | No         |\n\nThe paper results are based on the original 24kHz BigVGAN models (`bigvgan_24khz_100band` and `bigvgan_base_24khz_100band`) trained on the LibriTTS dataset.\nWe also provide 22kHz BigVGAN models with a band-limited setup (i.e., fmax=8000) for TTS applications.\nNote that the checkpoints use the `snakebeta` activation with log-scale parameterization, which gives the best overall quality.\n\nYou can fine-tune the models by:\n\n1. downloading the checkpoints (both the generator weight and its discriminator\u002Foptimizer states)\n2. resuming training using your audio dataset by specifying a `--checkpoint_path` that includes the checkpoints when launching `train.py` (see the sketch below)
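\n\nFor illustration, a minimal sketch of step 2 (all `my_dataset` paths and filelist names are placeholders for your own data, laid out like the LibriTTS example above; the downloaded `bigvgan_generator.pt` and `bigvgan_discriminator_optimizer.pt` are assumed to already sit in `exp\u002Fmy_finetune`):\n\n```shell\n# resume training from the downloaded checkpoints placed in exp\u002Fmy_finetune\npython train.py \\\n--config configs\u002Fbigvgan_v2_24khz_100band_256x.json \\\n--input_wavs_dir filelists\u002Fmy_dataset \\\n--input_training_file filelists\u002Fmy_dataset\u002Ftrain.txt \\\n--input_validation_file filelists\u002Fmy_dataset\u002Fval.txt \\\n--list_input_unseen_wavs_dir filelists\u002Fmy_dataset \\\n--list_input_unseen_validation_file filelists\u002Fmy_dataset\u002Fval.txt \\\n--checkpoint_path exp\u002Fmy_finetune\n```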
\n\n## Training Details of BigVGAN-v2\n\nCompared to the original BigVGAN, the pretrained checkpoints of BigVGAN-v2 use `batch_size=32` with a longer `segment_size=65536` and are trained on 8 A100 GPUs.\n\nNote that the BigVGAN-v2 `json` config files in `.\u002Fconfigs` use `batch_size=4` by default to fit training on a single A100 GPU. You can adjust `batch_size` to match your GPUs when fine-tuning the models.\n\nWhen training BigVGAN-v2 from scratch with a small batch size, training can encounter the early divergence problem mentioned in the paper. In such cases, we recommend lowering the `clip_grad_norm` value (e.g. to `100`) for the early training iterations (e.g. the first 20000 steps) and then increasing it back to the default `500`.\n\n## Evaluation Results of BigVGAN-v2\n\nBelow are the objective results of the 24kHz model (`bigvgan_v2_24khz_100band_256x`) obtained on the LibriTTS `dev` sets. BigVGAN-v2 shows noticeable improvements across the metrics. The model also exhibits reduced perceptual artifacts, especially for non-speech audio.\n\n| Model      | Dataset                 | Steps | PESQ(↑)   | M-STFT(↓)  | MCD(↓)     | Periodicity(↓) | V\u002FUV F1(↑) |\n|:----------:|:-----------------------:|:-----:|:---------:|:----------:|:----------:|:--------------:|:----------:|\n| BigVGAN    | LibriTTS                | 1M    | 4.027     | 0.7997     | 0.3745     | 0.1018         | 0.9598     |\n| BigVGAN    | LibriTTS                | 5M    | 4.256     | 0.7409     | 0.2988     | 0.0809         | 0.9698     |\n| BigVGAN-v2 | Large-scale Compilation | 3M    | 4.359     | 0.7134     | 0.3060     | 0.0621         | 0.9777     |\n| BigVGAN-v2 | Large-scale Compilation | 5M    | **4.362** | **0.7026** | **0.2903** | **0.0593**     | **0.9793** |\n\n## Speed Benchmark\n\nBelow are the speed and VRAM usage benchmark results of BigVGAN from `tests\u002Ftest_cuda_vs_torch_model.py`, using `bigvgan_v2_24khz_100band_256x` as a reference model.\n\n| GPU                        | num_mel_frame | use_cuda_kernel | Speed (kHz) | Real-time Factor | VRAM (GB) |\n|:--------------------------:|:-------------:|:---------------:|:-----------:|:----------------:|:---------:|\n| NVIDIA A100                | 256           | False           | 1672.1      | 69.7x            | 1.3       |\n|                            |               | True            | 3916.5      | 163.2x           | 1.3       |\n|                            | 2048          | False           | 1899.6      | 79.2x            | 1.7       |\n|                            |               | True            | 5330.1      | 222.1x           | 1.7       |\n|                            | 16384         | False           | 1973.8      | 82.2x            | 5.0       |\n|                            |               | True            | 5761.7      | 240.1x           | 4.4       |\n| NVIDIA GeForce RTX 3080    | 256           | False           | 841.1       | 35.0x            | 1.3       |\n|                            |               | True            | 1598.1      | 66.6x            | 1.3       |\n|                            | 2048          | False           | 929.9       | 38.7x            | 1.7       |\n|                            |               | True            | 1971.3      | 82.1x            | 1.6       |\n|         
                   | 16384         | False           | 943.4       | 39.3x            | 5.0       |\n|                            |               | True            | 2026.5      | 84.4x            | 3.9       |\n| NVIDIA GeForce RTX 2080 Ti | 256           | False           | 515.6       | 21.5x            | 1.3       |\n|                            |               | True            | 811.3       | 33.8x            | 1.3       |\n|                            | 2048          | False           | 576.5       | 24.0x            | 1.7       |\n|                            |               | True            | 1023.0      | 42.6x            | 1.5       |\n|                            | 16384         | False           | 589.4       | 24.6x            | 5.0       |\n|                            |               | True            | 1068.1      | 44.5x            | 3.2       |\n\n## Acknowledgements\n\nWe thank Vijay Anand Korthikanti and Kevin J. Shih for their generous support in implementing the CUDA kernel for inference.\n\n## References\n\n- [HiFi-GAN](https:\u002F\u002Fgithub.com\u002Fjik876\u002Fhifi-gan) (for generator and multi-period discriminator)\n- [Snake](https:\u002F\u002Fgithub.com\u002FEdwardDixon\u002Fsnake) (for periodic activation)\n- [Alias-free-torch](https:\u002F\u002Fgithub.com\u002Fjunjun3518\u002Falias-free-torch) (for anti-aliasing)\n- [Julius](https:\u002F\u002Fgithub.com\u002Fadefossez\u002Fjulius) (for low-pass filter)\n- [UnivNet](https:\u002F\u002Fgithub.com\u002Fmindslab-ai\u002Funivnet) (for multi-resolution discriminator)\n- [descript-audio-codec](https:\u002F\u002Fgithub.com\u002Fdescriptinc\u002Fdescript-audio-codec) and [vocos](https:\u002F\u002Fgithub.com\u002Fgemelo-ai\u002Fvocos) (for multi-band multi-scale STFT discriminator and multi-scale mel spectrogram loss)\n- [Amphion](https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002FAmphion) (for multi-scale sub-band CQT discriminator)\n","## BigVGAN：基于大规模训练的通用神经声码器\n\n#### 李尚吉、魏平、鲍里斯·金斯堡、布莱恩·卡坦扎罗、尹成浩\n\n[[论文]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.04658) - [[代码]](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN) - [[展示]](https:\u002F\u002Fbigvgan-demo.github.io\u002F) - [[项目页面]](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fadlr\u002Fprojects\u002Fbigvgan\u002F) - [[权重]](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fnvidia\u002Fbigvgan-66959df3d97fd7d98d97dc9a) - [[演示]](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fnvidia\u002FBigVGAN)\n\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fbigvgan-a-universal-neural-vocoder-with-large\u002Fspeech-synthesis-on-libritts)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fspeech-synthesis-on-libritts?p=bigvgan-a-universal-neural-vocoder-with-large)\n\n\u003Ccenter>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA_BigVGAN_readme_abd7f47167ef.png\" width=\"800\">\u003C\u002Fcenter>\n\n## 新闻\n- **2024年9月（v2.4）：**\n  - 我们更新了经过500万步训练的预训练检查点。这是BigVGAN-v2检查点的最终版本。\n\n- **2024年7月（v2.3）：**\n  - 进行了全面重构和代码改进，以提高可读性。\n  - 实现了抗混叠激活（上采样 + 激活 + 下采样）的完全融合CUDA核，并进行了推理速度基准测试。\n\n- **2024年7月（v2.2）：** 该仓库现在包含一个使用Gradio的交互式本地演示。\n\n- **2024年7月（v2.1）：** BigVGAN现已集成到Hugging Face Hub中，用户可以轻松访问预训练检查点进行推理。我们还在Hugging Face Spaces上提供了一个交互式演示。\n\n- **2024年7月（v2）：** 我们发布了BigVGAN-v2及其预训练检查点。以下是主要亮点：\n  - 自定义CUDA推理核：我们提供了一个用CUDA编写的融合上采样与激活核，以加速推理速度。测试显示，在单个A100 GPU上，速度提升了1.5至3倍。\n  - 
改进的判别器和损失函数：BigVGAN-v2使用了[多尺度子带CQT判别器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.14957)和[多尺度梅尔谱损失](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.06546)进行训练。\n  - 更大的训练数据集：BigVGAN-v2使用包含多种音频类型的大型数据集进行训练，其中包括多语言语音、环境音效和乐器音等。\n  - 我们提供了适用于不同音频配置的BigVGAN-v2预训练检查点，支持最高44 kHz采样率和512倍上采样比。\n\n## 安装\n\n该代码库已在Python `3.10` 和 PyTorch `2.3.1` 的conda环境中测试通过，使用的CUDA版本为`pytorch-cuda=12.1`或`pytorch-cuda=11.8`。以下是一个创建conda环境的示例命令：\n\n```shell\nconda create -n bigvgan python=3.10 pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia\nconda activate bigvgan\n```\n\n克隆仓库并安装依赖项：\n\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\ncd BigVGAN\npip install -r requirements.txt\n```\n\n## 使用Hugging Face Hub快速进行推理\n\n以下示例展示了如何使用BigVGAN：从Hugging Face Hub加载预训练的BigVGAN生成器，从输入波形计算梅尔频谱图，并将梅尔频谱图作为模型输入生成合成波形。\n\n```python\ndevice = 'cuda'\n\nimport torch\nimport bigvgan\nimport librosa\nfrom meldataset import get_mel_spectrogram\n\n# 实例化模型。您可以选择设置use_cuda_kernel=True以获得更快的推理速度。\nmodel = bigvgan.BigVGAN.from_pretrained('nvidia\u002Fbigvgan_v2_24khz_100band_256x', use_cuda_kernel=False)\n\n# 移除模型中的权重归一化，并将其设置为评估模式\nmodel.remove_weight_norm()\nmodel = model.eval().to(device)\n\n# 加载wav文件并计算梅尔频谱图\nwav_path = '\u002Fpath\u002Fto\u002Fyour\u002Faudio.wav'\nwav, sr = librosa.load(wav_path, sr=model.h.sampling_rate, mono=True) # wav是形状为[T_time]、值在[-1, 1]之间的np.ndarray\nwav = torch.FloatTensor(wav).unsqueeze(0) # wav是形状为[B(1), T_time]的FloatTensor\n\n# 从真实音频中计算梅尔频谱图\nmel = get_mel_spectrogram(wav, model.h).to(device) # mel是形状为[B(1), C_mel, T_frame]的FloatTensor\n\n# 从梅尔频谱图生成波形\nwith torch.inference_mode():\n    wav_gen = model(mel) # wav_gen是形状为[B(1), 1, T_time]、值在[-1, 1]之间的FloatTensor\nwav_gen_float = wav_gen.squeeze(0).cpu() # wav_gen是形状为[1, T_time]的FloatTensor\n\n# 您可以将生成的波形转换为16位线性PCM\nwav_gen_int16 = (wav_gen_float * 32767.0).numpy().astype('int16') # wav_gen现在是形状为[1, T_time]、数据类型为int16的np.ndarray\n```\n\n## 本地Gradio演示 \u003Ca href='https:\u002F\u002Fgithub.com\u002Fgradio-app\u002Fgradio'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgradio-app\u002Fgradio'>\u003C\u002Fa>\n\n您可以通过以下命令运行本地Gradio演示：\n\n```shell\npip install -r demo\u002Frequirements.txt\npython demo\u002Fapp.py\n```\n\n## 训练\n\n为数据集根目录创建符号链接。代码库使用相对于数据集路径的文件列表。以下是LibriTTS数据集的示例命令：\n\n```shell\ncd filelists\u002FLibriTTS && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Ftrain-clean-100 train-clean-100 && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Ftrain-clean-360 train-clean-360 && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Ftrain-other-500 train-other-500 && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Fdev-clean dev-clean && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Fdev-other dev-other && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Ftest-clean test-clean && \\\nln -s \u002Fpath\u002Fto\u002Fyour\u002FLibriTTS\u002Ftest-other test-other && \\\ncd ..\u002F..\n```\n\n训练BigVGAN模型。以下是使用LibriTTS数据集在24kHz下训练BigVGAN-v2的示例命令，输入为完整的100频带梅尔频谱图：\n\n```shell\npython train.py \\\n--config configs\u002Fbigvgan_v2_24khz_100band_256x.json \\\n--input_wavs_dir filelists\u002FLibriTTS \\\n--input_training_file filelists\u002FLibriTTS\u002Ftrain-full.txt \\\n--input_validation_file filelists\u002FLibriTTS\u002Fval-full.txt \\\n--list_input_unseen_wavs_dir filelists\u002FLibriTTS filelists\u002FLibriTTS \\\n--list_input_unseen_validation_file filelists\u002FLibriTTS\u002Fdev-clean.txt filelists\u002FLibriTTS\u002Fdev-other.txt 
\\\n--checkpoint_path exp\u002Fbigvgan_v2_24khz_100band_256x\n```\n\n## 合成\n\n使用 BigVGAN 模型进行合成。以下是使用该模型生成音频的示例命令。\n它会从 `--input_wavs_dir` 中的 WAV 文件计算梅尔谱，并将生成的音频保存到 `--output_dir`。\n\n```shell\npython inference.py \\\n--checkpoint_file \u002Fpath\u002Fto\u002Fyour\u002Fbigvgan_v2_24khz_100band_256x\u002Fbigvgan_generator.pt \\\n--input_wavs_dir \u002Fpath\u002Fto\u002Fyour\u002Finput_wav \\\n--output_dir \u002Fpath\u002Fto\u002Fyour\u002Foutput_wav\n```\n\n`inference_e2e.py` 支持直接从以 `.npy` 格式保存的梅尔谱进行合成，其形状为 `[1, channel, frame]` 或 `[channel, frame]`。\n它会从 `--input_mels_dir` 加载梅尔谱，并将生成的音频保存到 `--output_dir`。\n\n请确保用于生成梅尔谱的 STFT 超参数与模型一致，这些超参数在对应模型的 `config.json` 文件中定义。\n\n```shell\npython inference_e2e.py \\\n--checkpoint_file \u002Fpath\u002Fto\u002Fyour\u002Fbigvgan_v2_24khz_100band_256x\u002Fbigvgan_generator.pt \\\n--input_mels_dir \u002Fpath\u002Fto\u002Fyour\u002Finput_mel \\\n--output_dir \u002Fpath\u002Fto\u002Fyour\u002Foutput_wav\n```\n\n## 使用自定义 CUDA 内核进行合成\n\n您可以在实例化 BigVGAN 时通过参数 `use_cuda_kernel` 来应用快速的 CUDA 推理内核：\n\n```python\ngenerator = BigVGAN(h, use_cuda_kernel=True)\n```\n\n您也可以在 `inference.py` 和 `inference_e2e.py` 中传递 `--use_cuda_kernel` 来启用此功能。\n\n首次应用时，它会使用 `nvcc` 和 `ninja` 构建内核。如果构建成功，内核会被保存到 `alias_free_activation\u002Fcuda\u002Fbuild` 目录下，模型会自动加载该内核。该代码库已在 CUDA `12.1` 上进行了测试。\n\n请确保您的系统中已安装这两者，并且系统中的 `nvcc` 版本与您 PyTorch 构建所使用的 CUDA 版本一致。\n\n我们建议先运行 `test_cuda_vs_torch_model.py` 来构建并检查 CUDA 内核的正确性。以下是示例命令及其输出，其中返回了 `[Success] test CUDA fused vs. plain torch BigVGAN inference`：\n\n```shell\npython tests\u002Ftest_cuda_vs_torch_model.py \\\n--checkpoint_file \u002Fpath\u002Fto\u002Fyour\u002Fbigvgan_generator.pt\n```\n\n```shell\nloading plain Pytorch BigVGAN\n...\nloading CUDA kernel BigVGAN with auto-build\nDetected CUDA files, patching ldflags\nEmitting ninja build file \u002Fpath\u002Fto\u002Fyour\u002FBigVGAN\u002Falias_free_activation\u002Fcuda\u002Fbuild\u002Fbuild.ninja..\nBuilding extension module anti_alias_activation_cuda...\n...\nLoading extension module anti_alias_activation_cuda...\n...\nLoading '\u002Fpath\u002Fto\u002Fyour\u002Fbigvgan_generator.pt'\n...\n[Success] test CUDA fused vs. plain torch BigVGAN inference\n > mean_difference=0.0007238413265440613\n...\n```\n\n如果您看到 `[Fail] test CUDA fused vs. 
plain torch BigVGAN inference`，则表示 CUDA 内核推理不正确。请检查您系统中安装的 `nvcc` 是否与您的 PyTorch 版本兼容。\n\n## 预训练模型\n\n我们在 Hugging Face Collections 上提供了预训练模型：[https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fnvidia\u002Fbigvgan-66959df3d97fd7d98d97dc9a](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fnvidia\u002Fbigvgan-66959df3d97fd7d98d97dc9a)。\n用户可以从列出的模型仓库中下载生成器权重（名为 `bigvgan_generator.pt`）以及判别器和优化器的状态（名为 `bigvgan_discriminator_optimizer.pt`）。\n\n| 模型名称                                                                                               | 采样率 | 梅尔带数 | fmax  | 上采样倍数 | 参数量 | 数据集                    | 步数 | 微调 |\n|:--------------------------------------------------------------------------------------------------------:|:-------------:|:--------:|:-----:|:----------------:|:------:|:--------------------------:|:-----:|:----------:|\n| [bigvgan_v2_44khz_128band_512x](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_v2_44khz_128band_512x)             | 44 kHz        | 128      | 22050 | 512              | 122M   | 大规模语料库    | 5M    | 否         |\n| [bigvgan_v2_44khz_128band_256x](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_v2_44khz_128band_256x)             | 44 kHz        | 128      | 22050 | 256              | 112M   | 大规模语料库    | 5M    | 否         |\n| [bigvgan_v2_24khz_100band_256x](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_v2_24khz_100band_256x)             | 24 kHz        | 100      | 12000 | 256              | 112M   | 大规模语料库    | 5M    | 否         |\n| [bigvgan_v2_22khz_80band_256x](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_v2_22khz_80band_256x)               | 22 kHz        | 80       | 11025 | 256              | 112M   | 大规模语料库    | 5M    | 否         |\n| [bigvgan_v2_22khz_80band_fmax8k_256x](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_v2_22khz_80band_fmax8k_256x) | 22 kHz        | 80       | 8000  | 256              | 112M   | 大规模语料库    | 5M    | 否         |\n| [bigvgan_24khz_100band](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_24khz_100band)                             | 24 kHz        | 100      | 12000 | 256              | 112M   | LibriTTS                   | 5M    | 否         |\n| [bigvgan_base_24khz_100band](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_base_24khz_100band)                   | 24 kHz        | 100      | 12000 | 256              | 14M    | LibriTTS                   | 5M    | 否         |\n| [bigvgan_22khz_80band](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_22khz_80band)                               | 22 kHz        | 80       | 8000  | 256              | 112M   | LibriTTS + VCTK + LJSpeech | 5M    | 否         |\n| [bigvgan_base_22khz_80band](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002Fbigvgan_base_22khz_80band)                     | 22 kHz        | 80       | 8000  | 256              | 14M    | LibriTTS + VCTK + LJSpeech | 5M    | 否         |\n\n论文中的结果基于原始的 24kHz BigVGAN 模型（`bigvgan_24khz_100band` 和 `bigvgan_base_24khz_100band`），这些模型是在 LibriTTS 数据集上训练的。\n我们还提供了 22kHz 的 BigVGAN 模型，采用带限设置（即 fmax=8000），适用于 TTS 应用。\n请注意，这些检查点使用带有对数尺度参数化的 `snakebeta` 激活函数，这种激活函数具有最佳的整体质量。\n\n您可以通过以下步骤对模型进行微调：\n\n1. 下载检查点（包括生成器权重及其判别器\u002F优化器状态）\n2. 
使用您的音频数据集继续训练，在启动 `train.py` 时指定包含检查点的 `--checkpoint_path` 参数\n\n## BigVGAN-v2 训练细节\n\n与原始的 BigVGAN 相比，BigVGAN-v2 的预训练检查点使用了 `batch_size=32` 和更长的 `segment_size=65536`，并在 8 张 A100 GPU 上进行训练。\n\n需要注意的是，`.\u002Fconfigs` 中的 BigVGAN-v2 `json` 配置文件默认使用 `batch_size=4`，以便在单张 A100 GPU 上进行训练。您可以根据自己的 GPU 资源调整 `batch_size` 来微调模型。\n\n当使用较小的批量大小从头开始训练 BigVGAN-v2 时，可能会遇到论文中提到的早期发散问题。在这种情况下，我们建议在训练的早期阶段（例如前 20000 步）降低 `clip_grad_norm` 值（例如设置为 100），然后将其恢复到默认值 500。\n\n## BigVGAN-v2 评估结果\n\n以下是基于 LibriTTS `dev` 数据集得到的 24kHz 模型 (`bigvgan_v2_24khz_100band_256x`) 的客观评估结果。BigVGAN-v2 在各项指标上均有显著提升。此外，该模型还减少了听感上的伪影，尤其是在非语音音频方面表现更为突出。\n\n| 模型      | 数据集                 | 步数 | PESQ(↑)   | M-STFT(↓)  | MCD(↓)     | Periodicity(↓) | V\u002FUV F1(↑) |\n|:----------:|:-----------------------:|:-----:|:---------:|:----------:|:----------:|:--------------:|:----------:|\n| BigVGAN    | LibriTTS                | 1M    | 4.027     | 0.7997     | 0.3745     | 0.1018         | 0.9598     |\n| BigVGAN    | LibriTTS                | 5M    | 4.256     | 0.7409     | 0.2988     | 0.0809         | 0.9698     |\n| BigVGAN-v2 | Large-scale Compilation | 3M    | 4.359     | 0.7134     | 0.3060     | 0.0621         | 0.9777     |\n| BigVGAN-v2 | Large-scale Compilation | 5M    | **4.362** | **0.7026** | **0.2903** | **0.0593**     | **0.9793** |\n\n## 速度基准测试\n\n以下是来自 `tests\u002Ftest_cuda_vs_torch_model.py` 的 BigVGAN 速度和显存占用基准测试结果，以 `bigvgan_v2_24khz_100band_256x` 作为参考模型。\n\n| GPU                        | num_mel_frame | use_cuda_kernel | 速度 (kHz) | 实时因子 | 显存 (GB) |\n|:--------------------------:|:-------------:|:---------------:|:-----------:|:----------------:|:---------:|\n| NVIDIA A100                | 256           | False           | 1672.1      | 69.7x            | 1.3       |\n|                            |               | True            | 3916.5      | 163.2x           | 1.3       |\n|                            | 2048          | False           | 1899.6      | 79.2x            | 1.7       |\n|                            |               | True            | 5330.1      | 222.1x           | 1.7       |\n|                            | 16384         | False           | 1973.8      | 82.2x            | 5.0       |\n|                            |               | True            | 5761.7      | 240.1x           | 4.4       |\n| NVIDIA GeForce RTX 3080    | 256           | False           | 841.1       | 35.0x            | 1.3       |\n|                            |               | True            | 1598.1      | 66.6x            | 1.3       |\n|                            | 2048          | False           | 929.9       | 38.7x            | 1.7       |\n|                            |               | True            | 1971.3      | 82.1x            | 1.6       |\n|                            | 16384         | False           | 943.4       | 39.3x            | 5.0       |\n|                            |               | True            | 2026.5      | 84.4x            | 3.9       |\n| NVIDIA GeForce RTX 2080 Ti | 256           | False           | 515.6       | 21.5x            | 1.3       |\n|                            |               | True            | 811.3       | 33.8x            | 1.3       |\n|                            | 2048          | False           | 576.5       | 24.0x            | 1.7       |\n|                            |               | True            | 1023.0      | 42.6x            | 1.5       |\n|                            | 16384         | False           | 589.4       | 24.6x            | 5.0       |\n|        
                    |               | True            | 1068.1      | 44.5x            | 3.2       |\n\n## 致谢\n\n我们感谢 Vijay Anand Korthikanti 和 Kevin J. Shih 在实现推理用 CUDA 核心方面的慷慨支持。\n\n## 参考文献\n\n- [HiFi-GAN](https:\u002F\u002Fgithub.com\u002Fjik876\u002Fhifi-gan)（用于生成器和多周期判别器）\n- [Snake](https:\u002F\u002Fgithub.com\u002FEdwardDixon\u002Fsnake)（用于周期性激活）\n- [Alias-free-torch](https:\u002F\u002Fgithub.com\u002Fjunjun3518\u002Falias-free-torch)（用于抗混叠处理）\n- [Julius](https:\u002F\u002Fgithub.com\u002Fadefossez\u002Fjulius)（用于低通滤波器）\n- [UnivNet](https:\u002F\u002Fgithub.com\u002Fmindslab-ai\u002Funivnet)（用于多分辨率判别器）\n- [descript-audio-codec](https:\u002F\u002Fgithub.com\u002Fdescriptinc\u002Fdescript-audio-codec) 和 [vocos](https:\u002F\u002Fgithub.com\u002Fgemelo-ai\u002Fvocos)（用于多频段多尺度 STFT 判别器和多尺度梅尔谱损失）\n- [Amphion](https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002FAmphion)（用于多尺度子带 CQT 判别器）","# BigVGAN 快速上手指南\n\nBigVGAN 是一款通用的神经声码器，通过大规模训练实现了高质量的语音合成。本指南将帮助你快速在本地环境中部署并使用预训练模型进行推理。\n\n## 环境准备\n\n在开始之前，请确保你的系统满足以下要求：\n- **操作系统**: Linux (推荐) 或 Windows (需配置 CUDA 环境)\n- **Python**: 3.10\n- **PyTorch**: 2.3.1\n- **CUDA**: 11.8 或 12.1 (需与 PyTorch 版本匹配)\n- **硬件**: 推荐使用 NVIDIA GPU (如 A100, V100, RTX 系列) 以加速推理，特别是使用自定义 CUDA 内核时。\n\n## 安装步骤\n\n### 1. 创建 Conda 环境\n建议使用 `conda` 创建隔离环境。以下命令基于 PyTorch 官方源和 NVIDIA 源安装依赖（国内用户可替换为清华或中科大镜像源加速）：\n\n```shell\n# 使用国内镜像源加速安装 (可选)\nconda config --add channels https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fanaconda\u002Fpkgs\u002Fmain\u002F\nconda config --add channels https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fanaconda\u002Fpkgs\u002Ffree\u002F\nconda config --add channels https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fanaconda\u002Fcloud\u002Fconda-forge\u002F\nconda config --add channels https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fanaconda\u002Fcloud\u002Fpytorch\u002F\n\n# 创建环境 (根据实际 CUDA 版本选择 pytorch-cuda=12.1 或 11.8)\nconda create -n bigvgan python=3.10 pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia\nconda activate bigvgan\n```\n\n### 2. 克隆代码并安装依赖\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\ncd BigVGAN\npip install -r requirements.txt\n```\n\n## 基本使用\n\n以下是使用 Hugging Face Hub 加载预训练模型并进行音频生成的最简示例。该流程包括：加载模型、读取音频、提取梅尔频谱图、生成波形。\n\n```python\ndevice = 'cuda'\n\nimport torch\nimport bigvgan\nimport librosa\nfrom meldataset import get_mel_spectrogram\n\n# 1. 实例化模型\n# 从 Hugging Face 加载预训练权重\n# use_cuda_kernel=True 可启用自定义 CUDA 内核加速推理 (首次运行会自动编译)\nmodel = bigvgan.BigVGAN.from_pretrained('nvidia\u002Fbigvgan_v2_24khz_100band_256x', use_cuda_kernel=False)\n\n# 移除权重归一化并设置为评估模式\nmodel.remove_weight_norm()\nmodel = model.eval().to(device)\n\n# 2. 加载音频文件并计算梅尔频谱图\nwav_path = '\u002Fpath\u002Fto\u002Fyour\u002Faudio.wav'  # 替换为你的音频路径\n# 加载音频，采样率需与模型一致\nwav, sr = librosa.load(wav_path, sr=model.h.sampling_rate, mono=True) \nwav = torch.FloatTensor(wav).unsqueeze(0) # 形状: [B(1), T_time]\n\n# 从原始音频计算梅尔频谱图作为模型输入\nmel = get_mel_spectrogram(wav, model.h).to(device) # 形状: [B(1), C_mel, T_frame]\n\n# 3. 生成波形\nwith torch.inference_mode():\n    wav_gen = model(mel) # 输出形状: [B(1), 1, T_time], 值域 [-1, 1]\n\n# 4. 
后处理与保存\nwav_gen_float = wav_gen.squeeze(0).cpu()\n\n# 转换为 16-bit PCM 格式以便保存或播放\nwav_gen_int16 = (wav_gen_float * 32767.0).numpy().astype('int16')\n\n# (可选) 使用 soundfile 保存结果（wav_gen_int16 形状为 [1, T_time]，需取出单通道一维数组）\n# import soundfile as sf\n# sf.write('output.wav', wav_gen_int16[0], model.h.sampling_rate)\n```\n\n### 进阶提示：启用 CUDA 加速\n若需获得 1.5 - 3 倍的推理速度提升，可在实例化模型时设置 `use_cuda_kernel=True`。\n**注意**：首次运行时系统会自动调用 `nvcc` 和 `ninja` 编译 CUDA 内核，请确保已安装且版本与 PyTorch 兼容。\n\n```python\n# 启用加速内核\nmodel = bigvgan.BigVGAN.from_pretrained('nvidia\u002Fbigvgan_v2_24khz_100band_256x', use_cuda_kernel=True)\n```","某多语言有声书制作团队正在构建自动化配音流水线，需要将文本转语音模型生成的梅尔频谱图转换为高保真音频，以支持全球不同语种听众的收听需求。\n\n### 没有 BigVGAN 时\n- **音质失真严重**：在处理复杂发音或非英语语种时，传统声码器生成的音频常带有明显的金属音或背景嘶嘶声，听感生硬不自然。\n- **泛化能力不足**：为中文训练的模型难以直接应用于法语或日语场景，团队不得不针对每种语言单独训练和维护多个专用声码器，运维成本极高。\n- **细节丢失明显**：乐器伴奏或环境音效等高频细节在合成过程中被过度平滑，导致最终成品缺乏真实录音的空间感和丰富度。\n- **推理速度瓶颈**：为了保证基本音质，往往需要牺牲推理速度，难以满足实时互动或大规模批量生成的时效要求。\n\n### 使用 BigVGAN 后\n- **音质显著提升**：得益于大规模多样化数据训练和抗混叠激活函数，BigVGAN 生成的音频消除了金属伪影，人声饱满且背景干净，接近真实录音水平。\n- **实现“万能”通用**：单个 BigVGAN 模型即可完美支持多种语言、乐器及环境音的合成，团队无需再维护多套模型，大幅简化了技术架构。\n- **高频细节还原**：即使是复杂的背景音乐或细微的呼吸声，BigVGAN 也能精准还原，使得有声书的沉浸感和情感表达力大幅增强。\n- **加速推理流程**：通过集成自定义 CUDA 融合内核，BigVGAN 在单张 A100 GPU 上的推理速度提升了 1.5 至 3 倍，显著缩短了整体制作周期。\n\nBigVGAN 通过其强大的通用性和高保真合成能力，将多语言音频生产从“勉强可用”提升至“广播级”标准，同时大幅降低了算力与运维成本。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA_BigVGAN_aef0006d.png","NVIDIA","NVIDIA Corporation","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FNVIDIA_7dcf6000.png","",null,"https:\u002F\u002Fnvidia.com","https:\u002F\u002Fgithub.com\u002FNVIDIA",[83,87,91,95],{"name":84,"color":85,"percentage":86},"Python","#3572A5",87.9,{"name":88,"color":89,"percentage":90},"Cuda","#3A4E3A",6.9,{"name":92,"color":93,"percentage":94},"C","#555555",4.5,{"name":96,"color":97,"percentage":98},"C++","#f34b7d",0.7,1206,145,"2026-04-15T13:37:11","MIT","Linux","必需 NVIDIA GPU。推理建议使用 A100（测试显示加速效果），支持 CUDA 11.8 或 12.1。若使用自定义 CUDA 加速内核，需安装 nvcc 且版本需与 PyTorch 构建版本一致。","未说明",{"notes":107,"python":108,"dependencies":109},"1. 官方仅在 Linux 环境下使用 conda 进行了测试（PyTorch 2.3.1 + CUDA 11.8\u002F12.1）。2. 若启用自定义 CUDA 内核进行加速推理，系统必须安装 nvcc 和 ninja，首次运行会自动编译内核。3. 提供多种预训练模型，最高支持 44kHz 采样率和 512 倍上采样率，参数量高达 1.22 亿。4. 
训练时需手动创建数据集的符号链接。","3.10",[110,111,112,113,114,115],"torch>=2.3.1","torchaudio","torchvision","librosa","ninja","gradio",[21],[118,119,120,121,122,123],"audio-synthesis","speech-synthesis","music-synthesis","neural-vocoder","audio-generation","singing-voice-synthesis","2026-03-27T02:49:30.150509","2026-04-17T08:24:24.218089",[],[128,133,138,143,148,153],{"id":129,"version":130,"summary_zh":131,"released_at":132},289371,"v2.4","* 使用500万步训练更新检查点。这是BigVGAN-v2预训练检查点的最终版本。\n* 更稳健的 Mel 数据集加载，支持异构波形文件格式，并在训练过程中进行实时重采样：[PR#8](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\u002Fpull\u002F8)\n* 错误修复：[PR#7](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\u002Fpull\u002F7) [PR#9](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\u002Fpull\u002F9)","2024-09-05T03:50:02",{"id":134,"version":135,"summary_zh":136,"released_at":137},289372,"v2.3","* 通用重构及代码优化，提升可读性：[PR#4](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\u002Fpull\u002F4)  [PR#5](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\u002Fpull\u002F5)\n* 全融合的抗混叠激活 CUDA 内核（上采样 + 激活 + 下采样），附推理速度基准测试：[PR#6](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\u002Fpull\u002F6)","2024-07-22T12:36:58",{"id":139,"version":140,"summary_zh":141,"released_at":142},289373,"v2.2","* 集成本地 Gradio 示例：[PR#2](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\u002Fpull\u002F2)","2024-07-17T01:56:41",{"id":144,"version":145,"summary_zh":146,"released_at":147},289374,"v2.1","* 集成 Hugging Face Hub：[PR#1](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\u002Fpull\u002F1)  \n* 将 models.py 重构为 bigvgan.py、discriminators.py 和 loss.py：[PR#1](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\u002Fpull\u002F1)","2024-07-16T10:17:53",{"id":149,"version":150,"summary_zh":151,"released_at":152},289375,"v2","[提交](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN\u002Fcommit\u002F2d448238a14f14ef1b5079be00646778604924da)\n* 针对推理的自定义 CUDA 内核\n* 改进的判别器和损失函数\n* 使用更多样化数据训练的新检查点","2024-07-10T06:01:24",{"id":154,"version":155,"summary_zh":156,"released_at":157},289376,"v1","BigVGAN 的原始 v1 版本","2024-07-09T02:12:06"]