ai-audio-datasets

GitHub · MIT License

ai-audio-datasets is an open-source index of audio data resources for AI, collecting high-quality datasets covering speech, music, and sound effects. It addresses a core pain point for AI developers training generative (AIGC) models and building intelligent audio tools: high-quality multimodal training data is scarce and scattered.

Beyond classic Mandarin speech recognition and synthesis data such as the AISHELL series, the project also indexes large-scale audio-visual aligned datasets like AVSpeech and instruction-tuning data such as Audio-FLAN for cross-domain understanding and generation. Its distinctive strength is combining diversity with specialization: it spans multilingual, multi-speaker speech corpora as well as annotated data from specialist domains such as Carnatic music, supporting tasks from basic recognition all the way to complex zero-shot generation.

ai-audio-datasets is well suited to AI researchers, algorithm engineers, and audio application developers. Whether you need to train a high-accuracy speech recognition system or build an application that understands and composes music, you can find a solid data foundation here to accelerate model iteration and deployment.

Use Case

A startup team is building a multi-dialect intelligent customer-service voice assistant and urgently needs high-quality, annotated Mandarin speech data to train its automatic speech recognition (ASR) and text-to-speech (TTS) models.

Without ai-audio-datasets

  • Data collection is slow: developers must dig through GitHub, academic forums, and university lab websites by hand, hunting down resources such as AISHELL-1 or CN-Celeb one by one and spending weeks compiling links.
  • Formats are inconsistent: the datasets they find follow different annotation standards; some lack phoneme-level alignment, others lack speaker emotion labels, leaving a heavy preprocessing and cleaning burden.
  • Scenario coverage is poor: it is hard to assemble in one pass data covering Mandarin, dialects, and varied noise conditions (such as the complex backgrounds in Casual Conversations), so the model generalizes poorly.
  • Licensing and compliance risk is high: data of unclear provenance may lack a clear license, creating legal exposure for commercial deployment.

With ai-audio-datasets

  • One-stop acquisition: using the ai-audio-datasets index, the team quickly locates and downloads multi-speaker, high-fidelity corpora such as AISHELL-3 and instruction-tuning data such as Audio-FLAN, cutting data preparation from weeks to two days.
  • Standardized access: indexed datasets are curated with uniform metadata descriptions and clear usage notes (e.g., TTS-oriented vs. ASR-oriented), greatly reducing cleaning and alignment cost.
  • Broad scenario coverage: using the category structure, the team easily combines clean speech (AVSpeech), everyday conversation, and intent-specific (ATIS) data, markedly improving robustness in real, noisy environments.
  • Traceable provenance: every dataset ships with an official source link and clear project background, keeping training data license-compliant and removing worries at launch time.

By aggregating scattered, high-quality resources, ai-audio-datasets turns the most tedious data groundwork in audio AI development into a ready-to-use, standardized catalog, greatly accelerating model iteration.
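The index-driven workflow described above can be sketched as a small in-memory catalog filtered by task and language. The entries and field names below are illustrative, not the repository's actual schema:

```python
# A hypothetical slice of a dataset index. Fields and task tags are
# illustrative; consult each dataset's official page for authoritative
# metadata and licensing terms.
CATALOG = [
    {"name": "AISHELL-1",   "tasks": {"asr"},                 "languages": {"zh"}},
    {"name": "AISHELL-3",   "tasks": {"tts"},                 "languages": {"zh"}},
    {"name": "LibriSpeech", "tasks": {"asr"},                 "languages": {"en"}},
    {"name": "Audio-FLAN",  "tasks": {"asr", "tts", "music"}, "languages": {"en", "zh"}},
]

def find_datasets(task, language):
    """Return names of catalog entries matching a task and a language."""
    return [d["name"] for d in CATALOG
            if task in d["tasks"] and language in d["languages"]]

print(find_datasets("asr", "zh"))  # ['AISHELL-1', 'Audio-FLAN']
```

A real pipeline would add fields such as license, hours, and download URL, but even this shape captures why a single curated index beats weeks of manual link hunting.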

Runtime Requirements

GPU: not specified
Memory: not specified
Python: not specified
Dependencies: none

Note: this project (AI Audio Datasets) is a list and index of audio datasets, not an executable software tool or model codebase. Its README only lists the names, descriptions, and download links of various open speech, music, and sound-effect datasets (e.g., AISHELL, LibriSpeech, Common Voice). The project itself therefore has no specific operating system, GPU, memory, Python version, or dependency requirements. For runtime requirements, consult the documentation of the specific dataset you intend to use and its associated processing tools or models.
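Because each dataset ships in its own format, a common first step after downloading any of them is a sanity check on sample rate and duration. A minimal stdlib-only sketch, round-tripping a synthetic 8 kHz mono WAV (the format FSDD uses; the file name and tone parameters here are made up for illustration):

```python
import math
import struct
import wave

# Write one second of a 440 Hz tone as an 8 kHz, 16-bit mono WAV file.
# The file name is illustrative; FSDD recordings use this 8 kHz format.
RATE = 8000
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)    # mono
    w.setsampwidth(2)    # 16-bit samples
    w.setframerate(RATE)
    samples = (int(20000 * math.sin(2 * math.pi * 440 * n / RATE))
               for n in range(RATE))
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

# Sanity-check what was written: sample rate and duration in seconds.
with wave.open("tone.wav", "rb") as r:
    rate = r.getframerate()
    seconds = r.getnframes() / rate

print(rate, seconds)  # 8000 1.0
```

The same two checks (rate, duration) are typically the first thing to verify against whatever a given dataset's documentation promises.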

Quick Start

AI Audio Datasets (AI-ADS) 🎵

AI Audio Datasets (AI-ADS) 🎵 includes speech, music, and sound effects, which can provide training data for generative AI, AIGC, AI model training, intelligent audio tool development, and audio applications.

Contents

Project List

Speech

  • AISHELL-1 - AISHELL-1 is a corpus for Mandarin speech recognition research and system building.
  • AISHELL-3 - AISHELL-3 is a large-scale, high-fidelity multi-speaker Mandarin speech corpus published by Beijing Shell Shell Technology Co., Ltd., suitable for training multi-speaker text-to-speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin speakers, totaling 88,035 utterances.
  • Arabic Speech Corpus - The Arabic Speech Corpus (1.5 GB) is a Modern Standard Arabic (MSA) corpus for speech synthesis. It contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech, precisely aligned with the recordings. The annotations also include individual word-stress marks on the phonemes.
  • Audio-FLAN - Audio-FLAN aims at unified audio-language models that seamlessly handle both understanding and generation across speech, music, and sound. It is an instruction-tuning dataset for unified audio understanding and generation, covering 80 diverse tasks across the speech, music, and sound domains, with more than 100 million samples. Audio-FLAN lays the foundation for unified audio-language models that can handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner.
  • AudioMNIST - The dataset consists of 30,000 audio samples of spoken digits (0-9) from 60 different speakers.
  • AVSpeech - AVSpeech is a large-scale audio-visual dataset of speech clips with no interfering background signals. The segments vary in length between 3 and 10 seconds, and in each clip the only visible face and the only audible voice belong to a single speaker. In total, the dataset contains roughly 4,700 hours of video segments from approximately 150,000 distinct speakers, spanning a wide variety of people, languages, and face poses.
  • ATIS (Airline Travel Information Systems) - The ATIS dataset consists of audio recordings of people asking for flight information on an automated airline travel inquiry system, together with their verbatim transcripts. The data includes 17 unique intent categories. In the original split, the training, development, and test sets contain 4,478, 500, and 893 intent-labeled reference utterances, respectively.
  • Carnatic Varnam Dataset - The Carnatic Varnam Dataset is a collection of 28 solo vocal recordings, recorded for our research on the intonation analysis of Carnatic ragas. The collection includes the audio recordings, time-aligned tala cycle annotations, and svara notations in a machine-readable format.
  • Casual Conversations - The Casual Conversations dataset is designed to help researchers evaluate the accuracy of their computer vision and audio models across ages, genders, skin tones, and ambient lighting conditions.
  • CN-Celeb - CN-Celeb is a large-scale speaker recognition dataset collected "in the wild". It contains more than 130,000 utterances from 1,000 Chinese celebrities, covering 11 different genres found in the real world.
  • Clotho - Clotho is an audio captioning dataset of 4,981 audio samples, each with five captions (24,905 captions in total). Audio samples are 15 to 30 seconds long and captions are 8 to 20 words long.
  • Common Voice - Common Voice is an audio dataset consisting of unique MP3 files and corresponding text files. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata such as age, gender, and accent. It contains 7,335 validated hours across 60 languages.
  • CoVoST - CoVoST is a large-scale multilingual speech-to-text translation corpus. Its latest second version covers translations from 21 languages into English and from English into 15 languages. In total it has 2,880 hours of speech from 78,000 speakers with 66 accents.
  • CVSS - CVSS is a large-scale multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus, by synthesizing the translated text in CoVoST 2 into speech using state-of-the-art TTS systems.
  • EasyCom - The Easy Communications (EasyCom) dataset is the first dataset designed to help mitigate the cocktail-party effect from an augmented-reality (AR)-driven, multi-sensor egocentric perspective. The dataset contains egocentric multi-channel microphone array audio from AR glasses, wide-field-of-view RGB video, speech source poses, headset microphone audio, annotated voice activity, speech transcriptions, head and face bounding boxes, and sound source identification labels. We created and are releasing this dataset to facilitate research on multimodal AR solutions to the cocktail-party problem.
  • Emilia - The Emilia dataset is a comprehensive multilingual resource comprising over 101,000 hours of speech data in six languages: English (En), Chinese (Zh), German (De), French (Fr), Japanese (Ja), and Korean (Ko). It features diverse speaking styles drawn from a variety of video platforms and podcasts on the internet, spanning content types such as talk shows, interviews, debates, sports commentary, and audiobooks.
  • ESD (Emotional Speech Database) - ESD is an emotional speech database for voice conversion research. It consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers, covering 5 emotion categories (neutral, happy, angry, sad, and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies.
  • FPT Open Speech Dataset (FOSD) - This dataset consists of 25,921 recorded Vietnamese speeches with their transcripts and annotated start and end times for each speech, manually compiled from three sub-datasets publicly released by FPT Corporation in 2018, totaling roughly 30 hours.
  • Free Spoken Digit Dataset (FSDD) - A free audio dataset of spoken digits. Think of it as an audio version of MNIST. It is a simple audio/speech dataset of WAV recordings of spoken digits at an 8 kHz sampling rate. The recordings are trimmed so that there is near-minimal silence at the beginnings and ends.
  • Fluent Speech Commands - The Fluent Speech Commands dataset is an open-source audio dataset for spoken language understanding (SLU) experiments. Each utterance is labeled with "action", "object", and "location" values; for example, "turn on the lights in the kitchen" is labeled {"action": "activate", "object": "lights", "location": "kitchen"}. A model must predict all three values, and a prediction for an utterance counts as correct only if all three values are correct.
  • Genshin Datasets - A Genshin dataset for SVC/SVS/TTS.
  • GenshinVoice - A voice dataset of Genshin Impact.
  • GigaSpeech - GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 40,000 hours of unlabeled audio suitable for semi-supervised and unsupervised training.
  • GigaSpeech 2 - An evolving, large-scale, multi-domain ASR corpus for low-resource languages, with automated crawling, transcription, and refinement.
  • How2 - The How2 dataset contains 13,500 videos, or 300 hours of speech, split into 185,187 training utterances, 2,022 development utterances, and 2,361 test utterances. It comes with English subtitles and crowdsourced Portuguese translations.
  • inaGVAD - A challenging French TV and radio dataset annotated for voice activity detection (VAD) and speaker gender segmentation (SGS), with evaluation scripts and a detailed annotation scheme describing non-speech event types, speaker traits, and speech quality.
  • KdConv - KdConv is a Chinese multi-domain knowledge-driven conversation dataset that grounds the topics of multi-turn conversations in knowledge graphs. KdConv contains 4,500 conversations from three domains (film, music, and travel), with 86,000 utterances and an average of 19.0 turns per conversation. The conversations explore related topics in depth and transition naturally between multiple topics, and the corpus can also be used for research on transfer learning and domain adaptation.
  • Libriheavy - Libriheavy: a 50,000-hour ASR corpus with punctuation and context.
  • LibriSpeech - The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobooks collected by the LibriVox project. Most of the audiobooks come from Project Gutenberg. The training data is split into three partitions of 100, 360, and 500 hours, while the development and test data are each split into "clean" and "other" categories depending on how easy or challenging they are for automatic speech recognition systems. Each of the dev and test sets is around 5 hours of audio.
  • LibriTTS - LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at a 24 kHz sampling rate, prepared by Heiga Zen with the assistance of members of the Google Speech and Google Brain teams. The LibriTTS corpus is designed for TTS research and is derived from the original materials of the LibriSpeech corpus (MP3 audio files from LibriVox and text files from Project Gutenberg).
  • LibriTTS-R - LibriTTS-R: a restored multi-speaker text-to-speech corpus. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved.
  • LJSpeech (The LJ Speech Dataset) - This is a public-domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours. The texts were published between 1884 and 1964 and are in the public domain. The audio was recorded in 2016-17 by the LibriVox project and is also in the public domain.
  • LRS2 (Lip Reading Sentences 2) - The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences in the wild. The database consists mainly of news and talk shows from BBC programs. Each sentence is up to 100 characters in length.
  • LRW (Lip Reading in the Wild) - The Lip Reading in the Wild (LRW) dataset is a large-scale audio-visual database containing 500 different words spoken by more than 1,000 speakers. Each utterance clip has 29 frames, with its boundary centered on the target word. The database is divided into training, validation, and test sets. The training set contains at least 800 utterances per class, while the validation and test sets contain 50 utterances each.
  • MuAViC - A multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation.
  • MuST-C - MuST-C currently represents the largest publicly available multilingual (one-to-many) corpus for speech translation. It covers eight language directions, from English into German, Spanish, French, Italian, Dutch, Portuguese, Romanian, and Russian. The corpus consists of audio, transcriptions, and translations of English TED talks, and comes with predefined training, validation, and test splits.
  • MetaQA (MoviE Text Audio QA) - The MetaQA dataset consists of a movie ontology derived from the WikiMovies dataset and three sets of question-answer pairs written in natural language: 1-hop, 2-hop, and 3-hop queries.
  • MELD (Multimodal EmotionLines Dataset) - The Multimodal EmotionLines Dataset (MELD) was created by enhancing and extending the EmotionLines dataset. MELD contains the same dialogue instances available in EmotionLines, but also adds the audio and visual modalities. MELD includes more than 1,400 dialogues and 13,000 utterances from the TV series Friends, with multiple speakers participating in the dialogues. Each utterance is annotated with one of seven emotions: anger, disgust, sadness, joy, neutral, surprise, and fear. MELD also carries a sentiment (positive, negative, or neutral) annotation for each utterance.
  • Microsoft Speech Corpus (Indian languages) - The Microsoft Speech Corpus (Indian languages) release contains conversational and phrasal speech training and test data for Telugu, Tamil, and Gujarati. The data package includes audio and corresponding transcripts. The data in this dataset may not be used for commercial purposes and may only be used for research; published research must attribute: "Data provided by Microsoft and SpeechOcean.com".
  • PATS (Pose Audio Transcript Style) - The PATS dataset consists of a large amount of aligned pose, audio, and transcripts. With this dataset, we hope to provide a benchmark that helps develop technologies for virtual agents that generate natural and relevant gestures.
  • RealMAN - RealMAN: a real-recorded and annotated microphone array dataset for dynamic speech enhancement and localization.
  • SAVEE (Surrey Audio-Visual Expressed Emotion) - The Surrey Audio-Visual Expressed Emotion (SAVEE) dataset was recorded as a prerequisite for developing an automatic emotion recognition system. The database consists of recordings from 4 male actors in 7 different emotion states, 480 British English utterances in total. The sentences were chosen from the standard TIMIT corpus and phonetically balanced for each emotion.
  • SoS_Dataset - Sound of Story: multimodal storytelling with audio. Storytelling in the real world is often multimodal: when telling a story, people typically combine visuals, sounds, and other elements with the story itself. However, previous work on storytelling datasets and tasks has paid little attention to sound, even though sound also conveys important semantics of the story. We therefore propose extending story understanding and telling by introducing a new component called "background sound", audio that is grounded in the story context but carries no linguistic information.
  • Speech Dataset Collection - This is a curated list of open speech datasets for speech-related research (mainly automatic speech recognition). The repository collects more than 110 speech datasets, over 70 of which can be downloaded directly without further application or registration.
  • Speech Dataset Generator - Speech Dataset Generator is dedicated to creating datasets suitable for training text-to-speech or speech-to-text models. Its main functionality is to transcribe audio files, enhance audio quality when needed, and generate datasets.
  • 3D-Speaker Dataset - A large-scale multi-device, multi-distance, and multi-dialect human speech dataset.
  • TED-LIUM - Audio transcriptions of TED talks: 1,495 TED talk audio recordings along with their full transcriptions, produced by the Laboratoire d'Informatique de l'Université du Maine (LIUM).
  • The Flickr Audio Caption Corpus - The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery.
  • The People's Speech - The People's Speech is a free-to-download, growing 30,000-hour supervised conversational English speech recognition dataset, licensed for academic and commercial use under CC-BY-SA (with a CC-BY subset). The data was collected by searching the internet for appropriately licensed audio that already had transcriptions.
  • The Spoken Wikipedia Corpora - The Spoken Wikipedia project unites volunteer readers. Hundreds of spoken articles in multiple languages are available to users who, for whatever reason, are unable or unwilling to read the written articles.
  • TIMIT - The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus.
  • tts-frontend-dataset - TTS frontend dataset: polyphone / prosody / text normalization.
  • VoxCeleb2 - VoxCeleb2 is a large-scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 contains over a million utterances from more than 6,000 speakers. Since the data is collected "in the wild", the speech segments often carry real-world noise such as laughter, cross-talk, channel effects, and music. The dataset is also multilingual, with speech from speakers of 145 different nationalities, covering a wide range of accents, ages, ethnicities, and languages.
  • VoxConverse - VoxConverse is an audio-visual speaker diarisation dataset consisting of multi-speaker speech clips extracted from YouTube videos.
  • VoxLingua107 - VoxLingua107 is a dataset for spoken language identification with 6,628 hours of audio (62 hours per language on average), along with an evaluation set of 1,609 verified utterances.
  • VoxPopuli - VoxPopuli is a large-scale multilingual corpus providing 100,000 hours of unlabeled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 18,000 hours of transcribed speech in 16 languages together with their corresponding oral interpretations, totaling 51,000 hours.
  • VoxForge - VoxForge is an open speech dataset set up to collect transcribed speech for use with free and open-source speech recognition engines (on Linux, Windows, and Mac).
  • VocalSound - VocalSound is a free dataset consisting of 21,024 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. The VocalSound dataset also contains meta-information such as speaker age, gender, native language, country, and health condition.
  • VoiceBank + DEMAND - VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train and test speech enhancement methods that operate at 48 kHz. A more detailed description can be found in the paper associated with the database.
  • WaveFake - WaveFake is a dataset for audio deepfake detection. The dataset consists of over 100,000 generated audio clips.
  • WenetSpeech - WenetSpeech is a multi-domain Mandarin corpus consisting of more than 10,000 hours of high-quality labeled speech, about 2,400 hours of weakly labeled speech, and roughly 10,000 hours of unlabeled speech, over 22,000 hours in total. The authors collected the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions. They introduced an optical character recognition (OCR) based method to generate audio/text segmentation candidates from the subtitles of YouTube videos.
  • WSJ0-2mix - WSJ0-2mix is a speech mixture recognition corpus built from utterances in the Wall Street Journal (WSJ0) corpus.
  • WHAM! (WSJ0 Hipster Ambient Mixtures) - The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the WSJ0-2mix dataset with a unique background noise scene. The noise audio was collected at various urban locations throughout the San Francisco Bay Area in late 2018. The environments primarily consist of restaurants, cafes, bars, and parks. Audio was recorded using an Apogee Sennheiser binaural microphone mounted on a tripod between 1.0 and 1.5 meters above the ground.
  • YODAS - This is the manual/automatic subset of our YODAS dataset, containing 369,510 hours of speech. The dataset comprises audio utterances from YouTube with their corresponding captions (manual or automatic). Note that a manual caption only indicates that it was uploaded by a user; it was not necessarily transcribed by a human.
  • YODAS2 - YODAS2 is the long-form version of the YODAS dataset. It provides the same data as espnet/yodas, but with the following new features: 1. it is presented in long (video-level) form, with unsegmented audio; 2. the audio is encoded at a higher sampling rate (i.e., 24 kHz).
  • YTTTS - The YouTube Text-To-Speech dataset is comprised of waveform audio extracted from YouTube videos alongside their English transcriptions.
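The all-slots-correct scoring rule that Fluent Speech Commands uses (listed above) is easy to get subtly wrong, so here is a minimal sketch of it; the predicted labels below are made up for illustration, only the gold example comes from the dataset description:

```python
# Exact-match SLU scoring as described for Fluent Speech Commands:
# an utterance counts as correct only if the action, object, and
# location slots are ALL predicted correctly.
SLOTS = ("action", "object", "location")

def utterance_correct(pred, gold):
    """True only if every slot matches the reference labels."""
    return all(pred.get(s) == gold.get(s) for s in SLOTS)

gold = {"action": "activate", "object": "lights", "location": "kitchen"}

# Two of three slots right is still scored as wrong.
print(utterance_correct(
    {"action": "activate", "object": "lights", "location": "kitchen"}, gold))  # True
print(utterance_correct(
    {"action": "activate", "object": "lights", "location": "bedroom"}, gold))  # False
```

Accuracy over a test set is then simply the fraction of utterances for which this predicate holds.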

^ Back to Contents ^

Music

  • AAM: Artificial Audio Multitracks Dataset - This dataset contains 3,000 artificial music audio tracks with rich annotations. It is based on real instrument samples and generated by algorithmic composition with respect to music theory. It provides full mixes of the songs as well as single instrument tracks. The midis used for generation are also available. The annotation files include: Onsets, Pitches, Instruments, Keys, Tempos, Segments, Melody instrument, Beats, and Chords.
  • Acappella - Acappella comprises around 46 hours of a cappella solo singing videos sourced from YouTube, sampled across different singers and languages. Four languages are considered: English, Spanish, Hindi and others.
  • ADD: audio-dataset-downloader - Simple Python CLI script for downloading N hours of audio from YouTube, based on a list of music genres.
  • ADL Piano MIDI - The ADL Piano MIDI is a dataset of 11,086 piano pieces from different genres. This dataset is based on the Lakh MIDI dataset, which is a collection of 45,129 unique MIDI files that have been matched to entries in the Million Song Dataset.
  • Aligned Scores and Performances (ASAP) - ASAP is a dataset of aligned musical scores (both MIDI and MusicXML) and performances (audio and MIDI), all with downbeat, beat, time signature, and key signature annotations.
  • Annotated Jingju Arias Dataset - The Annotated Jingju Arias Dataset is a collection of 34 jingju arias manually segmented in various levels using the software Praat. The selected arias contain samples of the two main shengqiang in jingju, namely xipi and erhuang, and the five main role types in terms of singing, namely, dan, jing, laodan, laosheng and xiaosheng. The dataset is formed by Praat TextGrid files for each aria, containing tiers for the following information: aria, MusicBrainz ID, artist, school, role type, shengqiang, banshi, line of lyrics, syllables, and percussion patterns.
  • Bach Doodle - The Bach Doodle Dataset is composed of 21.6 million harmonizations submitted from the Bach Doodle. The dataset contains both metadata about the composition (such as the country of origin and feedback), as well as a MIDI of the user-entered melody and a MIDI of the generated harmonization. The dataset contains about 6 years of user entered music.
  • Bach Violin Dataset - A collection of high-quality public recordings of Bach's sonatas and partitas for solo violin (BWV 1001–1006).
  • Batik-plays-Mozart dataset - The Batik-plays-Mozart dataset is a piano performance dataset containing 12 complete Mozart Piano Sonatas (36 distinct movements) performed on a computer-monitored Bösendorfer grand piano by Viennese concert pianist Roland Batik. The performances are provided in MIDI format (the corresponding audio files are commercially available) and note-level aligned with scores in the New Mozart Edition in MusicXML, as well as musicological harmony, cadence, and phrase annotations previously published in The Annotated Mozart Sonatas.
  • Beijing Opera Percussion Instrument Dataset - Beijing Opera percussion dataset is a collection of 236 examples of isolated strokes spanning the four percussion instrument classes used in Beijing Opera. It can be used to build stroke models for each percussion instrument.
  • Beijing Opera Percussion Pattern Dataset - Beijing Opera Percussion Pattern (BOPP) dataset is a collection of 133 audio percussion patterns covering five pattern classes. The dataset includes the audio and syllable level transcriptions for the patterns (non-time aligned). It is useful for percussion transcription and classification tasks. The patterns have been extracted from audio recordings of arias and labeled by a musicologist.
  • BiMMuDa - The Billboard Melodic Music Dataset (BiMMuDa) is a MIDI dataset of the main melodies of the top five singles from the Billboard Year-End Singles Charts for each year from 1950 to 2022. This repository stores the dataset, as well as its metadata and appendices.
  • CAL500 (Computer Audition Lab 500) - CAL500 (Computer Audition Lab 500) is a dataset aimed for evaluation of music information retrieval systems. It consists of 502 songs picked from western popular music. The audio is represented as a time series of the first 13 Mel-frequency cepstral coefficients (and their first and second derivatives) extracted by sliding a 12 ms half-overlapping short-time window over the waveform of each song.
  • Carnatic Music Rhythm Dataset - The Carnatic Music Rhythm Dataset is a sub-collection of 176 excerpts (16.6 hours) in four taalas of Carnatic music with audio, associated tala related metadata and time aligned markers indicating the progression through the tala cycles. It is useful as a test corpus for many automatic rhythm analysis tasks in Carnatic music.
  • CCMixter - CCMixter is a singing voice separation dataset consisting of 50 full-length stereo tracks from ccMixter featuring many different musical genres. For each song there are three WAV files available: the background music, the voice signal, and their sum.
  • ChMusic - ChMusic is a traditional Chinese music dataset for training model and performance evaluation of musical instrument recognition. This dataset cover 11 musical instruments, consisting of Erhu, Pipa, Sanxian, Dizi, Suona, Zhuiqin, Zhongruan, Liuqin, Guzheng, Yangqin and Sheng.
  • chongchong-free - Chongchong Piano Downloader is software for freely downloading Chongchong piano scores: it can obtain the score's link, parse the score's content, and export the file.
  • ComMU - ComMU has 11,144 MIDI samples that consist of short note sequences created by professional composers, each with its corresponding 12 metadata fields. This dataset is designed for a new task, combinatorial music generation, which generates diverse and high-quality music solely from metadata through an auto-regressive language model.
  • CoSoD - CoSoD consists of metadata and analytical data of a 331-song corpus comprising all multi-artist collaborations on the Billboard “Hot 100” year-end charts published between 2010 and 2019. Each song in the dataset is associated with two CSV files: one for metadata and one for analytical data.
  • DALI - DALI: a large Dataset of synchronised Audio, LyrIcs and vocal notes.
  • DadaGP - DadaGP is a new symbolic music dataset comprising 26,181 song scores in the GuitarPro format covering 739 musical genres, along with an accompanying tokenized format well-suited for generative sequence models such as the Transformer. The tokenized format is inspired by event-based MIDI encodings, often used in symbolic music generation models. The dataset is released with an encoder/decoder which converts GuitarPro files to tokens and back.
  • DeepScores - Synthetic dataset of 300000 annotated images of written music for object classification, semantic segmentation and object detection. Based on a large set of MusicXML documents that were obtained from MuseScore, a sophisticated pipeline is used to convert the source into LilyPond files, for which LilyPond is used to engrave and annotate the images.
  • dMelodies - dMelodies is dataset of simple 2-bar melodies generated using 9 independent latent factors of variation where each data point represents a unique melody based on the following constraints: - Each melody will correspond to a unique scale (major, minor, blues, etc.). - Each melody plays the arpeggios using the standard I-IV-V-I cadence chord pattern. - Bar 1 plays the first 2 chords (6 notes), Bar 2 plays the second 2 chords (6 notes). - Each played note is an 8th note.
  • DISCO-10M - DISCO-10M is a music dataset created to democratize research on large-scale machine learning models for music.
  • Dizi - Dizi is a dataset covering the music styles of the Northern and Southern schools. Characteristics of the two styles, including melody and playing techniques, are deconstructed.
  • DreamSound - Recently, text-to-music generation models have achieved unprecedented results in synthesizing high-quality and diverse music samples from a given text prompt. Despite these advances, it remains unclear how one can generate personalized, user-specific musical concepts, manipulate them, and combine them with existing ones. Motivated by the computer vision literature, we investigate text-to-music by exploring two established methods, namely Textual Inversion and Dreambooth. Using quantitative metrics and a user study, we evaluate their ability to reconstruct and modify new musical concepts, given only a few samples. Finally, we provide a new dataset and propose an evaluation protocol for this new task.
  • EMOPIA - A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation. EMOPIA (pronounced ‘yee-mò-pi-uh’) dataset is a shared multi-modal (audio and MIDI) database focusing on perceived emotion in pop piano music, to facilitate research on various tasks related to music emotion. The dataset contains 1,087 music clips from 387 songs and clip-level emotion labels annotated by four dedicated annotators.
  • ErhuPT (Erhu Playing Technique Dataset) - This dataset is an audio dataset containing about 1500 audio clips recorded by multiple professional players.
  • FiloBass - A Dataset and Corpus Based Study of Jazz Basslines. FiloBass: a novel corpus of music scores and annotations which focuses on the important but often overlooked role of the double bass in jazz accompaniment. Inspired by recent work that sheds light on the role of the soloist, we offer a collection of 48 manually verified transcriptions of professional jazz bassists, comprising over 50,000 note events, which are based on the backing tracks used in the FiloSax dataset. For each recording we provide audio stems, scores, performance-aligned MIDI and associated metadata for beats, downbeats, chord symbols and markers for musical form.
  • Finding Tori - Finding Tori: Self-supervised Learning for Analyzing Korean Folk Song. We introduce a computational analysis of a field-recording dataset of approximately 700 hours of Korean folk songs, recorded around the 1980s-90s.
  • FMA - The Free Music Archive (FMA) is a large-scale dataset for evaluating several tasks in Music Information Retrieval. It consists of 343 days of audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies.
  • GiantMIDI-Piano - GiantMIDI-Piano is a classical piano MIDI dataset containing 10,855 MIDI files of 2,786 composers. The curated subset, constrained by composer surnames, contains 7,236 MIDI files of 1,787 composers.
  • Groove (Groove MIDI Dataset) - The Groove MIDI Dataset (GMD) is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive drumming. The dataset contains 1,150 MIDI files and over 22,000 measures of drumming.
  • GTSinger - GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks. We introduce GTSinger, a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks.
  • GuitarSet - GuitarSet: a dataset for guitar transcription.
  • Hindustani Music Rhythm Dataset - The Hindustani Music Rhythm Dataset is a sub-collection of 151 (5 hours) in four taals of Hindustani music with audio, associated taal related metadata and time aligned markers indicating the progression through the taal cycles. The dataset is useful as a test corpus for many automatic rhythm analysis tasks in Hindustani music.
  • HumTrans - The dataset can also serve as a foundation for downstream tasks such as humming melody based music generation. It consists of 500 musical compositions of different genres and languages, with each composition divided into multiple segments. In total, the dataset comprises 1000 music segments. To collect this humming dataset, we employed 10 college students, all of whom are either music majors or proficient in playing at least one musical instrument. Each of them hummed every segment twice using the web recording interface provided by our designed website. The humming recordings were sampled at a frequency of 44,100 Hz.
  • Indian Art Music Tonic Datasets - This dataset comprises 597 commercially available audio music recordings of Indian art music (Hindustani and Carnatic music), each manually annotated with the tonic of the lead artist. This dataset is used as the test corpus for the development of tonic identification approaches.
  • JamendoMaxCaps - JamendoMaxCaps is a large-scale audio-caption dataset comprising 362,238 full-length Jamendo tracks with descriptive captions generated by the Qwen2-Audio model. It includes the audio files themselves and rich metadata fields such as genre, speed, and various variable tags.
  • Jazz Harmony Treebank - This repository contains the Jazz Harmony Treebank, a corpus of hierarchical harmonic analyses of jazz chord sequences selected from the iRealPro corpus published on zenodo by Shanahan et al.
  • jazznet - jazznet: A Dataset of Fundamental Piano Patterns for Music Audio Machine Learning Research. This paper introduces the jazznet Dataset, a dataset of fundamental jazz piano music patterns for developing machine learning (ML) algorithms in music information retrieval (MIR). The dataset contains 162520 labeled piano patterns, including chords, arpeggios, scales, and chord progressions with their inversions, resulting in more than 26k hours of audio and a total size of 95GB.
  • Jingju A Cappella Singing Pitch Contour Dataset - Jingju A Cappella Singing Pitch Contour Dataset is a collection of pitch contour segment ground truth for 39 jingju a cappella singing recordings. The dataset includes the ground truth for (1) melodic transcription, (2) pitch contour segmentation. It is useful for melodic transcription and pitch contour segmentation tasks. The pitch contours have been extracted from audio recordings and manually corrected and segmented by a musicologist.
  • Jingju Music Scores Collection - This is a collection of 92 jingju music scores gathered for the analysis of jingju singing in terms of its musical system. They were transcribed from their original printed sources into a machine readable format, using MuseScore, and exporting them into MusicXML.
  • JS Fake Chorales - A MIDI dataset of 500 4-part chorales generated by the KS_Chorus algorithm, annotated with results from hundreds of listening test participants, with 300 further unannotated chorales.
  • LAION-DISCO-12M - The LAION-DISCO-12M dataset contains 12M links to music on YouTube, inspired by the methodology of DISCO-10M. Starting from an initial seed list of artists, we can discover new artists by recursively exploring the artists listed in the "Fans might also like" section. We explore the related artists graph for as long as we are able to find new artists.
  • LAKH MuseNet MIDI Dataset - Full LAKH MIDI dataset converted to MuseNet MIDI output format (9 instruments + drums).
  • Los Angeles MIDI Dataset - SOTA kilo-scale MIDI dataset for MIR and Music AI purposes.
  • LP-MusicCaps - LP-MusicCaps: LLM-Based Pseudo Music Captioning.
  • Lyra Dataset - Lyra is a dataset for Greek Traditional and Folk music that includes 1570 pieces, summing in around 80 hours of data. The dataset incorporates YouTube timestamped links for retrieving audio and video, along with rich metadata information with regards to instrumentation, geography and genre, among others.
  • MAESTRO - The MAESTRO dataset contains over 200 hours of paired audio and MIDI recordings from ten years of International Piano-e-Competition. The MIDI data includes key strike velocities and sustain/sostenuto/una corda pedal positions. Audio and MIDI files are aligned with ∼3 ms accuracy and sliced to individual musical pieces, which are annotated with composer, title, and year of performance. Uncompressed audio is of CD quality or higher (44.1–48 kHz 16-bit PCM stereo).
  • MagnaTagATune - MagnaTagATune dataset contains 25,863 music clips. Each clip is a 29-seconds-long excerpt belonging to one of the 5223 songs, 445 albums and 230 artists. The clips span a broad range of genres like Classical, New Age, Electronica, Rock, Pop, World, Jazz, Blues, Metal, Punk, and more. Each audio clip is supplied with a vector of binary annotations of 188 tags.
  • Main Dataset for "Evolution of Popular Music: USA 1960–2010" - This is a large file (~20MB) called EvolutionPopUSA_MainData.csv, in comma-separated data format with column headers. Each row corresponds to a recording. The file is viewable in any text editor, and can also be opened in Excel or imported to other data processing programs.
  • MetaMIDI Dataset - We introduce the MetaMIDI Dataset (MMD), a large-scale collection of 436,631 MIDI files and metadata. In addition to the MIDI files, we provide artist, title, and genre metadata that was collected during the scraping process when available. MIDIs in MMD were matched against a collection of 32,000,000 30-second audio clips retrieved from Spotify, resulting in over 10,796,557 audio-MIDI matches.
  • MidiCaps - MidiCaps is a large dataset of MIDI files with text captions fit for any symbolic music-text tasks (e.g., text-to-music generation). Built upon the LakhMIDI dataset with the help of LLM pseudolabeling, it contains 168,385 MIDI music files with associated text captions describing: genre, mood, instrumentation, tempo, key, chord sequence, time signature, and duration. All the extracted music attributes are available in the metadata too.
  • Million Song Dataset - This dataset contains a million songs from 1922-2011, with artist tagged information from Echonest (now part of Spotify), along with audio measurements, and other relevant information.
  • MIR-1K - MIR-1K (Multimedia Information Retrieval lab, 1000 song clips) is a dataset designed for singing voice separation.
  • Mridangam Stroke Dataset - The Mridangam Stroke dataset is a collection of 7162 audio examples of individual strokes of the Mridangam in various tonics. The dataset comprises of 10 different strokes played on Mridangams with 6 different tonic values. The dataset can be used for training models for each Mridangam stroke.
  • Mridangam Tani-avarthanam dataset - The Mridangam Tani-avarthanam dataset is a transcribed collection of two tani-avarthanams played by the renowned Mridangam maestro Padmavibhushan Umayalpuram K. Sivaraman. The audio was recorded at IIT Madras, India and annotated by professional Carnatic percussionists. It consists of about 24 min of audio and 8800 strokes.
  • MIRMLPop - It contains 1) annotation of the MIR-MLPop dataset, 2) the source code to obtain the audio of the dataset, 3) source code we used to fine-tune Whisper on MIR-MLPop (both lyrics alignment & lyrics transcription), and 4) source code for evaluation.
  • MSD (Million Song Dataset) - The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest.
  • MTG-Jamendo Dataset - We present the MTG-Jamendo Dataset, a new open dataset for music auto-tagging. It is built using music available at Jamendo under Creative Commons licenses and tags provided by content uploaders. The dataset contains over 55,000 full audio tracks with 195 tags from genre, instrument, and mood/theme categories. We provide elaborated data splits for researchers and report the performance of a simple baseline approach on five different sets of tags: genre, instrument, mood/theme, top-50, and overall.
  • MTG-Jamendo - The MTG-Jamendo dataset is an open dataset for music auto-tagging. The dataset contains over 55,000 full audio tracks with 195 tags (87 genre tags, 40 instrument tags, and 56 mood/theme tags). It is built using music available at Jamendo under Creative Commons licenses and tags provided by content uploaders. All audio is distributed in 320kbps MP3 format.
  • Music Data Sharing Platform for Computational Musicology Research (CCMUSIC DATASET) - This platform is a multi-functional music data sharing platform for Computational Musicology research. It contains many music datas such as the sound information of Chinese traditional musical instruments and the labeling information of Chinese pop music, which is available for free use by computational musicology researchers.
  • Music Emotion Recognition (MER) - We present a data set for the analysis of personalized Music Emotion Recognition (MER) systems. We developed the Music Enthusiasts platform aiming to improve the gathering and analysis of the so-called “ground truth” needed as input to such systems.
  • MUSAN - MUSAN is a corpus of music, speech and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises.
  • Musdb-XL-train - The musdb-XL-train dataset consists of 300,000 limiter-applied 4-second audio segments and the 100 original songs. For each segment, we randomly chose arbitrary segments from the 4 stems (vocals, bass, drums, other) of the musdb-HQ training subset and mixed them randomly. Then, we applied a commercial limiter plug-in to each stem.
  • MusicBench - MusicBench dataset is a collection of music-text pairs that was designed for text-to-music generation and released with Mustango text-to-music model. MusicCaps dataset is expanded from 5,521 samples to 52,768 training and 400 test samples to create MusicBench!
  • MusicNet - MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%. We offer the MusicNet labels to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.
  • MusicCaps - MusicCaps is a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
  • MuseData - MuseData is an electronic library of orchestral and piano classical music from CCARH. It consists of 783 files totaling around 3 MB.
  • MUSDB18 - The MUSDB18 is a dataset of 150 full lengths music tracks (~10h duration) of different genres along with their isolated drums, bass, vocals and others stems. The dataset is split into training and test sets with 100 and 50 songs, respectively. All signals are stereophonic and encoded at 44.1kHz.
  • Music Topics and Metadata - This dataset provides a list of lyrics from 1950 to 2019 describing music metadata as sadness, danceability, loudness, acousticness, etc. We also provide some informations as lyrics which can be used to natural language processing.
  • Music genres dataset - Dataset of 1494 genres, each containing 200 songs.
  • Multimodal Sheet Music Dataset - MSMD is a synthetic dataset of 497 pieces of (classical) music that contains both audio and score representations of the pieces aligned at a fine-grained level (344,742 pairs of noteheads aligned to their audio/MIDI counterpart).
  • MuVi-Sync - The MuVi-Sync dataset is a multi-modal dataset comprising both music features (chord, key, loudness, and note density) and video features (scene offset, emotion, motion, and semantic) extracted from a total of 748 music videos.
  • Nlakh - Nlakh is a dataset for Musical Instrument Retrieval. It is a combination of the NSynth dataset, which provides a large number of instruments, and the Lakh dataset, which provides multi-track MIDI data.
  • NSynth - NSynth is a dataset of one-shot instrumental notes, containing 305,979 musical notes with unique pitch, timbre, and envelope. The sounds were collected from 1,006 instruments from commercial sample libraries and are annotated based on their source (acoustic, electronic, or synthetic), instrument family, and sonic qualities. The instrument families used in the annotation are bass, brass, flute, guitar, keyboard, mallet, organ, reed, string, synth lead, and vocal. Four-second monophonic 16 kHz audio snippets (notes) were generated for the instruments.
  • NES-MDB (Nintendo Entertainment System Music Database) - The Nintendo Entertainment System Music Database (NES-MDB) is a dataset intended for building automatic music composition systems for the NES audio synthesizer. It consists of 5278 songs from the soundtracks of 397 NES games. The dataset represents 296 unique composers, and the songs contain more than two million notes combined. It has file format options for MIDI, score and NLM (NES Language Modeling).
  • Niko Chord Progression Dataset - The Niko Chord Progression Dataset is used in AccoMontage2. It contains 5k+ chord progression pieces, labeled with styles. There are four styles in total: Pop Standard, Pop Complex, Dark and R&B.
  • OnAir Music Dataset - 🎵 a new stem dataset for Music Demixing research, from the OnAir royalty-free music project.
  • Opencpop - Opencpop, a publicly available high-quality Mandarin singing corpus, is designed for singing voice synthesis (SVS) systems. The corpus consists of 100 unique Mandarin songs recorded by a professional female singer. All audio files were recorded at studio quality at a sampling rate of 44,100 Hz in a professional recording studio environment.
  • OpenGufeng - A melody and chord progression dataset for Chinese Gufeng Music.
  • PBSCSR - The Piano Bootleg Score Composer Style Recognition Dataset. Our overarching goal was to create a dataset for studying composer style recognition that is "as accessible as MNIST and as challenging as ImageNet." To achieve this goal, we sample fixed-length bootleg score fragments from piano sheet music images on IMSLP. The dataset itself contains 40,000 62x64 bootleg score images for a 9-way classification task, 100,000 62x64 bootleg score images for a 100-way classification task, and 29,310 unlabeled variable-length bootleg score images for pretraining.
  • POP909 - POP909 is a dataset which contains multiple versions of the piano arrangements of 909 popular songs created by professional musicians. The main body of the dataset contains the vocal melody, the lead instrument melody, and the piano accompaniment for each song in MIDI format, aligned to the original audio files. Furthermore, annotations of tempo, beat, key, and chords are provided; the tempo curves are hand-labeled and the rest are produced by MIR algorithms.
  • ProgGP - A dataset of 173 progressive metal songs, in both GuitarPro and token formats, as per the specifications in DadaGP.
  • RWC (Real World Computing Music Database) - The RWC (Real World Computing) Music Database is a copyright-cleared music database (DB) that is available to researchers as a common foundation for research. It contains around 100 complete songs with manually labeled section boundaries. For the 50 instruments, individual sounds at half-tone intervals were captured with several variations of playing styles, dynamics, instrument manufacturers and musicians.
  • Sangeet - An XML dataset for Hindustani classical music. SANGEET preserves all the required information of any given composition, including metadata and structural, notational, rhythmic, and melodic information, in a standardized way for easy and efficient storage and extraction of musical information. The dataset is intended to provide ground-truth information for music information research tasks, thereby supporting data-driven analyses from a machine learning perspective.
  • singKT-dataset - SingKT is a music performance assessment dataset for knowledge tracing (KT), which attempts to use knowledge tracing methods to capture dynamic changes in learners' sight-singing abilities. The dataset collects data from SingMaster, a public intelligent sight-singing practice platform. The SingKT dataset contains a main answering record table (RecordDS) and two supplementary information tables (UserDS, OpernDS). The UserDS table records sight-singing information for the 1,074 learners in the dataset, and the OpernDS table records music sheet information.
  • Slakh2100 - The Synthesized Lakh (Slakh) Dataset is a dataset for audio source separation that is synthesized from the Lakh MIDI Dataset v0.1 using professional-grade sample-based virtual instruments. This first release of Slakh, called Slakh2100, contains 2100 automatically mixed tracks and accompanying MIDI files synthesized using a professional-grade sampling engine. The tracks in Slakh2100 are split into training (1500 tracks), validation (375 tracks), and test (225 tracks) subsets, totaling 145 hours of mixtures.
  • SymphonyNet - SymphonyNet is an open-source project aiming to generate complex multi-track, multi-instrument music such as symphonies. Our method is fully compatible with other types of music, such as pop, piano, and solo music.
  • Tabla Solo dataset - The Tabla Solo Dataset is a transcribed collection of Tabla solo audio recordings spanning compositions from six different Gharanas of Tabla, played by Pt. Arvind Mulgaonkar. The dataset consists of audio and time aligned bol transcriptions.
  • Tegridy MIDI Dataset - Tegridy MIDI Dataset for creating precise and effective Music AI models.
  • The Lakh MIDI Dataset - The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. Its goal is to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files).
  • The Italian Music Dataset - The dataset is built using the Spotify and SoundCloud APIs. It is composed of over 14,500 songs by both famous and lesser-known Italian musicians. Each song in the dataset is identified by its Spotify id and its title. Track metadata also includes lemmatized and POS-tagged lyrics and, in most cases, ten musical features gathered directly from Spotify: acousticness (float), danceability (float), duration_ms (int), energy (float), instrumentalness (float), liveness (float), loudness (float), speechiness (float), tempo (float), and valence (float).
  • The Persian Piano Corpus - The Persian Piano corpus is a comprehensive collection of Persian piano music, spanning from early composers to contemporary figures. It has been meticulously compiled and made publicly accessible, aiming to enable researchers to explore specialized investigations and contribute to new discoveries. The instrument-based approach provides a complete corpus related to the Persian piano, including relevant labels and comprehensive metadata.
  • The Song Describer Dataset - The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation. The Song Describer dataset is an evaluation dataset made of ~1.1k captions for 706 permissively licensed music recordings.
  • Universal Music Symbol Classifier - A Python project that trains a Deep Neural Network to distinguish between Music Symbols.
  • URMP (University of Rochester Multi-Modal Musical Performance) - URMP is a dataset for facilitating audio-visual analysis of musical performances. It comprises 44 simple multi-instrument musical pieces assembled from coordinated but separately recorded performances of individual tracks. For each piece, the dataset provides the musical score in MIDI format, high-quality individual-instrument audio recordings, and videos of the assembled pieces.
  • VGMIDI Dataset - VGMIDI is a dataset of piano arrangements of video game soundtracks. It contains 200 MIDI pieces labelled according to emotion and 3,850 unlabeled pieces. Each labelled piece was annotated by 30 human subjects according to the Circumplex (valence-arousal) model of emotion.
  • Virtuoso Strings - Virtuoso Strings is a dataset for soft-onset detection in string instruments. It consists of over 144 recordings of professional performances of an excerpt from Haydn's string quartet Op. 74 No. 1 Finale, each with corresponding per-instrument onset annotations.
  • WikiMuTe - WikiMuTe: A web-sourced dataset of semantic descriptions for music audio. In this study, we present WikiMuTe, a new and open dataset containing rich semantic descriptions of music. The data is sourced from Wikipedia's rich catalogue of articles covering musical works. Using a dedicated text-mining pipeline, we extract both long and short-form descriptions covering a wide range of topics related to music content such as genre, style, mood, instrumentation, and tempo.
  • XMIDI - XMIDI is one of the largest known symbolic music datasets with precise emotion and genre labels, comprising 108,023 MIDI files. The average duration of the pieces is around 176 seconds, for a total dataset length of around 5,278 hours.
  • YM2413-MDB - YM2413-MDB is an 80s FM video game music dataset with multi-label emotion annotations. It includes 669 audio and MIDI files of music from Sega and MSX PC games in the 80s using YM2413, a programmable sound generator based on FM. The collected game music is arranged with a subset of 15 monophonic instruments and one drum instrument.
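
Several datasets above (MusicNet in particular) obtain note labels by aligning musical scores to recordings with dynamic time warping (DTW). As a point of reference, here is a minimal stdlib sketch of the classic DTW alignment cost over 1-D feature sequences; it is illustrative only, not MusicNet's actual alignment pipeline:

```python
def dtw_cost(a, b):
    """Dynamic time warping between two 1-D sequences.

    Returns the minimal cumulative alignment cost, using absolute
    difference as the local distance. O(len(a) * len(b)) time and memory.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats
                                 cost[i][j - 1],      # b[j-1] repeats
                                 cost[i - 1][j - 1])  # one-to-one match
    return cost[n][m]

print(dtw_cost([1, 2, 3], [1, 2, 2, 3]))  # identical up to a repeat -> 0.0
```

In score-to-audio alignment the scalars would be replaced by per-frame feature vectors (e.g. chroma), but the recurrence is the same.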

^ Back to Contents ^

Sound Effects

  • Animal Sound Dataset - This dataset contains 875 animal sounds covering 10 types of animal calls: 200 cat, 200 dog, 200 bird, 75 cow, 45 lion, 40 sheep, 35 frog, 30 chicken, 25 donkey, and 25 monkey sounds.
  • AudioSet - AudioSet is an audio event dataset of over 2 million human-annotated 10-second video clips. The clips are mostly drawn from YouTube, so quality varies and clips often contain multiple sound sources. The data is labeled with a hierarchical ontology of 632 event classes, so the same sound can carry several labels; for example, a dog bark may be annotated as "Animal", "Pet", and "Dog". All videos are split into an evaluation set, a balanced training set, and an unbalanced training set.
  • AudioCaps - AudioCaps is a dataset of sounds with event descriptions, designed for the audio captioning task; its sounds are sourced from the AudioSet dataset. Annotators were given the audio track together with category hints (and an additional video hint when necessary).
  • Auto-ACD - We present an innovative automatic audio captioning pipeline and construct a large-scale, high-quality audio-text dataset named Auto-ACD, comprising over 1.9 million audio-text pairs. The text descriptions in Auto-ACD are long (18 words on average) with a rich vocabulary (~23k words), and provide information about the surrounding auditory environment in which the sound occurs (i.e., data points with background noise).
  • AutoReCap - AutoReCap contains over 57 million audio-video-text pairs of sound effects. It is designed to provide large-scale, diverse captioned sound-effect clips, carefully filtered to exclude speech and music so that it is suitable for training large-scale sound-effect models.
  • BBC Sound Effects - The BBC Sound Effects dataset contains 33,066 sound effects with text descriptions. Type: mainly ambient sounds. Each audio clip comes with a natural-language description.
  • DCASE 2016 - DCASE 2016 is a dataset for sound event detection. It contains 11 sound classes (office sounds such as clearing throat, drawer opening, and keyboard typing), each with 20 short mono audio files, where each file contains a single instance of a sound event. The audio files are annotated with event onset and offset times, but silences between actual physical sounds (such as the pauses while a phone is ringing) are not marked and are therefore treated as part of the event.
  • Environmental Audio Datasets - This page is intended to maintain a list of datasets suitable for environmental audio research. In addition to freely available datasets, some proprietary and commercial datasets are listed for completeness; a number of online audio services are listed at the end of the page.
  • ESC-50 - The ESC-50 dataset is a labeled collection of 2,000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. It comprises 2,000 5-second clips across 50 classes covering natural, human, and domestic sounds, all drawn from Freesound.org.
  • FAIR-Play - FAIR-Play is a video-audio dataset consisting of 1,871 video clips and their corresponding binaural audio clips recorded in a music room. The video and binaural audio at the same index are roughly aligned.
  • FSD50K (Freesound Database 50K) - Freesound Dataset 50k (FSD50K) is an open, human-labeled dataset of sound events containing 51,197 Freesound clips distributed across 200 classes drawn from the AudioSet ontology. FSD50K was created by the Music Technology Group of Universitat Pompeu Fabra. It mainly contains sound events produced by physical sound sources and production mechanisms, including human sounds, sounds of things, animal sounds, natural sounds, musical instruments, and more.
  • FSDnoisy18k - The FSDnoisy18k dataset is an open dataset containing 42.5 hours of audio across 20 sound event classes, including a small amount of manually labeled data and a larger quantity of real-world noisy data. The audio content is taken from Freesound, and the dataset was curated using the Freesound Annotator. The noisy set contains 15,813 clips (38.8 hours), and the test set contains 947 clips (1.4 hours) with correct labels. The dataset features two main types of label noise: in-vocabulary (IV) and out-of-vocabulary (OOV). IV applies when, given an observed label that is incorrect or incomplete, the true or missing label belongs to the target class set; OOV means the true or missing label is not among those 20 classes.
  • FUSS (Free Universal Sound Separation) - The Free Universal Sound Separation (FUSS) dataset is a database of arbitrary sound mixtures and source-level reference signals, for use in experiments on arbitrary sound separation. FUSS is based on the FSD50K corpus.
  • iNaturalist Sounds Dataset - We present the iNaturalist Sounds Dataset (iNatSounds), a collection of 230,000 audio files capturing sounds from over 5,500 species worldwide, contributed by more than 27,000 recordists.
  • Knocking Sound Effects With Emotional Intentions - This dataset was recorded by professional foley artist Ulf Olausson at the FoleyWorks studios in Stockholm on October 15, 2019. Inspired by previous research on knocking sounds, we chose five emotions to be portrayed: anger, fear, happiness, neutral, and sadness.
  • MIMII - The Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection (MIMII) is a dataset of industrial machine sounds.
  • MIVIA Audio Events - The MIVIA audio events dataset comprises a total of 6,000 events for surveillance applications, namely glass breaking, gunshots, and screams. The 6,000 events are divided into a training set (4,200 events) and a test set (1,800 events).
  • Pitch Audio Dataset (Surge synthesizer) - 3.4 hours of audio synthesized with the open-source Surge synthesizer, based on the 2,084 presets included in the Surge package. These represent "natural" synthesis sounds, i.e. presets designed by humans. We generated 4-second samples played at velocity 64 with a note-on duration of 3 seconds. For each preset, we varied only the pitch, from MIDI 21 to 108, i.e. the range of a grand piano. All sounds in the dataset were RMS-normalized using the normalize tool. Because the dataset contains many duplicates, deduplication is difficult; however, only a handful of presets (such as drums and sound effects) have almost no perceptible pitch variation or ordering.
  • RemFX - RemFX: evaluation datasets. These datasets are originally sourced from the VocalSet, GuitarSet, DSD100, and IDMT-SMT-Drums datasets and then processed with our dataset generation scripts. The datasets are named according to the number of effects applied (0 to 5). For example, 2-2.zip indicates that two effects were applied to each input audio example, while the target audio remains unaffected. The audio effects applied are drawn from distortion, delay, dynamic range compressor, phaser, and reverb, sampled randomly for each example without replacement.
  • SoundCam - SoundCam is the largest dataset of unique in-the-wild room impulse responses publicly released to date. It includes 5,000 10-channel real-world room impulse response measurements and 2,000 10-channel music recordings, captured in three different rooms (a controlled acoustic lab, a real living room, and a conference room), each with different humans in different positions.
  • SoundingEarth - SoundingEarth consists of co-located aerial imagery and audio samples from around the globe.
  • Spatial LibriSpeech - Spatial LibriSpeech is a spatial audio dataset with over 650 hours of first-order ambisonics audio and optional distractor noise (the raw 19-channel audio is to be released later). Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics, and geometry. It was generated by augmenting LibriSpeech samples with over 200k simulated acoustic conditions across more than 8,000 synthetic rooms.
  • STARSS22 (Sony-TAu Realistic Spatial Soundscapes 2022) - The Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset contains recordings of real scenes captured with a high-channel-count spherical microphone array (SMA). The recordings were carried out by two teams at two sites: Tampere University in Tampere, Finland, and a Sony facility in Tokyo, Japan. The recording procedure and annotation approach were the same at both sites, with a similar organization.
  • ToyADMOS - The ToyADMOS dataset is a machine operating sounds dataset of approximately 540 hours of normal machine operating sounds and over 12,000 samples of anomalous sounds, collected with four microphones at a 48 kHz sampling rate. It was prepared by Yuma Koizumi and members of the NTT Media Intelligence Laboratories.
  • TUT Sound Events 2017 - The TUT Sound Events 2017 dataset contains 24 audio recordings in a street environment covering 6 different classes: brakes squeaking, car, children, large vehicle, people speaking, and people walking.
  • UrbanSound8K - Urban Sound 8K is an audio dataset that contains 8,732 labeled sound excerpts (up to 4 s) of urban sounds from 10 classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music. The classes are drawn from the urban sound taxonomy. All excerpts are taken from field recordings uploaded to www.freesound.org.
  • VGG-Sound - A large-scale audio-visual dataset. VGG-Sound is an audio-visual correspondence dataset consisting of short audio clips extracted from videos uploaded to YouTube.
  • Visually Indicated Sounds - Materials make distinctive sounds when they are hit or scratched: dirt makes a dull thud, while ceramic makes a crisp clink. These sounds reveal aspects of an object's material properties, as well as the force and motion of the physical interaction.
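
Many of the environmental-sound datasets listed above encode their labels in simple conventions rather than separate annotation files. ESC-50, for instance, documents a `{FOLD}-{CLIP_ID}-{TAKE}-{TARGET}.wav` filename scheme in its repository; a small parsing sketch (the field names here are our own):

```python
from typing import NamedTuple

class Esc50Clip(NamedTuple):
    fold: int     # cross-validation fold (1-5)
    clip_id: int  # Freesound ID of the source recording
    take: str     # take letter (A, B, ...) within the source recording
    target: int   # class index (0-49)

def parse_esc50_name(filename: str) -> Esc50Clip:
    """Parse an ESC-50 filename of the form {FOLD}-{CLIP_ID}-{TAKE}-{TARGET}.wav."""
    stem = filename.rsplit(".", 1)[0]
    fold, clip_id, take, target = stem.split("-")
    return Esc50Clip(int(fold), int(clip_id), take, int(target))

print(parse_esc50_name("1-100032-A-0.wav"))
```

Grouping clips by `fold` before splitting is what keeps ESC-50's predefined cross-validation folds intact during benchmarking.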

^ Back to Contents ^
