[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-castorini--honk":3,"similar-castorini--honk":92},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":8,"readme_en":9,"readme_zh":10,"quickstart_zh":11,"use_case_zh":12,"hero_image_url":13,"owner_login":14,"owner_name":15,"owner_avatar_url":16,"owner_bio":17,"owner_company":18,"owner_location":18,"owner_email":18,"owner_twitter":18,"owner_website":19,"owner_url":20,"languages":21,"stars":34,"forks":35,"last_commit_at":36,"license":37,"difficulty_score":38,"env_os":39,"env_gpu":40,"env_ram":41,"env_deps":42,"category_tags":55,"github_topics":18,"view_count":57,"oss_zip_url":18,"oss_zip_packed_at":18,"status":58,"created_at":59,"updated_at":60,"faqs":61,"releases":91},5261,"castorini\u002Fhonk","honk","PyTorch implementations of neural network models for keyword spotting","Honk 是一个基于 PyTorch 开发的开源项目，专注于实现用于关键词检测（Keyword Spotting）的卷积神经网络模型。它复现了谷歌在 TensorFlow 中提出的经典架构，旨在帮助设备高效识别特定的语音指令，如“停止”、“前进”或自定义的唤醒词（类似\"Hey Siri\"）。\n\nHonk 主要解决了在资源受限的边缘设备上部署轻量级语音识别功能的难题。通过优化模型结构，它能够在不依赖云端算力的情况下，让智能代理实时响应简单命令，从而降低延迟并保护用户隐私。\n\n这款工具非常适合人工智能研究人员、嵌入式系统开发者以及希望构建离线语音交互功能的技术团队使用。无论是学术探索还是实际产品落地，Honk 都提供了坚实的代码基础。\n\n其技术亮点在于不仅支持 PyTorch 后端，还兼容 Caffe2 和 ONNX 格式，提供了多种预训练模型以便灵活切换。此外，项目包含了完整的演示应用和针对树莓派等硬件的部署指南，展现了其在低功耗设备上的出色适应能力，是进入端侧语音识别领域的优质入门选择。","\u003Cp align=\"center\">\n  \u003Cimg width=\"400\" height=\"185,005\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcastorini_honk_readme_268185fe88fa.png\">\n\u003C\u002Fp>\n\n# Honk: CNNs for Keyword Spotting\n\nHonk is a PyTorch reimplementation of Google's TensorFlow convolutional neural networks for keyword spotting, which accompanies the recent release of their [Speech Commands Dataset](https:\u002F\u002Fresearch.googleblog.com\u002F2017\u002F08\u002Flaunching-speech-commands-dataset.html). For more details, please consult our writeup:\n\n+ Raphael Tang, Jimmy Lin. 
[Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting.](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.06554) _arXiv:1710.06554_, October 2017.\n+ Raphael Tang, Jimmy Lin. [Deep Residual Learning for Small-Footprint Keyword Spotting.](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.10361) _Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing_, pp. 5479-5483.\n\nHonk is useful for building on-device speech recognition capabilities for interactive intelligent agents. Our code can be used to identify simple commands (e.g., \"stop\" and \"go\") and be adapted to detect custom \"command triggers\" (e.g., \"Hey Siri!\").\n\nCheck out [this video](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UbAsDvinnXc) for a demo of Honk in action!\n\n## Demo Application\n\nUse the instructions below to run the demo application (shown in the above video) yourself!\n\nCurrently, PyTorch has official support for only Linux and OS X. Thus, Windows users will not be able to run this demo easily.\n\nTo deploy the demo, run the following commands:\n- If you do not have PyTorch, please see [the website](http:\u002F\u002Fpytorch.org).\n- Install Python dependencies: `pip install -r requirements.txt`\n- Install GLUT (OpenGL Utility Toolkit) through your package manager (e.g. 
`apt-get install freeglut3-dev`)\n- Fetch the data and models: `.\u002Ffetch_data.sh`\n- Start the PyTorch server: `python .`\n- Run the demo: `python utils\u002Fspeech_demo.py`\n\nIf you need to adjust options, like turning off CUDA, please edit `config.json`.\n\nAdditional notes for Mac OS X:\n- GLUT is already installed on Mac OS X, so that step isn't needed.\n- If you have issues installing pyaudio, [this](https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F33513522\u002Fwhen-installing-pyaudio-pip-cannot-find-portaudio-h-in-usr-local-include) may be the issue.\n\n## Server\n### Setup and deployment\n`python .` deploys the web service that identifies whether audio contains the command word. By default, `config.json` is used for configuration, but that can be changed with `--config=\u003Cfile_name>`. If the server is behind a firewall, one workflow is to create an SSH tunnel and use port forwarding with the port specified in the config (default 16888).\n\nIn our [honk-models](https:\u002F\u002Fgithub.com\u002Fcastorini\u002Fhonk-models) repository, there are several pre-trained models for Caffe2 (ONNX) and PyTorch. The `fetch_data.sh` script fetches these models and extracts them to the `model` directory. You may specify which model and backend to use in the config file's `model_path` and `backend`, respectively. Specifically, `backend` can be either `caffe2` or `pytorch`, depending on what format `model_path` is in. Note that, in order to run our ONNX models, the packages `onnx` and `onnx_caffe2` must be present on your system; they are not listed in `requirements.txt`.\n\n### Raspberry Pi (RPi) Infrastructure Setup\nUnfortunately, getting the libraries to work on the RPi, especially librosa, isn't as straightforward as running a few commands. We outline our process, which may or may not work for you.\n1. Obtain an RPi, preferably an RPi 3 Model B running Raspbian. 
Specifically, we used [this version](https:\u002F\u002Fdownloads.raspberrypi.org\u002Fraspbian\u002Fimages\u002Fraspbian-2017-09-08\u002F) of Raspbian Stretch.\n2. Install dependencies: `sudo apt-get install -y protobuf-compiler libprotoc-dev python-numpy python-pyaudio python-scipy python-sklearn`\n3. Install Protobuf: `pip install protobuf`\n4. Install ONNX without dependencies: `pip install --no-deps onnx`\n5. Follow the [official instructions](https:\u002F\u002Fcaffe2.ai\u002Fdocs\u002Fgetting-started.html?platform=raspbian&configuration=compile) for installing Caffe2 on Raspbian. This process takes about two hours. You may need to add the `caffe2` module path to the `PYTHONPATH` environment variable. For us, this was accomplished by `export PYTHONPATH=$PYTHONPATH:\u002Fhome\u002Fpi\u002Fcaffe2\u002Fbuild`\n6. Install the ONNX extension for Caffe2: `pip install onnx-caffe2`\n7. Install further requirements: `pip install -r requirements_rpi.txt`\n8. Install librosa: `pip install --no-deps resampy librosa`\n9. Try importing librosa: `python -c \"import librosa\"`. It should throw an error regarding numba, since we haven't installed it.\n10. We haven't found a way to easily install numba on the RPi, so we need to remove it from resampy. For our setup, we needed to remove numba and `@numba.jit` from `\u002Fhome\u002Fpi\u002F.local\u002Flib\u002Fpython2.7\u002Fsite-packages\u002Fresampy\u002Finterpn.py`\n11. All dependencies should now be installed. We should try deploying an ONNX model.\n12. Fetch the models and data: `.\u002Ffetch_data.sh`\n13. In `config.json`, change `backend` to `caffe2` and `model_path` to `model\u002Fgoogle-speech-dataset-full.onnx`.\n14. Deploy the server: `python .` If there are no errors, you have successfully deployed the model, accessible via port 16888 by default.\n15. Run the speech commands demo: `python utils\u002Fspeech_demo.py`. You'll need a working microphone and speakers. 
If you're interacting with your RPi remotely, you can run the speech demo locally and specify the remote endpoint `--server-endpoint=http:\u002F\u002F[RPi IP address]:16888`.\n\n## Utilities\n### QA client\nUnfortunately, the QA client has no support for the general public yet, since it requires a custom QA service. However, it can still be used to retarget the command keyword.\n\n`python client.py` runs the QA client. You may retarget a keyword by running `python client.py --mode=retarget`. Please note that text-to-speech may not work well on Linux distros; in this case, please supply IBM Watson credentials via `--watson-username` and `--watson-password`. You can view all the options by running `python client.py -h`.\n\n### Training and evaluating the model\n**CNN models**. `python -m utils.train --type [train|eval]` trains or evaluates the model. It expects all training examples to follow the same format as the [Speech Commands Dataset](http:\u002F\u002Fdownload.tensorflow.org\u002Fdata\u002Fspeech_commands_v0.02.tar.gz). The recommended workflow is to download the dataset and add custom keywords, since the dataset already contains many useful audio samples and background noise.\n\n**Residual models**. We recommend the following hyperparameters for training any of our `res{8,15,26}[-narrow]` models on the Speech Commands Dataset:\n```\npython -m utils.train --wanted_words yes no up down left right on off stop go --dev_every 1 --n_labels 12 --n_epochs 26 --weight_decay 0.00001 --lr 0.1 0.01 0.001 --schedule 3000 6000 --model res{8,15,26}[-narrow]\n```\nFor more information about our deep residual models, please see our paper:\n\n+ Raphael Tang, Jimmy Lin. 
[Deep Residual Learning for Small-Footprint Keyword Spotting.](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.10361)  _Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018)_, April 2018, Calgary, Alberta, Canada.\n\nThere are command options available:\n\n| option         | input format | default | description |\n|----------------|--------------|---------|-------------|\n| `--audio_preprocess_type`   | {MFCCs, PCEN}       | MFCCs     | type of audio preprocess to use            |\n| `--batch_size`   | [1, n)       | 100     | the mini-batch size to use            |\n| `--cache_size`   | [0, inf)       | 32768     | number of items in audio cache, consumes around 32 KB * n   |\n| `--conv1_pool`   | [1, inf) [1, inf) | 2 2     | the width and height of the pool filter       |\n| `--conv1_size`     | [1, inf) [1, inf) | 10 4  | the width and height of the conv filter            |\n| `--conv1_stride`     | [1, inf) [1, inf) | 1 1  | the width and length of the stride            |\n| `--conv2_pool`   | [1, inf) [1, inf) |  1 1    | the width and height of the pool filter            |\n| `--conv2_size`     | [1, inf) [1, inf) | 10 4  | the width and height of the conv filter            |\n| `--conv2_stride`     | [1, inf) [1, inf) | 1 1  | the width and length of the stride            |\n| `--data_folder`   | string       | \u002Fdata\u002Fspeech_dataset     | path to data       |\n| `--dev_every`    | [1, inf)     |  10     | dev interval in terms of epochs            |\n| `--dev_pct`   | [0, 100]       | 10     | percentage of total set to use for dev        |\n| `--dropout_prob` | [0.0, 1.0)   | 0.5     | the dropout rate to use            |\n| `--gpu_no`     | [-1, n] | 1  | the gpu to use            |\n| `--group_speakers_by_id` | {true, false} | true | whether to group speakers across train\u002Fdev\u002Ftest |\n| `--input_file`   | string       |      | the path to the model to load   |\n| `--input_length`   
| [1, inf)       | 16000     | the length of the audio   |\n| `--lr`           | (0.0, inf)   | {0.1, 0.001}   | the learning rate to use            |\n| `--type`         | {train, eval}| train   | the mode to use            |\n| `--model`        | string       | cnn-trad-pool2 | one of `cnn-trad-pool2`, `cnn-tstride-{2,4,8}`, `cnn-tpool{2,3}`, `cnn-one-fpool3`, `cnn-one-fstride{4,8}`, `res{8,15,26}[-narrow]`, `cnn-trad-fpool3`, `cnn-one-stride1` |\n| `--momentum` | [0.0, 1.0) | 0.9 | the momentum to use for SGD |\n| `--n_dct_filters`| [1, inf)     | 40      | the number of DCT bases to use  |\n| `--n_epochs`     | [0, inf) | 500  | number of epochs            |\n| `--n_feature_maps` | [1, inf) | {19, 45} | the number of feature maps to use for the residual architecture |\n| `--n_feature_maps1` | [1, inf)             | 64        | the number of feature maps for conv net 1            |\n| `--n_feature_maps2`   | [1, inf)       | 64     | the number of feature maps for conv net 2        |\n| `--n_labels`   | [1, n)       | 4     | the number of labels to use            |\n| `--n_layers` | [1, inf) | {6, 13, 24} | the number of convolution layers for the residual architecture |\n| `--n_mels`       | [1, inf)     |   40    | the number of Mel filters to use            |\n| `--no_cuda`      | switch     | false   | whether to disable CUDA            |\n| `--noise_prob`     | [0.0, 1.0] | 0.8  | the probability of mixing with noise    |\n| `--output_file`   | string    | model\u002Fgoogle-speech-dataset.pt     | the file to save the model to        |\n| `--seed`   | (-inf, inf)       | 0     | the seed to use        |\n| `--silence_prob`     | [0.0, 1.0] | 0.1  | the probability of picking silence    |\n| `--test_pct`   | [0, 100]       | 10     | percentage of total set to use for testing       |\n| `--timeshift_ms`| [0, inf)       | 100    | time in milliseconds to shift the audio randomly |\n| `--train_pct`   | [0, 100]       | 80     | percentage of total set to use for 
training       |\n| `--unknown_prob`     | [0.0, 1.0] | 0.1  | the probability of picking an unknown word    |\n| `--wanted_words` | string1 string2 ... stringn  | command random  | the desired target words            |\n\n### JavaScript-based Keyword Spotting\n\n[Honkling](https:\u002F\u002Fgithub.com\u002Fcastorini\u002Fhonkling) is a JavaScript implementation of Honk.\nWith Honkling, you can build web applications with in-browser keyword spotting functionality.\n\n### Keyword Spotting Data Generator\n\nTo improve the flexibility of Honk and Honkling, we provide a program that constructs a dataset from YouTube videos.\nDetails can be found in the `keyword_spotting_data_generator` folder.\n\n### Recording audio\n\nRun the following to record sequential audio and save it in the same format as the Speech Commands Dataset:\n```\npython -m utils.record\n```\nPress Return to record, the up arrow to undo, and \"q\" to finish. After one second of silence, recording automatically halts.\n\nSeveral options are available:\n```\n--output-begin-index: Starting sequence number\n--output-prefix: Prefix of the output audio sequence\n--post-process: How the audio samples should be post-processed. One or more of \"trim\" and \"discard_true\".\n```\nPost-processing consists of trimming or discarding \"useless\" audio. Trimming is self-explanatory: the audio recordings are trimmed to the loudest window of *x* milliseconds, specified by `--cutoff-ms`. Discarding \"useless\" audio (`discard_true`) uses a pre-trained model to determine which samples are confusing, discarding correctly labeled ones. The pre-trained model and correct label are defined by `--config` and `--correct-label`, respectively.\n\nFor example, consider `python -m utils.record --post-process trim discard_true --correct-label no --config config.json`. 
In this case, the utility records a sequence of speech snippets, trims them to one second, and finally discards those not labeled \"no\" by the model in `config.json`.\n\n### Listening to sound level\n\n```bash\npython manage_audio.py listen\n```\n\nThis assists in setting sane values for `--min-sound-lvl` for recording.\n\n### Generating contrastive examples\n`python manage_audio.py generate-contrastive --directory [directory]` generates contrastive examples from all .wav files in `[directory]` using phonetic segmentation.\n\n### Trimming audio\nSpeech command dataset contains one-second-long snippets of audio.\n\n`python manage_audio.py trim --directory [directory]` trims to the loudest one-second for all .wav files in `[directory]`. The careful user should manually check all audio samples using an audio editor like Audacity.\n","\u003Cp align=\"center\">\n  \u003Cimg width=\"400\" height=\"185,005\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcastorini_honk_readme_268185fe88fa.png\">\n\u003C\u002Fp>\n\n# Honk：用于关键词检测的卷积神经网络\n\nHonk 是 Google TensorFlow 卷积神经网络用于关键词检测的 PyTorch 重新实现，伴随着他们最近发布的 [语音命令数据集](https:\u002F\u002Fresearch.googleblog.com\u002F2017\u002F08\u002Flaunching-speech-commands-dataset.html)。更多详情，请参阅我们的论文：\n\n+ Raphael Tang, Jimmy Lin. [Honk：用于关键词检测的卷积神经网络的 PyTorch 重新实现。](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.06554) _arXiv:1710.06554_, 2017年10月。\n+ Raphael Tang, Jimmy Lin. 
[面向小规模部署的关键词检测的深度残差学习。](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.10361) _2018 IEEE 国际声学、语音与信号处理会议论文集_，第5479–5483页。\n\nHonk 对于构建交互式智能代理的设备端语音识别功能非常有用。我们的代码可用于识别简单命令（例如“停止”和“开始”），并可进一步调整以检测自定义“命令触发词”（例如“嘿，Siri！”）。\n\n请观看[这段视频](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UbAsDvinnXc)，了解 Honk 的实际演示！\n\n## 演示应用\n\n按照以下说明，您也可以运行上面视频中展示的演示应用！\n\n目前，PyTorch 官方仅支持 Linux 和 OS X 系统。因此，Windows 用户将无法轻松运行此演示。\n\n要部署演示程序，请执行以下步骤：\n- 如果尚未安装 PyTorch，请访问 [官方网站](http:\u002F\u002Fpytorch.org)。\n- 安装 Python 依赖项：`pip install -r requirements.txt`\n- 通过包管理器安装 GLUT（OpenGL 工具包），例如 `apt-get install freeglut3-dev`。\n- 获取数据和模型：`.\u002Ffetch_data.sh`\n- 启动 PyTorch 服务器：`python .`\n- 运行演示程序：`python utils\u002Fspeech_demo.py`\n\n如果需要调整选项，例如关闭 CUDA，请编辑 `config.json` 文件。\n\nMac OS X 的补充说明：\n- Mac OS X 已经预装了 GLUT，因此无需额外安装。\n- 如果在安装 pyaudio 时遇到问题，可能是由于缺少 [portaudio.h](https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F33513522\u002Fwhen-installing-pyaudio-pip-cannot-find-portaudio-h-in-usr-local-include) 文件导致的。\n\n## 服务器\n### 设置与部署\n`python .` 将部署用于识别音频中是否包含命令词的 Web 服务。默认情况下，使用 `config.json` 进行配置，但也可以通过 `--config=\u003Cfile_name>` 更改配置文件。如果服务器位于防火墙之后，一种方法是创建 SSH 隧道，并使用配置文件中指定的端口（默认为 16888）进行端口转发。\n\n在我们的 [honk-models](https:\u002F\u002Fgithub.com\u002Fcastorini\u002Fhonk-models) 仓库中，提供了多个针对 Caffe2（ONNX）和 PyTorch 的预训练模型。`fetch_data.sh` 脚本会下载这些模型并将其解压到 `model` 目录下。您可以在配置文件的 `model_path` 和 `backend` 中分别指定使用的模型和后端。其中，`backend` 可以是 `caffe2` 或 `pytorch`，具体取决于 `model_path` 所在的格式。请注意，要运行我们的 ONNX 模型，您的系统必须安装 `onnx` 和 `onnx_caffe2` 包；而这些包并未包含在 `requirements.txt` 中。\n\n### 树莓派 (RPi) 基础设施设置\n遗憾的是，在 RPi 上使库正常工作，尤其是 librosa，并不像运行几条命令那样简单。我们在此概述了我们的流程，但这可能并不适用于所有人。\n1. 准备一台 RPi，最好是运行 Raspbian 的 RPi 3 Model B。我们具体使用的是 [这个版本](https:\u002F\u002Fdownloads.raspberrypi.org\u002Fraspbian\u002Fimages\u002Fraspbian-2017-09-08\u002F)的 Raspbian Stretch。\n2. 安装依赖项：`sudo apt-get install -y protobuf-compiler libprotoc-dev python-numpy python-pyaudio python-scipy python-sklearn`\n3. 
安装 Protobuf：`pip install protobuf`\n4. 安装 ONNX，不带依赖项：`pip install --no-deps onnx`\n5. 按照 [官方说明](https:\u002F\u002Fcaffe2.ai\u002Fdocs\u002Fgetting-started.html?platform=raspbian&configuration=compile) 在 Raspbian 上安装 Caffe2。这一过程大约需要两个小时。您可能需要将 `caffe2` 模块路径添加到 `PYTHONPATH` 环境变量中。对我们而言，操作如下：`export PYTHONPATH=$PYTHONPATH:\u002Fhome\u002Fpi\u002Fcaffe2\u002Fbuild`\n6. 安装 Caffe2 的 ONNX 扩展：`pip install onnx-caffe2`\n7. 安装其他要求：`pip install -r requirements_rpi.txt`\n8. 安装 librosa：`pip install --no-deps resampy librosa`\n9. 尝试导入 librosa：`python -c \"import librosa\"`。此时应会抛出关于 numba 的错误，因为我们尚未安装它。\n10. 我们尚未找到在 RPi 上轻松安装 numba 的方法，因此需要从 resampy 中移除 numba。对于我们的设置，我们需要从 `\u002Fhome\u002Fpi\u002F.local\u002Flib\u002Fpython2.7\u002Fsite-packages\u002Fresampy\u002Finterpn.py` 中移除 numba 和 `@numba.jit`。\n11. 至此，所有依赖项均已安装完毕。我们可以尝试部署一个 ONNX 模型。\n12. 下载模型和数据：`.\u002Ffetch_data.sh`\n13. 在 `config.json` 中，将 `backend` 改为 `caffe2`，并将 `model_path` 改为 `model\u002Fgoogle-speech-dataset-full.onnx`。\n14. 部署服务器：`python .` 如果没有报错，则表示模型已成功部署，默认可通过端口 16888 访问。\n15. 
运行语音命令演示：`python utils\u002Fspeech_demo.py`。您需要一个可用的麦克风和扬声器。如果您远程控制 RPi，可以本地运行语音演示，并通过 `--server-endpoint=http:\u002F\u002F[RPi IP地址]:16888` 指定远程端点。\n\n## 工具\n### QA 客户端\n遗憾的是，QA 客户端目前尚不支持普通用户，因为它需要一个自定义的 QA 服务。不过，它仍然可以用来重新定位命令关键词。\n\n`python client.py` 用于运行 QA 客户端。您可以通过 `python client.py --mode=retarget` 来重新定位关键词。请注意，文本转语音功能在 Linux 发行版上可能表现不佳；在这种情况下，请通过 `--watson-username` 和 `--watson-password` 提供 IBM Watson 的凭据。您可以运行 `python client.py -h` 查看所有选项。\n\n### 模型的训练与评估\n**CNN 模型**。`python -m utils.train --type [train|eval]` 用于训练或评估模型。它假定所有训练样本都遵循与 [Speech Commands 数据集](http:\u002F\u002Fdownload.tensorflow.org\u002Fdata\u002Fspeech_commands_v0.02.tar.gz) 相同的格式。推荐的工作流程是下载该数据集并添加自定义关键词，因为该数据集已经包含许多有用的音频样本和背景噪声。\n\n**残差模型**。我们建议在 Speech Commands 数据集上训练我们的任何 `res{8,15,26}[-narrow]` 模型时使用以下超参数：\n```\npython -m utils.train --wanted_words yes no up down left right on off stop go --dev_every 1 --n_labels 12 --n_epochs 26 --weight_decay 0.00001 --lr 0.1 0.01 0.001 --schedule 3000 6000 --model res{8,15,26}[-narrow]\n```\n有关我们深度残差模型的更多信息，请参阅我们的论文：\n\n+ Raphael Tang, Jimmy Lin. 
[Deep Residual Learning for Small-Footprint Keyword Spotting.](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.10361) _2018 IEEE 国际声学、语音与信号处理会议（ICASSP 2018）论文集_，2018年4月，加拿大阿尔伯塔省卡尔加里。\n\n以下是可用的命令选项：\n\n| 选项                 | 输入格式         | 默认值   | 描述                                                         |\n|----------------------|------------------|----------|--------------------------------------------------------------|\n| `--audio_preprocess_type`   | {MFCCs, PCEN}       | MFCCs     | 使用的音频预处理类型                                        |\n| `--batch_size`   | [1, n)       | 100     | 使用的小批量大小                                            |\n| `--cache_size`   | [0, inf)       | 32768     | 音频缓存中的项目数量，大约消耗 32 KB * n                   |\n| `--conv1_pool`   | [1, inf) [1, inf) | 2 2     | 池化滤波器的宽度和高度                                      |\n| `--conv1_size`     | [1, inf) [1, inf) | 10 4  | 卷积滤波器的宽度和高度                                      |\n| `--conv1_stride`     | [1, inf) [1, inf) | 1 1  | 步幅的宽度和长度                                            |\n| `--conv2_pool`   | [1, inf) [1, inf) |  1 1    | 池化滤波器的宽度和高度                                      |\n| `--conv2_size`     | [1, inf) [1, inf) | 10 4  | 卷积滤波器的宽度和高度                                      |\n| `--conv2_stride`     | [1, inf) [1, inf) | 1 1  | 步幅的宽度和长度                                            |\n| `--data_folder`   | string       | \u002Fdata\u002Fspeech_dataset     | 数据路径                                                    |\n| `--dev_every`    | [1, inf)     |  10     | 以 epoch 为单位的验证间隔                                    |\n| `--dev_pct`   | [0, 100]       | 10     | 用于验证的总数据百分比                                      |\n| `--dropout_prob` | [0.0, 1.0)   | 0.5     | 使用的丢弃率                                                |\n| `--gpu_no`     | [-1, n] | 1  | 使用的 GPU                                                  |\n| `--group_speakers_by_id` | {true, false} | true | 是否按 ID 对训练\u002F验证\u002F测试集中的说话人分组                 
   |\n| `--input_file`   | string       |      | 要加载的模型路径                                            |\n| `--input_length`   | [1, inf)       | 16000     | 音频的长度                                                  |\n| `--lr`           | (0.0, inf)   | {0.1, 0.001}   | 使用的学习率                                                |\n| `--type`         | {train, eval}| train   | 使用的模式                                                  |\n| `--model`        | string       | cnn-trad-pool2 | 可选 `cnn-trad-pool2`, `cnn-tstride-{2,4,8}`, `cnn-tpool{2,3}`, `cnn-one-fpool3`, `cnn-one-fstride{4,8}`, `res{8,15,26}[-narrow]`, `cnn-trad-fpool3`, `cnn-one-stride1` |\n| `--momentum` | [0.0, 1.0) | 0.9 | 用于 SGD 的动量                                             |\n| `--n_dct_filters`| [1, inf)     | 40      | 使用的 DCT 基础数量                                         |\n| `--n_epochs`     | [0, inf) | 500  | 训练的 epoch 数                                             |\n| `--n_feature_maps` | [1, inf) | {19, 45} | 残差架构中使用的特征图数量                                  |\n| `--n_feature_maps1` | [1, inf)             | 64        | 第一个卷积网络使用的特征图数量                            |\n| `--n_feature_maps2`   | [1, inf)       | 64     | 第二个卷积网络使用的特征图数量                              |\n| `--n_labels`   | [1, n)       | 4     | 使用的标签数量                                              |\n| `--n_layers` | [1, inf) | {6, 13, 24} | 残差架构中的卷积层数                                        |\n| `--n_mels`       | [1, inf)     |   40    | 使用的 Mel 滤波器数量                                       |\n| `--no_cuda`      | 开关     | false   | 是否使用 CUDA                                               |\n| `--noise_prob`     | [0.0, 1.0] | 0.8  | 与噪声混合的概率                                            |\n| `--output_file`   | string    | model\u002Fgoogle-speech-dataset.pt     | 保存模型的文件                                              |\n| `--seed`   | (inf, inf)       | 0     | 使用的随机种子                                              |\n| `--silence_prob`     | [0.0, 
1.0] | 0.1  | 选择静音的概率                                              |\n| `--test_pct`   | [0, 100]       | 10     | 用于测试的总数据百分比                                      |\n| `--timeshift_ms`| [0, inf)       | 100    | 随机偏移音频的时间（毫秒）                                  |\n| `--train_pct`   | [0, 100]       | 80     | 用于训练的总数据百分比                                      |\n| `--unknown_prob`     | [0.0, 1.0] | 0.1  | 选择未知单词的概率                                          |\n| `--wanted_words` | string1 string2 ... stringn  | command random  | 所需的目标词汇                                              |\n\n### 基于 JavaScript 的关键词检测\n\n[Honkling](https:\u002F\u002Fgithub.com\u002Fcastorini\u002Fhonkling) 是 Honk 的 JavaScript 实现。借助 Honkling，可以在浏览器中实现具有关键词检测功能的各种 Web 应用程序。\n\n### 关键词检测数据生成器\n\n为了提高 Honk 和 Honkling 的灵活性，我们提供了一个从 YouTube 视频构建数据集的程序。详细信息请参见 `keyword_spotting_data_generator` 文件夹。\n\n### 录制音频\n\n您可以执行以下操作来录制连续的音频，并将其保存为与语音命令数据集相同的格式：\n```\npython -m utils.record\n```\n输入回车键开始录制，向上箭头撤销，输入“q”结束。静音一秒钟后，录制会自动停止。\n\n此外，还提供以下选项：\n```\n--output-begin-index: 起始序列号\n--output-prefix: 输出音频序列的前缀\n--post-process: 音频样本的后处理方式。可选择“trim”和“discard_true”中的一项或多件。\n```\n后处理包括修剪或丢弃“无用”的音频。修剪很简单：将音频录音修剪至由 `--cutoff-ms` 指定的 *x* 毫秒内最响亮的部分。而丢弃“无用”音频（`discard_true`）则使用预先训练好的模型来判断哪些样本令人困惑，并丢弃那些被正确标记的样本。预先训练好的模型和正确标签分别由 `--config` 和 `--correct-label` 定义。\n\n例如，考虑运行 `python -m utils.record --post-process trim discard_true --correct-label no --config config.json`。在这种情况下，该工具会记录一系列语音片段，将其修剪至一秒，最后丢弃那些未被 `config.json` 中的模型标记为“no”的片段。\n\n### 监听声音级别\n\n```bash\npython manage_audio.py listen\n```\n\n这有助于为录制设置合理的 `--min-sound-lvl` 值。\n\n### 生成对比样本\n`python manage_audio.py generate-contrastive --directory [目录]` 会使用音素分割技术，从 `[目录]` 中的所有 `.wav` 文件中生成对比样本。\n\n### 裁剪音频\n语音命令数据集包含时长为一秒钟的音频片段。\n\n`python manage_audio.py trim --directory [目录]` 会将 `[目录]` 中所有 `.wav` 文件裁剪为最响亮的一秒。建议谨慎的用户使用 Audacity 等音频编辑软件手动检查所有音频样本。","# Honk 快速上手指南\n\nHonk 是一个基于 PyTorch 的关键词检测（Keyword Spotting, KWS）工具，复现了 Google 
用于语音命令数据集的卷积神经网络模型。它适用于构建设备端语音识别功能，可识别简单指令（如\"stop\"、\"go\"）或自定义唤醒词（如\"Hey Siri\"）。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**：官方支持 Linux 和 macOS。Windows 用户运行演示应用较为困难，建议在 Linux 环境或虚拟机中使用。\n- **硬件**：如需使用 GPU 加速，请确保已安装 CUDA 驱动；树莓派（Raspberry Pi）需特定配置（见下文备注）。\n- **Python 版本**：建议 Python 3.6+。\n\n### 前置依赖\n1. **PyTorch**：请访问 [PyTorch 官网](https:\u002F\u002Fpytorch.org) 安装适合你环境的版本。\n   > **国内加速建议**：可使用清华或中科大镜像源安装：\n   > ```bash\n   > pip install torch torchvision torchaudio -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n   > ```\n2. **系统库**：\n   - Linux: 安装 OpenGL Utility Toolkit (GLUT)\n     ```bash\n     sudo apt-get install freeglut3-dev\n     ```\n   - macOS: GLUT 已预装，无需额外操作。\n   - 若安装 `pyaudio` 报错找不到 `portaudio.h`，请参考相关教程安装 `portaudio` 开发库。\n\n## 安装步骤\n\n1. **克隆项目并安装 Python 依赖**\n   ```bash\n   git clone \u003Crepository_url>  # 替换为实际仓库地址\n   cd honk\n   pip install -r requirements.txt\n   ```\n   > **国内加速建议**：\n   > ```bash\n   > pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n   > ```\n\n2. **获取数据与预训练模型**\n   运行脚本自动下载 Speech Commands 数据集及预训练模型：\n   ```bash\n   .\u002Ffetch_data.sh\n   ```\n   *注：模型文件将解压至 `model` 目录。*\n\n3. **配置选项（可选）**\n   如需关闭 CUDA 或修改端口，编辑 `config.json` 文件。\n\n## 基本使用\n\n### 1. 启动识别服务\n部署 Web 服务以检测音频中是否包含命令词（默认端口 16888）：\n```bash\npython .\n```\n若需指定配置文件：\n```bash\npython . --config=\u003Cfile_name>\n```\n*注：若服务器在防火墙后，可通过 SSH 隧道进行端口转发。*\n\n### 2. 运行语音演示\n启动麦克风实时检测演示（需连接麦克风和扬声器）：\n```bash\npython utils\u002Fspeech_demo.py\n```\n若远程连接树莓派等设备，可指定服务端点：\n```bash\npython utils\u002Fspeech_demo.py --server-endpoint=http:\u002F\u002F[设备 IP]:16888\n```\n\n### 3. 
训练与评估模型\n使用 Speech Commands 数据集格式进行训练或评估：\n```bash\npython -m utils.train --type train\n```\n\n**推荐参数（残差模型示例）：**\n以下命令用于训练识别 \"yes\", \"no\", \"up\" 等 12 个关键词的残差网络模型：\n```bash\npython -m utils.train --wanted_words yes no up down left right on off stop go --dev_every 1 --n_labels 12 --n_epochs 26 --weight_decay 0.00001 --lr 0.1 0.01 0.001 --schedule 3000 6000 --model res{8,15,26}[-narrow]\n```\n\n### 4. 录制音频数据\n录制序列音频并保存为数据集格式：\n```bash\npython -m utils.record\n```\n- **操作说明**：按 `Enter` 开始录制，按 `↑` 撤销，按 `q` 结束。静音 1 秒后自动停止。","一家初创团队正在为资源受限的树莓派开发一款离线语音控制智能家居中枢，需要识别“开灯”、“关闭”等特定指令。\n\n### 没有 honk 时\n- **模型移植困难**：团队难以将 Google TensorFlow 版本的关键词检测模型高效转换为 PyTorch 架构，导致无法利用团队熟悉的深度学习栈进行迭代。\n- **端侧部署受阻**：在树莓派等低功耗设备上运行复杂的语音识别库（如 librosa）依赖冲突频发，环境配置耗时数天仍无法跑通 Demo。\n- **自定义训练缺失**：缺乏针对小 footprint（轻量级）场景的预训练残差网络模型，从头训练“嘿，Siri\"类自定义唤醒词需要海量数据和算力支持。\n- **实时响应延迟**：自行搭建的简易音频处理流程未针对嵌入式优化，导致语音指令识别延迟高，用户体验卡顿。\n\n### 使用 honk 后\n- **无缝架构迁移**：直接调用 honk 提供的 PyTorch 复现版本，完美兼容现有代码库，无需手动重写卷积神经网络结构。\n- **快速边缘落地**：利用官方整理的树莓派部署指南和预编译依赖，迅速在 Raspbian 系统上跑通服务，大幅缩短环境调试周期。\n- **灵活定制唤醒词**：基于内置的深度残差学习模型，仅需少量自定义音频数据即可微调出高精度的专属命令触发器。\n- **流畅交互体验**：部署优化的 ONNX 或 PyTorch 后端模型，在低算力设备上实现了低延迟、高准确率的实时语音指令响应。\n\nhonk 让开发者能够以极低的成本，将谷歌级别的关键词检测能力快速移植到资源受限的边缘设备上。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcastorini_honk_0b9b1d0f.png","castorini","Castorini","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fcastorini_706a9662.jpg","Jimmy Lin's research group at the University of Waterloo",null,"castorini.io","https:\u002F\u002Fgithub.com\u002Fcastorini",[22,26,30],{"name":23,"color":24,"percentage":25},"Python","#3572A5",58.3,{"name":27,"color":28,"percentage":29},"Jupyter Notebook","#DA5B0B",41.6,{"name":31,"color":32,"percentage":33},"Shell","#89e051",0.1,525,122,"2026-03-05T14:55:29","MIT",4,"Linux, macOS","可选。支持 CUDA 加速，可通过配置关闭（--no_cuda）。未指定具体型号或显存要求。","未说明",{"notes":43,"python":44,"dependencies":45},"Windows 用户无法轻松运行演示程序。在树莓派上安装 librosa 和 numba 非常困难，可能需要手动修改源码移除 numba 依赖。若使用 ONNX 模型需额外安装 onnx 和 
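onnx_caffe2 (not in requirements.txt). Mac OS X ships with GLUT preinstalled, but installing pyaudio may require manually resolving portaudio header file issues.","2.7 (Raspberry Pi example), 3.x (general; no specific minor version given)",[46,47,48,49,50,51,52,53,54],"pytorch","librosa","pyaudio","numpy","scipy","scikit-learn","protobuf","onnx","onnx_caffe2 (if using the Caffe2 backend)",[56],"Audio",2,"ready","2026-03-27T02:49:30.150509","2026-04-08T05:30:04.813889",[62,67,72,77,82,87],{"id":63,"question_zh":64,"answer_zh":65,"source_url":66},23848,"How do I resolve dependency version incompatibilities when running the code?","Pin specific library versions in the requirements file. Based on the configuration provided by the maintainers, the suggested versions are: librosa (0.6.2), pytube (9.5.2, or downgrade to 9.5.6 to resolve a specific error), requests (2.21.0), inflect (2.1.0), pocketsphinx (0.1.15), google-api-python-client (1.7.11), pydub (0.23.1). If pytube raises errors, downgrading from 9.6.0 to 9.5.6 usually resolves them.","https:\u002F\u002Fgithub.com\u002Fcastorini\u002Fhonk\u002Fissues\u002F104",{"id":68,"question_zh":69,"answer_zh":70,"source_url":71},23849,"How does the demo application process the audio input stream behind the scenes?","The audio input is split into segments for processing. The demo uses compression and base64 encoding as a quick workaround (WebSocket is recommended instead), but the key requirement is that the sample rate must be 16 kHz. If your audio is 44.1 kHz, downsample it first. As for the definition of stride: no overlap means that `(len(array) - window_size) \u002F\u002F stride_size` is less than 2.","https:\u002F\u002Fgithub.com\u002Fcastorini\u002Fhonk\u002Fissues\u002F47",{"id":73,"question_zh":74,"answer_zh":75,"source_url":76},23850,"How can I train the model on audio files longer than 1 second (for example, 3 seconds)?","By default, Honkling is neither tested on nor supports audio longer than 1 second. To support longer audio, modify the configuration files in the honkling project: update lines 9 and 186 of `common\u002Fconfig.js` and line 12 of `common\u002FofflineAudioProcessor.js` to match the new audio length.","https:\u002F\u002Fgithub.com\u002Fcastorini\u002Fhonk\u002Fissues\u002F98",{"id":78,"question_zh":79,"answer_zh":80,"source_url":81},23851,"Why is the height of the model's input features set to 101 rather than the standard left and right frame counts (e.g., 23 left, 8 right)? How should frames that only partially contain a command be labeled?","An input height of 101 corresponds to roughly 1 second of audio (10 ms frame shift). For labeling, one heuristic is to assign the clip's label to consecutive non-silent frames (detected with an RMS threshold). The label for the whole input is usually taken from the center frame (the \"now\" frame, at index 0). The context is laid out roughly as: past frames (-23 to -1), current frame (0), future frames (1 to 8).","https:\u002F\u002Fgithub.com\u002Fcastorini\u002Fhonk\u002Fissues\u002F58",{"id":83,"question_zh":84,"answer_zh":85,"source_url":86},23852,"How does Honk relate to other related projects such as Anserini and Castor?","Honk
belongs to a research group with two other related projects: Anserini (whose mascot animal is the goose) and Castor (whose mascot animal is the beaver). The two systems are designed to work together, with Castor consuming data provided by Anserini. The design was inspired by an image of a beaver and a goose cooperating.","https:\u002F\u002Fgithub.com\u002Fcastorini\u002Fhonk\u002Fissues\u002F48",{"id":88,"question_zh":89,"answer_zh":90,"source_url":81},23853,"Beyond the current implementation, what other keyword spotting approaches are worth looking at?","For a completely different approach, see keyword spotting schemes based on CTC (Connectionist Temporal Classification) and transducers. Relevant papers include arXiv:1709.03665 and arXiv:1710.09617. You could also explore Sphinx as an alternative backend for command triggering and compare its power consumption against the CNN models on a Raspberry Pi.",[],[93,109,118,126,134,142],{"id":94,"name":95,"github_repo":96,"description_zh":97,"stars":98,"difficulty_score":57,"last_commit_at":99,"category_tags":100,"status":58},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners is a structured introductory machine learning curriculum from Microsoft, designed to help complete beginners learn classical machine learning. The curriculum spans 12 weeks, with 26 concise lessons and 52 accompanying quizzes, and covers the full path from basic concepts to practical applications, addressing the common problem that beginners facing a large body of knowledge lack a starting point and structured guidance.\n\nDevelopers looking to change fields, researchers who need more algorithmic background, and hobbyists curious about artificial intelligence can all benefit. The lessons provide clear explanations of theory and emphasize hands-on practice, so learners build solid skills step by step. A distinctive feature is its multilingual support: an automated mechanism provides versions in more than 50 languages, including Simplified Chinese, which lowers the barrier for learners worldwide. The project is also developed as an open-source collaboration, with an active community and continually updated content. If you are looking for a clear, friendly, and professional route into machine learning, ML-For-Beginners is a good place to start.",85013,"2026-04-06T11:09:19",[101,102,103,104,105,106,107,108,56],"Image","Data tools","Video","Plugin","Agent","Other","Language model","Development framework",{"id":110,"name":111,"github_repo":112,"description_zh":113,"stars":114,"difficulty_score":115,"last_commit_at":116,"category_tags":117,"status":58},4128,"GPT-SoVITS","RVC-Boss\u002FGPT-SoVITS","GPT-SoVITS is a powerful open-source speech synthesis and voice cloning tool that lets users train high-quality personalized voice models from very small amounts of audio. It addresses the reliance of traditional speech synthesis on large volumes of recordings, with the high barrier and cost that entails, by enabling fast “zero-shot” and “few-shot” modeling: 5 seconds of reference audio is enough to generate speech immediately, and 1 minute of data is enough for fine-tuning, yielding highly realistic output that closely matches the target voice.\n\nThe tool is especially suited to content creators, independent developers, researchers, and general users who want to voice characters. Its built-in WebUI integrates helper features such as vocal and accompaniment separation, automatic dataset slicing, Chinese speech recognition, and text labeling, which lowers the technical barrier to data preparation and model training so that non-specialists can get started.\n\nOn the technical side, GPT-SoVITS supports cross-lingual synthesis across Chinese, English, Japanese, Korean, Cantonese, and other languages, and offers fast inference, reaching real-time or faster-than-real-time generation on mainstream GPUs. Whether you need to produce video voice-overs quickly or research multilingual voice interaction, GPT-SoVITS
delivers professional-grade speech synthesis at very low data cost.",56375,3,"2026-04-05T22:15:46",[56],{"id":119,"name":120,"github_repo":121,"description_zh":122,"stars":123,"difficulty_score":115,"last_commit_at":124,"category_tags":125,"status":58},2863,"TTS","coqui-ai\u002FTTS","🐸TTS is a powerful open-source deep learning library for text-to-speech that converts text into natural, realistic human speech. It addresses pain points of traditional speech synthesis such as robotic-sounding voices, limited multilingual support, and a high barrier to customization, making high-quality speech generation accessible.\n\n🐸TTS serves developers who want to integrate speech features quickly, researchers exploring cutting-edge algorithms, and data scientists who need custom voices. It ships pretrained models covering more than 1,100 languages so users can get started right away, and provides a complete toolchain for training new models on your own data or fine-tuning existing ones, including voice cloning in a specific style.\n\nOn the technical side, its latest ⓍTTSv2 model supports 16 languages, improves overall performance substantially, and streams output with latency below 200 ms, which improves real-time interaction. It also integrates popular community models such as 🐶Bark and 🐢Tortoise and can call more than a thousand Fairseq models, showing strong compatibility and extensibility. Together with its dataset analysis and curation tools, 🐸TTS has become a trusted speech synthesis solution in both research and production.",44971,"2026-04-03T14:47:02",[56,108,101],{"id":127,"name":128,"github_repo":129,"description_zh":130,"stars":131,"difficulty_score":115,"last_commit_at":132,"category_tags":133,"status":58},2375,"LocalAI","mudler\u002FLocalAI","LocalAI is an open-source local AI engine that lets users run a wide range of AI models on any hardware, including large language models, image generation, speech recognition, and video processing. Its core advantage is removing the high-performance computing barrier: no expensive dedicated GPU is required, and complex AI tasks can be deployed and run on an ordinary CPU or common consumer graphics hardware (such as NVIDIA, AMD, Intel, and Apple Silicon).\n\nFor users concerned about data privacy, LocalAI offers a “privacy-first” solution in which all data processing happens on local infrastructure, with nothing uploaded to the cloud. It is also compatible with mainstream API interfaces such as OpenAI and Anthropic, so developers can migrate existing applications and use local resources in place of cloud services, reducing cost and increasing control.\n\nLocalAI includes support for more than 35 backends (such as llama.cpp, vLLM, and Whisper), integrates advanced features such as autonomous AI agents, tool calling, and retrieval-augmented generation (RAG), and provides multi-user management and access control. Enterprise developers who want to protect sensitive data, researchers running algorithm experiments, and enthusiasts who want to try the latest AI technology on a personal computer can all use LocalAI to",44782,"2026-04-02T22:14:26",[101,56,107,105,108,102,104],{"id":135,"name":136,"github_repo":137,"description_zh":138,"stars":139,"difficulty_score":115,"last_commit_at":140,"category_tags":141,"status":58},3108,"bark","suno-ai\u002Fbark","Bark is an open-source generative audio model from Suno that creates highly realistic multilingual speech, music, background noise, and simple sound effects from text prompts. Unlike traditional speech synthesis tools that can only read text aloud, Bark is built on the Transformer
architecture and can produce nonverbal sounds such as laughter, sighs, and crying in addition to speech, and can even handle complex text with emotional coloring and pauses, which broadens what audio can express.\n\nIt addresses the mechanical, emotionless sound of traditional speech synthesis and its inability to generate non-speech audio, letting creators obtain vivid, natural audio from a simple text description. Content creators who need voice-overs for video, researchers exploring multimodal generation, and developers who want to prototype quickly can all benefit. General users can also try it through the integrated demo page.\n\nOn the technical side, Bark permits commercial use (MIT license), recent updates have significantly improved inference speed, and a version for GPUs with limited VRAM lowers the barrier to use. The community has also built an extensive prompt library to help users steer the model toward specific voice styles. A few lines of Python are enough to turn text into high-quality audio, making Bark a bridge between text and sound.",39067,"2026-04-04T03:33:35",[56],{"id":143,"name":144,"github_repo":145,"description_zh":146,"stars":147,"difficulty_score":148,"last_commit_at":149,"category_tags":150,"status":58},3788,"airi","moeru-ai\u002Fairi","airi is an open-source, locally run AI companion project that aims to bring virtual characters (such as an anime “waifu” or a cyber being) into the user's real world. Its core goal is to replicate and surpass the capabilities of the well-known AI streamer Neuro-sama, giving users an intelligent companion that they fully control and can deploy privately.\n\nairi addresses the demand for AI characters that are highly customizable, capable of emotional interaction, and safe with respect to data privacy. Unlike general-purpose assistants that depend on cloud services, airi runs locally, which protects conversation privacy and lets users define the character's personality and soul. It supports real-time voice chat and can even take part directly in games such as Minecraft and Factorio, moving from conversation alone to shared play.\n\nThe tool suits general users who enjoy virtual characters, hobbyists who want to build a personalized AI companion, and developers studying multimodal interaction. Its distinctive technical features are cross-platform support (Web, macOS, and Windows) and game interaction, so the AI can “play” as well as “talk”. Through its containerized soul design, airi makes it possible for anyone to create their own digital being, making virtual companionship more real and accessible.",37086,1,"2026-04-05T10:54:25",[107,56,105]]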