[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-google--uis-rnn":3,"tool-google--uis-rnn":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",158594,2,"2026-04-16T23:34:05",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":32,"env_os":94,"env_gpu":95,"env_ram":95,"env_deps":96,"category_tags":103,"github_topics":104,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":111,"updated_at":112,"faqs":113,"releases":143},8181,"google\u002Fuis-rnn","uis-rnn","This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.","uis-rnn 是谷歌开源的一款基于“无界交错状态循环神经网络”算法的深度学习库，核心应用于说话人日记（Speaker Diarization）领域。它主要解决的是如何将一段连续的音频序列自动分割成不同片段，并准确聚类识别出每一段是由哪位说话人发出的这一难题。\n\n与传统方法不同，uis-rnn 的独特亮点在于其能够通过学习示例数据，端到端地处理序列分割与聚类任务，特别适用于需要高精度区分多人对话场景的研究与开发。该工具基于 PyTorch 
构建，提供了完整的模型训练、推理及评估接口，并附带了演示脚本和测试用例，方便用户快速上手验证。\n\n需要注意的是，由于依赖谷歌内部基础设施和专有数据，此开源版本不包含论文中使用的完整说话人识别系统（如 d-vector 嵌入模型），且明确声明非谷歌官方正式产品。因此，uis-rnn 最适合具备一定机器学习基础的研究人员和开发者使用，用于算法复现、学术探索或作为构建自定义语音分析系统的核心组件。对于普通用户而言，直接使用封装好的成品软件可能更为便捷，但对于希望深入理解前沿语音分离技术的技术人员来说，这是一个极具参考价值的开源项目。","# UIS-RNN\n[![Python application](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fuis-rnn\u002Fworkflows\u002FPython%20application\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fuis-rnn\u002Factions\u002Fworkflows\u002Fpythonapp.yml)\n[![PyPI Version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fuisrnn.svg)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fuisrnn)\n[![Python Versions](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fuisrnn.svg)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fuisrnn)\n[![Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle_uis-rnn_readme_50b3c3f284a0.png)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fuisrnn)\n[![codecov](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fgoogle\u002Fuis-rnn\u002Fbranch\u002Fmaster\u002Fgraph\u002Fbadge.svg)](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fgoogle\u002Fuis-rnn)\n[![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fapi-documentation-blue.svg)](https:\u002F\u002Fgoogle.github.io\u002Fuis-rnn)\n\n## Overview\n\nThis is the library for the\n*Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN)* algorithm.\nUIS-RNN solves the problem of segmenting and clustering sequential data\nby learning from examples.\n\nThis algorithm was originally proposed in the paper\n[Fully Supervised Speaker Diarization](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04719).\n\nThe work has been introduced by\n[Google AI Blog](https:\u002F\u002Fai.googleblog.com\u002F2018\u002F11\u002Faccurate-online-speaker-diarization.html).\n\n![gif](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle_uis-rnn_readme_260d42f52f6c.gif)\n\n## Disclaimer\n\nThis open source implementation 
is slightly different than the internal one\nwhich we used to produce the results in the\n[paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04719), due to dependencies on\nsome internal libraries.\n\nWe CANNOT share the data, code, or model for the speaker recognition system\n([d-vector embeddings](https:\u002F\u002Fgoogle.github.io\u002Fspeaker-id\u002Fpublications\u002FGE2E\u002F))\nused in the paper, since the speaker recognition system\nheavily depends on Google's internal infrastructure and proprietary data.\n\n**This library is NOT an official Google product.**\n\nWe welcome community contributions ([guidelines](CONTRIBUTING.md))\nto the [`uisrnn\u002Fcontrib`](uisrnn\u002Fcontrib) folder.\nBut we won't be responsible for the correctness of any community contributions.\n\n## Dependencies\n\nThis library depends on:\n\n* python 3.5+\n* numpy 1.15.1\n* pytorch 1.3.0\n* scipy 1.1.0 (for evaluation only)\n\n## Getting Started\n\n[![YouTube](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle_uis-rnn_readme_d835db83e740.png)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=pGkqwRPzx9U)\n\n### Install the package\n\nWithout downloading the repository, you can install the\n[package](https:\u002F\u002Fpypi.org\u002Fproject\u002Fuisrnn\u002F) by:\n\n```\npip3 install uisrnn\n```\n\nor\n\n```\npython3 -m pip install uisrnn\n```\n\n### Run the demo\n\nTo get started, simply run this command:\n\n```bash\npython3 demo.py --train_iteration=1000 -l=0.001\n```\n\nThis will train a UIS-RNN model using `data\u002Ftoy_training_data.npz`,\nthen store the model on disk, perform inference on `data\u002Ftoy_testing_data.npz`,\nprint the inference results, and save the averaged accuracy in a text file.\n\nPS. 
The files under `data\u002F` are manually generated *toy data*,\nfor demonstration purpose only.\nThese data are very simple, so we are supposed to get 100% accuracy on the\ntesting data.\n\n### Run the tests\n\nYou can also verify the correctness of this library by running:\n\n```bash\nbash run_tests.sh\n```\n\nIf you fork this library and make local changes, be sure to use these tests\nas a sanity check.\n\nBesides, these tests are also great examples for learning\nthe APIs, especially `tests\u002Fintegration_test.py`.\n\n## Core APIs\n\n### Glossary\n\n| General Machine Learning | Speaker Diarization    |\n|--------------------------|------------------------|\n| Sequence                 | Utterance              |\n| Observation \u002F Feature    | Embedding \u002F d-vector   |\n| Label \u002F Cluster ID       | Speaker                |\n\n### Arguments\n\nIn your main script, call this function to get the arguments:\n\n```python\nmodel_args, training_args, inference_args = uisrnn.parse_arguments()\n```\n\n### Model construction\n\nAll algorithms are implemented as the `UISRNN` class. First, construct a\n`UISRNN` object by:\n\n```python\nmodel = uisrnn.UISRNN(args)\n```\n\nThe definitions of the args are described in `uisrnn\u002Farguments.py`.\nSee `model_parser`.\n\n### Training\n\nNext, train the model by calling the `fit()` function:\n\n```python\nmodel.fit(train_sequences, train_cluster_ids, args)\n```\n\nThe definitions of the args are described in `uisrnn\u002Farguments.py`.\nSee `training_parser`.\n\nThe `fit()` function accepts two types of input, as described below.\n\n#### Input as list of sequences (recommended)\n\nHere, `train_sequences` is a list of observation sequences.\nEach observation sequence is a 2-dim numpy array of type `float`.\n\n* The first dimension is the length of this sequence. And the length\n  can vary from one sequence to another.\n* The second dimension is the size of each observation. 
This\n  must be consistent among all sequences. For speaker diarization,\n  the observation could be the\n  [d-vector embeddings](https:\u002F\u002Fgoogle.github.io\u002Fspeaker-id\u002Fpublications\u002FGE2E\u002F).\n\n`train_cluster_ids` is also a list, which has the same length as\n`train_sequences`. Each element of `train_cluster_ids` is a 1-dim list or\nnumpy array of strings, containing the ground truth labels for the\ncorresponding sequence in `train_sequences`.\nFor speaker diarization, these labels are the speaker identifiers for each\nobservation.\n\nWhen calling `fit()` in this way, please be very careful with the argument\n`--enforce_cluster_id_uniqueness`.\n\nFor example, assume:\n\n```python\ntrain_cluster_ids = [['a', 'b'], ['a', 'c']]\n```\n\nIf the label `'a'` from the two sequences refers to the same cluster across\nthe entire dataset, then we should have `enforce_cluster_id_uniqueness=False`;\notherwise, if `'a'` is only a local indicator to distinguish from `'b'` in the\n1st sequence, and to distinguish from `'c'` in the 2nd sequence, then we should\nhave `enforce_cluster_id_uniqueness=True`.\n\nAlso, please note that, when calling `fit()` in this way, we are going to\nconcatenate all sequences and all cluster IDs, and delegate to\nthe next section below.\n\n#### Input as single concatenated sequence\n\nHere, `train_sequences` should be a single 2-dim numpy array of type `float`,\nfor the **concatenated** observation sequences.\n\nFor example, if you have *M* training utterances,\nand each utterance is a sequence of *L* embeddings. Each embedding is\na vector of *D* numbers. 
Then the shape of `train_sequences` is *N * D*,\nwhere *N = M * L*.\n\n`train_cluster_ids` is a 1-dim list or numpy array of strings, of length *N*.\nIt is the **concatenated** ground truth labels of all training data.\n\nSince we are concatenating observation sequences, it is important to note that,\nground truth labels in `train_cluster_id` across different sequences are\nsupposed to be **globally unique**.\n\nFor example, if the set of labels in the first\nsequence is `{'A', 'B', 'C'}`, and the set of labels in the second sequence\nis `{'B', 'C', 'D'}`. Then before concatenation, we should rename them to\nsomething like `{'1_A', '1_B', '1_C'}` and `{'2_B', '2_C', '2_D'}`,\nunless `'B'` and `'C'` in the two sequences are meaningfully identical\n(in speaker diarization, this means they are the same speakers across\nutterances). This part will be automatically taken care of by the argument\n`--enforce_cluster_id_uniqueness` for the previous section.\n\nThe reason we concatenate all training sequences is that, we will be resampling\nand *block-wise* shuffling the training data as a **data augmentation**\nprocess, such that we result in a robust model even when there is insufficient\nnumber of training sequences.\n\n#### Training on large datasets\n\nFor large datasets, the data usually could not be loaded into memory at once.\nIn such cases, the `fit()` function needs to be called multiple times.\n\nHere we provide a few guidelines as our suggestions:\n\n1. Do not feed different datasets into different calls of `fit()`. Instead,\n   for each call of `fit()`, the input should cover sequences from\n   different datasets.\n2. For each call to the `fit()` function, make the size of input roughly the\n   same. 
And, don't make the input size too small.\n\n### Prediction\n\nOnce we are done with training, we can run the trained model to perform\ninference on new sequences by calling the `predict()` function:\n\n```python\npredicted_cluster_ids = model.predict(test_sequences, args)\n```\n\nHere `test_sequences` should be a list of 2-dim numpy arrays of type `float`,\ncorresponding to the observation sequences for testing.\n\nThe returned `predicted_cluster_ids` is a list of the same size as\n`test_sequences`. Each element of `predicted_cluster_ids` is a list of integers,\nwith the same length as the corresponding test sequence.\n\nYou can also use a single test sequence for `test_sequences`. Then the returned\n`predicted_cluster_ids` will also be a single list of integers.\n\nThe definitions of the args are described in `uisrnn\u002Farguments.py`.\nSee `inference_parser`.\n\n## Citations\n\nOur paper is cited as:\n\n```\n@inproceedings{zhang2019fully,\n  title={Fully supervised speaker diarization},\n  author={Zhang, Aonan and Wang, Quan and Zhu, Zhenyao and Paisley, John and Wang, Chong},\n  booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP)},\n  pages={6301--6305},\n  year={2019},\n  organization={IEEE}\n}\n```\n\n## References\n\n### Baseline diarization system\n\nTo learn more about our baseline diarization system based on\n*unsupervised clustering* algorithms, check out\n[this site](https:\u002F\u002Fgoogle.github.io\u002Fspeaker-id\u002Fpublications\u002FLstmDiarization\u002F).\n\nA Python re-implementation of the *spectral clustering* algorithm used in this\npaper is available [here](https:\u002F\u002Fgithub.com\u002Fwq2012\u002FSpectralCluster).\n\nThe ground truth labels for the\n[NIST SRE 2000](https:\u002F\u002Fcatalog.ldc.upenn.edu\u002FLDC2001S97)\ndataset (Disk6 and Disk8) can be 
found\n[here](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fspeaker-id\u002Ftree\u002Fmaster\u002Fpublications\u002FLstmDiarization\u002Fevaluation\u002FNIST_SRE2000).\n\nFor more public resources on speaker diarization, check out [awesome-diarization](https:\u002F\u002Fgithub.com\u002Fwq2012\u002Fawesome-diarization).\n\n### Speaker recognizer\u002Fencoder\n\nTo learn more about our speaker embedding system, check out\n[this site](https:\u002F\u002Fgoogle.github.io\u002Fspeaker-id\u002Fpublications\u002FGE2E\u002F).\n\nWe are aware of several third-party implementations of this work:\n\n* [Resemblyzer: PyTorch implementation by resemble-ai](https:\u002F\u002Fgithub.com\u002Fresemble-ai\u002FResemblyzer)\n* [TensorFlow implementation by Janghyun1230](https:\u002F\u002Fgithub.com\u002FJanghyun1230\u002FSpeaker_Verification)\n* [PyTorch implementation by HarryVolek](https:\u002F\u002Fgithub.com\u002FHarryVolek\u002FPyTorch_Speaker_Verification) - with UIS-RNN integration\n* [PyTorch implementation as part of SV2TTS](https:\u002F\u002Fgithub.com\u002FCorentinJ\u002FReal-Time-Voice-Cloning)\n\nPlease use your own judgement to decide whether you want to use these\nimplementations.\n\n**We are NOT responsible for the correctness of any third-party implementations.**\n\n## Variants\n\nHere we list the repositories that are based on UIS-RNN, but integrate\nother technologies or add some improvements.\n\n| Link | Description |\n| ---- | ----------- |\n| [taylorlu\u002FSpeaker-Diarization](https:\u002F\u002Fgithub.com\u002Ftaylorlu\u002FSpeaker-Diarization) ![GitHub stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftaylorlu\u002FSpeaker-Diarization?style=social) | Speaker diarization using UIS-RNN and GhostVLAD. An easier way to support openset speakers. 
|\n| [DonkeyShot21\u002Fuis-rnn-sml](https:\u002F\u002Fgithub.com\u002FDonkeyShot21\u002Fuis-rnn-sml) ![GitHub stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDonkeyShot21\u002Fuis-rnn-sml?style=social) | A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data. |\n","# UIS-RNN\n[![Python应用](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fuis-rnn\u002Fworkflows\u002FPython%20application\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fuis-rnn\u002Factions\u002Fworkflows\u002Fpythonapp.yml)\n[![PyPI版本](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fuisrnn.svg)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fuisrnn)\n[![Python版本](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fuisrnn.svg)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fuisrnn)\n[![下载量](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle_uis-rnn_readme_50b3c3f284a0.png)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fuisrnn)\n[![codecov](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fgoogle\u002Fuis-rnn\u002Fbranch\u002Fmaster\u002Fgraph\u002Fbadge.svg)](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fgoogle\u002Fuis-rnn)\n[![文档](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fapi-文档-蓝色.svg)](https:\u002F\u002Fgoogle.github.io\u002Fuis-rnn)\n\n## 概述\n\n这是用于*无界交错状态循环神经网络（UIS-RNN）*算法的库。UIS-RNN通过从示例中学习，解决了对序列数据进行分割和聚类的问题。\n\n该算法最初在论文《完全监督的说话人日志》中提出（https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04719）。\n\n这项工作也曾在[Google AI 博客](https:\u002F\u002Fai.googleblog.com\u002F2018\u002F11\u002Faccurate-online-speaker-diarization.html)上介绍过。\n\n![gif](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle_uis-rnn_readme_260d42f52f6c.gif)\n\n## 免责声明\n\n由于依赖于一些内部库，此开源实现与我们用于生成论文（https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04719）结果的内部实现略有不同。\n\n我们无法分享论文中使用的说话人识别系统（[d-vector 嵌入](https:\u002F\u002Fgoogle.github.io\u002Fspeaker-id\u002Fpublications\u002FGE2E\u002F)）的数据、代码或模型，因为该说话人识别系统高度依赖于 Google 
的内部基础设施和专有数据。\n\n**本库并非 Google 官方产品。**\n\n我们欢迎社区对 [`uisrnn\u002Fcontrib`](uisrnn\u002Fcontrib) 文件夹的贡献（[贡献指南](CONTRIBUTING.md)）。但我们不对任何社区贡献的正确性负责。\n\n## 依赖项\n\n本库依赖于以下内容：\n\n* Python 3.5+\n* NumPy 1.15.1\n* PyTorch 1.3.0\n* SciPy 1.1.0（仅用于评估）\n\n## 快速入门\n\n[![YouTube](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle_uis-rnn_readme_d835db83e740.png)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=pGkqwRPzx9U)\n\n### 安装包\n\n无需下载仓库，您可以通过以下方式安装 [package](https:\u002F\u002Fpypi.org\u002Fproject\u002Fuisrnn\u002F)：\n\n```\npip3 install uisrnn\n```\n\n或者\n\n```\npython3 -m pip install uisrnn\n```\n\n### 运行演示\n\n要开始使用，只需运行以下命令：\n\n```bash\npython3 demo.py --train_iteration=1000 -l=0.001\n```\n\n这将使用 `data\u002Ftoy_training_data.npz` 训练一个 UIS-RNN 模型，然后将其保存到磁盘，并在 `data\u002Ftoy_testing_data.npz` 上进行推理，打印推理结果，并将平均准确率保存到文本文件中。\n\n注：`data\u002F` 目录下的文件是手动生成的 *玩具数据*，仅用于演示目的。这些数据非常简单，因此我们应该能够在测试数据上获得 100% 的准确率。\n\n### 运行测试\n\n您还可以通过运行以下命令来验证本库的正确性：\n\n```bash\nbash run_tests.sh\n```\n\n如果您分叉了本库并进行了本地修改，请务必使用这些测试作为基本检查。\n\n此外，这些测试也是学习 API 的绝佳示例，尤其是 `tests\u002Fintegration_test.py`。\n\n## 核心 API\n\n### 术语表\n\n| 通用机器学习 | 说话人日志 |\n|----------------|------------|\n| 序列           | 发言段     |\n| 观测值 \u002F 特征  | 嵌入 \u002F d-vector |\n| 标签 \u002F 聚类 ID | 说话人     |\n\n### 参数\n\n在您的主脚本中，调用以下函数以获取参数：\n\n```python\nmodel_args, training_args, inference_args = uisrnn.parse_arguments()\n```\n\n### 模型构建\n\n所有算法都实现为 `UISRNN` 类。首先，通过以下方式构造一个 `UISRNN` 对象：\n\n```python\nmodel = uisrnn.UISRNN(args)\n```\n\n参数的定义在 `uisrnn\u002Farguments.py` 中描述。请参阅 `model_parser`。\n\n### 训练\n\n接下来，通过调用 `fit()` 函数来训练模型：\n\n```python\nmodel.fit(train_sequences, train_cluster_ids, args)\n```\n\n`args` 的定义在 `uisrnn\u002Farguments.py` 中有说明，请参阅 `training_parser`。\n\n`fit()` 函数接受两种类型的输入，如下所述。\n\n#### 输入为序列列表（推荐）\n\n在此情况下，`train_sequences` 是一个观测序列的列表。每个观测序列是一个 2 维的 `float` 类型 NumPy 数组。\n\n* 第一维是该序列的长度，不同序列的长度可以不同。\n* 第二维是每个观测的维度，所有序列必须保持一致。对于说话人日志任务，观测可以是 [d-vector 
嵌入](https:\u002F\u002Fgoogle.github.io\u002Fspeaker-id\u002Fpublications\u002FGE2E\u002F)。\n\n`train_cluster_ids` 同样是一个列表，其长度与 `train_sequences` 相同。`train_cluster_ids` 中的每个元素是一个 1 维字符串列表或 NumPy 数组，包含对应于 `train_sequences` 中相应序列的真实标签。对于说话人日志任务，这些标签是每个观测对应的说话人标识符。\n\n以这种方式调用 `fit()` 时，请务必小心使用 `--enforce_cluster_id_uniqueness` 参数。\n\n例如，假设：\n\n```python\ntrain_cluster_ids = [['a', 'b'], ['a', 'c']]\n```\n\n如果两个序列中的标签 `'a'` 指的是整个数据集中同一个聚类，那么应设置 `enforce_cluster_id_uniqueness=False`；反之，如果 `'a'` 只是在第一个序列中用来区分 `'b'`，在第二个序列中用来区分 `'c'` 的局部标记，则应设置 `enforce_cluster_id_uniqueness=True`。\n\n此外，请注意，以这种方式调用 `fit()` 时，我们会将所有序列和所有聚类标签拼接在一起，并交由下一部分处理。\n\n#### 输入为单个拼接后的序列\n\n在此情况下，`train_sequences` 应该是一个 2 维的 `float` 类型 NumPy 数组，用于表示**拼接后的**观测序列。\n\n例如，如果你有 *M* 个训练语音片段，每个片段包含 *L* 个嵌入向量，每个嵌入向量由 *D* 个数字组成，则 `train_sequences` 的形状为 *N × D*，其中 *N = M × L*。\n\n`train_cluster_ids` 是一个 1 维字符串列表或 NumPy 数组，长度为 *N*，它是所有训练数据的**拼接后**真实标签。\n\n由于我们正在拼接观测序列，需要注意的是，`train_cluster_ids` 中不同序列的真实标签应当是**全局唯一**的。\n\n例如，如果第一个序列的标签集是 `{'A', 'B', 'C'}`，第二个序列的标签集是 `{'B', 'C', 'D'}`，那么在拼接之前，我们应该将它们重命名为类似 `{'1_A', '1_B', '1_C'}` 和 `{'2_B', '2_C', '2_D'}` 的形式，除非这两个序列中的 `'B'` 和 `'C'` 在语义上完全相同（在说话人日志任务中，这意味着它们代表的是跨语音片段的同一说话人）。这一部分通常会由前一部分的 `--enforce_cluster_id_uniqueness` 参数自动处理。\n\n我们之所以要将所有训练序列拼接起来，是因为我们将对训练数据进行重采样和**分块式**打乱，作为一种**数据增强**手段，从而即使在训练序列数量不足的情况下，也能得到一个鲁棒的模型。\n\n#### 大规模数据集上的训练\n\n对于大规模数据集，数据通常无法一次性全部加载到内存中。在这种情况下，需要多次调用 `fit()` 函数。\n\n以下是我们的一些建议：\n\n1. 不要将不同的数据集分别传入不同的 `fit()` 调用中。相反，每次调用 `fit()` 时，输入应涵盖来自不同数据集的序列。\n2. 
每次调用 `fit()` 函数时，尽量使输入大小大致相同，但也不要过小。\n\n### 预测\n\n训练完成后，我们可以使用训练好的模型对新序列进行推理，方法是调用 `predict()` 函数：\n\n```python\npredicted_cluster_ids = model.predict(test_sequences, args)\n```\n\n这里，`test_sequences` 应该是一个 2 维 `float` 类型 NumPy 数组的列表，对应于用于测试的观测序列。\n\n返回的 `predicted_cluster_ids` 是一个与 `test_sequences` 大小相同的列表。`predicted_cluster_ids` 中的每个元素是一个整数列表，其长度与相应的测试序列相同。\n\n你也可以只使用一个测试序列作为 `test_sequences`，此时返回的 `predicted_cluster_ids` 将是一个单独的整数列表。\n\n`args` 的定义同样在 `uisrnn\u002Farguments.py` 中说明，请参阅 `inference_parser`。\n\n## 引用\n\n我们的论文引用如下：\n\n```\n@inproceedings{zhang2019fully,\n  title={Fully supervised speaker diarization},\n  author={Zhang, Aonan and Wang, Quan and Zhu, Zhenyao and Paisley, John and Wang, Chong},\n  booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP)},\n  pages={6301--6305},\n  year={2019},\n  organization={IEEE}\n}\n```\n\n## 参考文献\n\n### 基线日志系统\n\n如需了解更多基于*无监督聚类*算法的基线日志系统，请访问[此网站](https:\u002F\u002Fgoogle.github.io\u002Fspeaker-id\u002Fpublications\u002FLstmDiarization\u002F)。\n\n本文中使用的*谱聚类*算法的 Python 重新实现版本可在[此处](https:\u002F\u002Fgithub.com\u002Fwq2012\u002FSpectralCluster)找到。\n\n[NIST SRE 2000](https:\u002F\u002Fcatalog.ldc.upenn.edu\u002FLDC2001S97) 数据集（Disk6 和 Disk8）的真实标签可在此处找到：[https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fspeaker-id\u002Ftree\u002Fmaster\u002Fpublications\u002FLstmDiarization\u002Fevaluation\u002FNIST_SRE2000](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fspeaker-id\u002Ftree\u002Fmaster\u002Fpublications\u002FLstmDiarization\u002Fevaluation\u002FNIST_SRE2000)。\n\n更多关于说话人日志的公开资源，请参阅 [awesome-diarization](https:\u002F\u002Fgithub.com\u002Fwq2012\u002Fawesome-diarization)。\n\n### 说话人识别器\u002F编码器\n\n如需了解更多关于我们说话人嵌入系统的相关信息，请访问[此网站](https:\u002F\u002Fgoogle.github.io\u002Fspeaker-id\u002Fpublications\u002FGE2E\u002F)。\n\n我们了解到有几款第三方实现了这项工作：\n\n* [Resemblyzer：resemble-ai 提供的 PyTorch 实现](https:\u002F\u002Fgithub.com\u002Fresemble-ai\u002FResemblyzer)\n* [Janghyun1230 提供的 TensorFlow 
实现](https:\u002F\u002Fgithub.com\u002FJanghyun1230\u002FSpeaker_Verification)\n* [HarryVolek 提供的 PyTorch 实现](https:\u002F\u002Fgithub.com\u002FHarryVolek\u002FPyTorch_Speaker_Verification)，并集成了 UIS-RNN\n* [SV2TTS 项目中的 PyTorch 实现](https:\u002F\u002Fgithub.com\u002FCorentinJ\u002FReal-Time-Voice-Cloning)\n\n请根据自己的判断决定是否使用这些实现。\n\n**我们不对任何第三方实现的正确性负责。**\n\n## 变体\n\n这里列出了基于 UIS-RNN，但集成了其他技术或进行了一些改进的仓库。\n\n| 链接 | 描述 |\n| ---- | ----------- |\n| [taylorlu\u002FSpeaker-Diarization](https:\u002F\u002Fgithub.com\u002Ftaylorlu\u002FSpeaker-Diarization) ![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftaylorlu\u002FSpeaker-Diarization?style=social) | 使用 UIS-RNN 和 GhostVLAD 进行说话人日志分割。一种更简便的方式来支持开放集说话人识别。 |\n| [DonkeyShot21\u002Fuis-rnn-sml](https:\u002F\u002Fgithub.com\u002FDonkeyShot21\u002Fuis-rnn-sml) ![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDonkeyShot21\u002Fuis-rnn-sml?style=social) | UIS-RNN 的一个变体，用于论文《面向多领域数据的带样本均值损失的监督在线日志分割》。 |","# UIS-RNN 快速上手指南\n\nUIS-RNN（Unbounded Interleaved-State Recurrent Neural Network）是一种用于序列数据分割和聚类的算法，最初应用于全监督说话人日记化（Speaker Diarization）任务。本指南将帮助你快速在本地环境中部署并运行该工具。\n\n## 环境准备\n\n在开始之前，请确保你的开发环境满足以下要求：\n\n*   **操作系统**：Linux, macOS 或 Windows\n*   **Python 版本**：3.5 及以上\n*   **核心依赖**：\n    *   `numpy` >= 1.15.1\n    *   `pytorch` >= 1.3.0\n    *   `scipy` >= 1.1.0 (仅用于评估环节)\n\n> **提示**：国内开发者建议使用国内镜像源安装依赖，以提升下载速度。例如使用清华源或阿里源。\n\n## 安装步骤\n\n你可以通过 PyPI 直接安装预编译包，无需克隆仓库。\n\n### 方式一：使用默认源安装\n```bash\npip3 install uisrnn\n```\n\n### 方式二：使用国内镜像源加速安装（推荐）\n```bash\npython3 -m pip install uisrnn -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 基本使用\n\n安装完成后，你可以直接运行官方提供的演示脚本来验证安装并体验完整流程（包含训练、模型保存、推理及结果评估）。\n\n### 运行演示 Demo\n\n在项目根目录或任意位置执行以下命令：\n\n```bash\npython3 demo.py --train_iteration=1000 -l=0.001\n```\n\n**脚本执行流程说明：**\n1.  **数据加载**：自动加载 `data\u002Ftoy_training_data.npz` 中的玩具数据（Toy Data）。\n2.  **模型训练**：使用指定参数训练 UIS-RNN 模型。\n3.  **模型保存**：将训练好的模型保存到磁盘。\n4.  
**推理预测**：对 `data\u002Ftoy_testing_data.npz` 进行测试数据推理。\n5.  **结果输出**：打印推理结果，并将平均准确率保存为文本文件。\n\n> **注意**：Demo 使用的数据是人工生成的简单玩具数据，理论上在测试集上应达到 100% 的准确率。此脚本主要用于展示 API 调用流程和验证环境正确性。\n\n### 代码集成示例\n\n若需在自有项目中集成，核心调用逻辑如下：\n\n```python\nimport uisrnn\n\n# 1. 解析参数\nmodel_args, training_args, inference_args = uisrnn.parse_arguments()\n\n# 2. 构建模型\nmodel = uisrnn.UISRNN(model_args)\n\n# 3. 训练模型\n# train_sequences: 观察序列列表 (List of 2-dim numpy arrays)\n# train_cluster_ids: 对应的真实标签列表 (List of 1-dim lists\u002Farrays)\nmodel.fit(train_sequences, train_cluster_ids, training_args)\n\n# 4. 进行预测\n# test_sequences: 测试观察序列列表\npredicted_cluster_ids = model.predict(test_sequences, inference_args)\n```\n\n对于大规模数据集无法一次性载入内存的情况，可以分批次多次调用 `model.fit()` 方法进行增量训练。","某智能会议助手团队正在处理海量多人在线会议录音，急需将连续的音频流精准切割并区分出每位发言者，以生成结构化的会议纪要。\n\n### 没有 uis-rnn 时\n- **无法应对动态人数**：传统聚类算法需预先设定发言者数量，面对会议中随时加入或退出的参与者，系统往往直接崩溃或归类错误。\n- **长会议精度骤降**：随着会议时长增加，基于静态规则的分割方法误差累积严重，导致同一个人的发言被误判为多个不同角色。\n- **开发门槛极高**：团队需从零复现复杂的时序建模论文，花费数周调试递归神经网络的状态转移逻辑，难以快速落地业务。\n- **缺乏在线处理能力**：现有方案多依赖离线全量数据，无法支持实时会议中的即时发言人追踪与标注。\n\n### 使用 uis-rnn 后\n- **自适应无界状态**：uis-rnn 独有的“无界交错状态”机制，无需预设人数即可自动识别新发言者，完美适配人员流动的开放场景。\n- **长序列稳定输出**：通过学习序列数据的内在规律，即使在长达数小时的会议中，也能保持极高的说话人分离准确率，避免身份混淆。\n- **开箱即用加速研发**：直接调用封装好的 PyTorch 接口和训练脚本，团队在两天内即可完成模型集成，将精力聚焦于上层应用优化。\n- **支持实时流式推断**：算法天然适合在线处理，能够边接收音频边输出带说话人标签的文本流，满足实时字幕与纪要生成的需求。\n\nuis-rnn 通过解决开放场景下的时序聚类难题，让高保真的全自动说话人分离技术从实验室论文真正走向了生产环境。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle_uis-rnn_7a325afc.png","google","Google","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fgoogle_c4bedcda.png","Google ❤️ Open 
Source",null,"opensource@google.com","GoogleOSS","https:\u002F\u002Fopensource.google\u002F","https:\u002F\u002Fgithub.com\u002Fgoogle",[82,86],{"name":83,"color":84,"percentage":85},"Python","#3572A5",98.6,{"name":87,"color":88,"percentage":89},"Shell","#89e051",1.4,1588,320,"2026-04-05T01:21:44","Apache-2.0","","未说明",{"notes":97,"python":98,"dependencies":99},"该库主要用于序列数据的分割和聚类（如说话人日志）。README 中明确指出此开源版本与论文中使用的内部版本略有不同，且不提供论文中使用的说话人识别系统（d-vector embeddings）的数据、代码或模型。Scipy 仅用于评估阶段。数据目录下的文件仅为演示用的手动生成玩具数据。","3.5+",[100,101,102],"numpy==1.15.1","pytorch==1.3.0","scipy==1.1.0",[14],[105,64,106,107,108,109,110],"speaker-diarization","speaker-recognition","supervised-learning","clustering","supervised-clustering","machine-learning","2026-03-27T02:49:30.150509","2026-04-17T09:53:25.202805",[114,119,124,129,134,138],{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},36595,"加载保存的模型后进行预测，为什么返回的是从序列长度开始的数字序列而不是预期的标签？","这是一个已修复的 Bug。维护者已提交修复代码（commit: a619126ce64b3209f6d3d22cd1ca4619a537b5db）。如果您遇到此问题，请确保更新到包含该修复的最新版本代码。","https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fuis-rnn\u002Fissues\u002F13",{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},36596,"如何加速预测过程？是否支持批量预测或多进程处理？","目前文档指出一次只能预测一个序列，但可以通过多进程加速。用户可以使用 `uisrnn.parallel_predict()` 函数。注意：在 GPU 上运行通常比 CPU 快 3-4 倍。另外需注意，如果在循环中频繁打开和关闭 `Pool` 可能会导致内存泄漏，建议复用进程池。","https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fuis-rnn\u002Fissues\u002F32",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},36597,"运行集成测试（run_test.sh 或 demo.py）时，为什么准确率有时是 1.0，有时却只有 0.8 或 0.9？","这通常是因为训练步数较少导致网络未完全收敛，属于正常现象。测试的目的是验证代码正确性而非追求极致精度。解决方法：1. 在 `integration_test.py` 中将 `training_args.train_iteration` 从 200 增加到更大值（如 300）；2. 
更改 `setUp()` 函数中的随机种子。实际应用中通常训练更多步数或并行训练多个网络取最优。","https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fuis-rnn\u002Fissues\u002F14",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},36598,"训练用的输入 d-vectors 是否需要 L2 归一化？其数值可以是负数吗？","是的，示例数据（toy_training_data.npz）中的 d-vectors 是 L2 归一化的。关于数值范围，它们可以是负数，因为模型最后一层 256 维线性层之后没有使用 ReLU 激活函数。","https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fuis-rnn\u002Fissues\u002F80",{"id":135,"question_zh":136,"answer_zh":137,"source_url":133},36599,"代码中提到的“segments”（片段）具体指什么？长度必须是 400ms 吗？","“Segments”指的是非重叠的音频片段。虽然论文和代码中默认最大长度为 400ms，但这并不是强制固定的。400ms 是在开发\u002F评估数据集上表现较好的经验值，您可以根据实际需求调整该长度。",{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},36600,"transition_bias 参数的计算逻辑是否正确？为什么连接序列后计算会导致偏差？","用户指出原代码在连接序列后计算 transition_bias 会将不同序列间的切换误计为说话人变化，导致估算偏差。维护者已确认该问题并合并了相关的修复贡献。建议检查代码是否已更新，确保计算逻辑改为先分别计算各序列的切换次数再汇总，避免序列间边界带来的误差。","https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fuis-rnn\u002Fissues\u002F55",[]]