[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-segment-any-text--wtpsplit":3,"tool-segment-any-text--wtpsplit":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150037,2,"2026-04-10T23:33:47",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":76,"owner_url":77,"languages":78,"stars":91,"forks":92,"last_commit_at":93,"license":94,"difficulty_score":32,"env_os":95,"env_gpu":96,"env_ram":95,"env_deps":97,"category_tags":102,"github_topics":103,"view_count":10,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":112,"updated_at":113,"faqs":114,"releases":145},3959,"segment-any-text\u002Fwtpsplit","wtpsplit","Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.","wtpsplit 是一款强大的开源文本分割工具，旨在将任意文本精准地拆分为句子或其他语义单元。它有效解决了传统方法在处理多语言、标点缺失或格式混乱文本时容易出错、效率低下的难题，尤其擅长应对复杂的自然语言场景。\n\n这款工具非常适合开发者、数据科学家及 NLP 研究人员使用，无论是构建预处理流水线还是进行多语言学术研究，都能从中获益。wtpsplit 的核心亮点在于其集成了最新的 SaT（Segment Any Text）模型，支持全球 85 种语言，在准确性与计算效率之间取得了卓越的平衡。它不仅提供通用的分割模型，还创新性地支持 LoRA 适配技术，允许用户针对特定领域、语言风格进行微调，从而获得更贴合需求的分割效果。此外，wtpsplit 原生支持 ONNX 加速，结合 GPU 使用时推理速度可提升约 
50%，确保在大规模数据处理任务中依然保持高效流畅。通过简洁的 Python 接口，用户可以轻松将其集成到现有项目中，实现鲁棒且灵活的文本结构化处理。","\u003Ch1 align=\"center\">wtpsplit🪓\u003C\u002Fh1>\n\u003Ch3 align=\"center\">Segment any Text - Robustly, Efficiently, Adaptably⚡\u003C\u002Fh3>\n\nThis repository allows you to segment text into sentences or other semantic units. It implements the models from:\n\n- **SaT** &mdash; [Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678) by Markus Frohmann, Igor Sterner, Benjamin Minixhofer, Ivan Vulić and Markus Schedl (**state-of-the-art, encouraged**).\n- **WtP** &mdash; [Where’s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation](https:\u002F\u002Faclanthology.org\u002F2023.acl-long.398\u002F) by Benjamin Minixhofer, Jonas Pfeiffer and Ivan Vulić (*previous version, maintained for reproducibility*).\n\nThe namesake WtP is maintained for consistency. Our new followup SaT provides robust, efficient and adaptable sentence segmentation across 85 languages at higher performance and less compute cost. Check out the **state-of-the-art** results in 8 distinct corpora and 85 languages demonstrated in our [Segment any Text paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678).\n\n![System Figure](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsegment-any-text_wtpsplit_readme_8cc1c8731d00.png)\n\n## Installation\n\n```bash\npip install wtpsplit\n```\nOr one of the following for ONNX support:\n```bash\npip install wtpsplit[onnx-gpu]\npip install wtpsplit[onnx-cpu]\n```\n\n## Usage\n\n```python\nfrom wtpsplit import SaT\n\nsat = SaT(\"sat-3l\")\n# optionally run on GPU for better performance\n# also supports TPUs via e.g. 
sat.to(\"xla:0\"), in that case pass `pad_last_batch=True` to sat.split\nsat.half().to(\"cuda\")\n\nsat.split(\"This is a test This is another test.\")\n# returns [\"This is a test \", \"This is another test.\"]\n\n# do this instead of calling sat.split on every text individually for much better performance\nsat.split([\"This is a test This is another test.\", \"And some more texts...\"])\n# returns an iterator yielding lists of sentences for every text\n\n# use our '-sm' models for general sentence segmentation tasks\nsat_sm = SaT(\"sat-3l-sm\")\nsat_sm.half().to(\"cuda\") # optional, see above\nsat_sm.split(\"this is a test this is another test\")\n# returns [\"this is a test \", \"this is another test\"]\n\n# use trained lora modules for strong adaptation to language & domain\u002Fstyle\nsat_adapted = SaT(\"sat-3l\", style_or_domain=\"ud\", language=\"en\")\nsat_adapted.half().to(\"cuda\") # optional, see above\nsat_adapted.split(\"This is a test This is another test.\")\n# returns ['This is a test ', 'This is another test']\n```\n\n## ONNX Support\n\n🚀 You can now enable even faster ONNX inference for `sat` and `sat-sm` models! 🚀\n\n```python\nsat = SaT(\"sat-3l-sm\", ort_providers=[\"CUDAExecutionProvider\", \"CPUExecutionProvider\"])\n```\n\n```python\n>>> from wtpsplit import SaT\n>>> texts = [\"This is a sentence. This is another sentence.\"] * 1000\n\n# PyTorch GPU\n>>> model_pytorch = SaT(\"sat-3l-sm\")\n>>> model_pytorch.half().to(\"cuda\");\n>>> %timeit list(model_pytorch.split(texts))\n# 144 ms ± 252 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n# quite fast already, but...\n\n# onnxruntime GPU\n>>> model_ort = SaT(\"sat-3l-sm\", ort_providers=[\"CUDAExecutionProvider\", \"CPUExecutionProvider\"])\n>>> %timeit list(model_ort.split(texts))\n# 94.9 ms ± 165 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n# ...this should be ~50% faster! 
(tested on RTX 3090)\n```\n\nIf you wish to use LoRA in combination with an ONNX model:\n\n- Run `scripts\u002Fexport_to_onnx_sat.py` with `use_lora: True` and an appropriate `output_dir: \u003COUTPUT_DIR>`.\n  - If you have a local LoRA module, use `lora_path`.\n  - If you wish to load a LoRA module from the HuggingFace hub, use `style_or_domain` and `language`.\n- Load the ONNX model with merged LoRA weights:\n  `sat = SaT(\u003COUTPUT_DIR>, ort_providers=[\"CUDAExecutionProvider\", \"CPUExecutionProvider\"])`\n\n## Available Models\n\nIf you need a general sentence segmentation model, use `-sm` models (e.g., `sat-3l-sm`).\nFor speed-sensitive applications, we recommend 3-layer models (`sat-3l` and `sat-3l-sm`). They provide a great tradeoff between speed and performance.\nThe best-performing models are our 12-layer models: `sat-12l` and `sat-12l-sm`.\n\n| Model                                                                        | English Score | Multilingual Score |\n| :--------------------------------------------------------------------------- | ------------: | -----------------: |\n| [sat-1l](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-1l)                        |          88.5 |               84.3 |\n| [sat-1l-sm](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-1l-sm)                  |          88.2 |               87.9 |\n| [sat-3l](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-3l)                        |          93.7 |               89.2 |\n| [sat-3l-lora](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-3l\u002Ftree\u002Fmain\u002Floras)   |          96.7 |               94.8 |\n| [sat-3l-sm](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-3l-sm)                  |          96.5 |               93.5 |\n| [sat-6l](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-6l)                        |          94.1 |               89.7 |\n| 
[sat-6l-sm](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-6l-sm)                  |          96.9 |               95.1 |\n| [sat-9l](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-9l)                        |          94.3 |               90.3 |\n| [sat-12l](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-12l)                      |          94.0 |               90.4 |\n| [sat-12l-lora](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-12l\u002Ftree\u002Fmain\u002Floras) |          97.3 |               95.9 |\n| [sat-12l-sm](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-12l-sm)                |          97.4 |               96.0 |\n\nThe scores are macro-averaged F1 scores across all available datasets for \"English\", and macro-averaged F1 scores across all datasets and languages for \"Multilingual\". Models with the `-lora` suffix are adapted via LoRA; check out the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678) for details.\n\nFor comparison, here are the English scores of some other tools:\n\n| Model                                                    | English Score |\n| :------------------------------------------------------- | ------------: |\n| PySBD                                                    |          69.6 |\n| SpaCy (sentencizer; monolingual)                         |          92.9 |\n| SpaCy (sentencizer; multilingual)                        |          91.5 |\n| Ersatz                                                   |          91.4 |\n| Punkt (`nltk.sent_tokenize`)                           |          92.2 |\n| [WtP (3l)](https:\u002F\u002Fhuggingface.co\u002Fbenjamin\u002Fwtp-canine-s-3l) |          93.9 |\n\nNote that this library also supports the previous [`WtP`](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18893) models.\nYou can use them in essentially the same way as `SaT` models:\n\n```python\nfrom wtpsplit import WtP\n\nwtp = WtP(\"wtp-bert-mini\")\n# similar 
functionality as for SaT models\nwtp.split(\"This is a test This is another test.\")\n```\n\nFor more details on WtP, including how to reproduce its results, see the [WtP doc](.\u002FREADME_WTP.md).\n\n## Paragraph Segmentation\n\nSince SaT models are trained to predict newline probability, they can segment text into paragraphs in addition to sentences.\n\n```python\n# returns a list of paragraphs, each containing a list of sentences\n# adjust the paragraph threshold via the `paragraph_threshold` argument.\nsat.split(text, do_paragraph_segmentation=True)\n```\n\n## (NEW! v2.2+) Length-Constrained Segmentation\n\nControl segment lengths with the `min_length` and `max_length` parameters. This is useful when you need segments within specific size limits (e.g., for embedding models, storage, or downstream processing).\n\n### Basic Usage\n\n```python\nfrom wtpsplit import SaT\n\nsat = SaT(\"sat-3l-sm\")\n\ntext = (\n    \"In the beginning God created the heaven and the earth. \"\n    \"And the earth was without form, and void; and darkness was upon the face of the deep. \"\n    \"And the Spirit of God moved upon the face of the waters. \"\n    \"And God said, Let there be light: and there was light. \"\n    \"And God saw the light, that it was good: and God divided the light from the darkness. \"\n    \"And God called the light Day, and the darkness he called Night. \"\n    \"And the evening and the morning were the first day.\"\n)\n\n# Split with a maximum segment length of 120 characters\nsegments = sat.split(text, max_length=120)\nfor s in segments:\n    print(f\"[{len(s):3d} chars] {s}\")\n# [ 55 chars] In the beginning God created the heaven and the earth. \n# [ 86 chars] And the earth was without form, and void; and darkness was upon the face of the deep. \n# [112 chars] And the Spirit of God moved upon the face of the waters. And God said, Let there be light: and there was light. 
\n# [ 86 chars] And God saw the light, that it was good: and God divided the light from the darkness. \n# [115 chars] And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.\n\nassert \"\".join(segments) == text  # text is perfectly preserved\n\n# Enforce both min and max length\nsat.split(text, min_length=80, max_length=200)\n\n# Use the greedy algorithm for minimally faster (but less optimal) results\nsat.split(text, max_length=120, algorithm=\"greedy\")\n```\n\n### Priors for Length Preference\n\nUse priors to influence segment length distribution. Available priors:\n\n| Prior | Best For |\n|-------|----------|\n| `\"uniform\"` (default) | Just enforce max_length, let model decide |\n| `\"gaussian\"` | Prefer segments around a target length (intuitive) |\n| `\"lognormal\"` | Right-skewed preference (more tolerant of longer segments) |\n| `\"clipped_polynomial\"` | Must be very close to target length |\n\n```python\n# Gaussian prior (recommended): prefer segments around target_length\nsat.split(text, max_length=100, prior_type=\"gaussian\", \n          prior_kwargs={\"target_length\": 50, \"spread\": 10})\n\n# Log-normal prior: right-skewed (more tolerant of longer segments)\nsat.split(text, max_length=100, prior_type=\"lognormal\", \n          prior_kwargs={\"target_length\": 70, \"spread\": 25})\n\n# Clipped polynomial: hard cutoff at ±spread from target\nsat.split(text, max_length=100, prior_type=\"clipped_polynomial\", \n          prior_kwargs={\"target_length\": 60, \"spread\": 25})\n```\n\n### Language-Aware Defaults\n\nPass `lang_code` to use language-specific defaults for `target_length` and `spread` (based on language-specific corpus statistics):\n\n```python\n# German has longer average sentences → auto-uses target_length=90, spread=35\nsat.split(text, max_length=150, prior_type=\"gaussian\", \n          prior_kwargs={\"lang_code\": \"de\"})\n\n# Chinese has shorter sentences → auto-uses 
target_length=45, spread=15\nsat.split(text, max_length=100, prior_type=\"gaussian\", \n          prior_kwargs={\"lang_code\": \"zh\"})\n```\n\nWhen using LoRA with a language, this happens automatically:\n\n```python\nsat = SaT(\"sat-3l\", style_or_domain=\"ud\", language=\"de\")\nsat.split(text, max_length=150, prior_type=\"gaussian\")  # auto-uses German defaults\n```\n\n### How It Works\n\nThe Viterbi algorithm finds globally optimal segmentation points that balance:\n- The model's sentence boundary predictions (where natural splits occur)\n- Your length preferences (via the prior, if provided)\n\n**Text Reconstruction:**\n```python\n# With constraints (max_length or min_length):\noriginal_text = \"\".join(segments)  # segments may contain newlines\n\n# Without constraints (SaT default with split_on_input_newlines=True):\noriginal_text = \"\\n\".join(segments)\n```\n\n> **Note**: When using length constraints, segments may contain newlines. To remove them, post-process the output.\n\n> **Note**: When `max_length` is set, the `threshold` parameter is ignored. The Viterbi\u002Fgreedy algorithms use raw model probabilities directly instead of threshold-based filtering.\n\nFor more details, see the [Length Constraints Documentation](.\u002Fdocs\u002FLENGTH_CONSTRAINTS.md).\n\n## Adaptation\n\nSaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speeches) sentence styles in 81 languages for `sat-3l` and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. 
For details, see our [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678).\n\nWe also provide verse segmentation modules for 16 genres for `sat-12-no-limited-lookahead`.\n\nLoad LoRA modules like this:\n\n```python\n\n# requires both language and style_or_domain\n# for available ones, check the \u003Cmodel_repository>\u002Floras folder\nsat_lora = SaT(\"sat-3l\", style_or_domain=\"ud\", language=\"en\")\nsat_lora.split(\"Hello this is a test But this is different now Now the next one starts looool\")\n# now for a highly distinct domain\nsat_lora_distinct = SaT(\"sat-12l\", style_or_domain=\"code-switching\", language=\"es-en\")\nsat_lora_distinct.split(\"in the morning over there cada vez que yo decía algo él me decía algo\")\n```\n\nYou can also freely adjust the segmentation threshold, with a higher threshold leading to more conservative segmentation:\n\n```python\n\nsat.split(\"This is a test This is another test.\", threshold=0.4)\n# works similarly for LoRA, but thresholds are higher\nsat_lora.split(\"Hello this is a test But this is different now Now the next one starts looool\", threshold=0.7)\n```\n\n## Advanced Usage\n\n### Get the newline or sentence boundary probabilities for a text:\n\n```python\n# returns newline probabilities (supports batching!)\nsat.predict_proba(text)\n```\n\n### Load a SaT model in [HuggingFace `transformers`](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers):\n\n```python\n# import library to register the custom models \nimport wtpsplit.models\nfrom transformers import AutoModelForTokenClassification\n\nmodel = AutoModelForTokenClassification.from_pretrained(\"segment-any-text\u002Fsat-3l-sm\") # or some other model name; see https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\n```\n\n### Adapt to your own corpus via LoRA\n\nOur models can be adapted efficiently via LoRA. As few as 10-100 segmented training sentences should already improve performance considerably. 
To do so:\n\nClone the repository and install requirements:\n\n```\ngit clone https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\ncd wtpsplit\npip install -r requirements.txt\npip install adapters==0.2.1 --no-dependencies\ncd ..\n```\n\n1. Create data in this format:\n\n```python\nimport torch\n\ntorch.save(\n    {\n        \"language_code\": {\n            \"sentence\": {\n                \"dummy-dataset\": {\n                    \"meta\": {\n                        \"train_data\": [\"train sentence 1\", \"train sentence 2\"],\n                    },\n                    \"data\": [\n                        \"test sentence 1\",\n                        \"test sentence 2\",\n                    ]\n                }\n            }\n        }\n    },\n    \"dummy-dataset.pth\"\n)\n```\n\nNote that individual sentences must not contain newlines; this now raises an error. Each list entry should be a single sentence with no \"\\n\" characters, so your corpus should already be well-split.\n\n2. Create or adapt a config; provide the base model via `model_name_or_path` and the training data (.pth) via `text_path`:\n\n`configs\u002Flora\u002Flora_dummy_config.json`\n\nWe recommend starting from this config and adapting `model_name_or_path`, `output_dir`, and `text_path` if needed.\nYou may also wish to adapt other aspects such as `adapter_config` and batch sizes, but this is more experimental.\n\n3. Train LoRA:\n\n```\npython3 wtpsplit\u002Ftrain\u002Ftrain_lora.py configs\u002Flora\u002Flora_dummy_config.json\n```\n\n4. Once training is done, provide your saved module's path to SaT:\n\n```python\n\nsat_lora_adapted = SaT(\"model-used\", lora_path=\"dummy_lora_path\")\nsat_lora_adapted.split(\"Some domain-specific or styled text\")\n```\n\n**Important:** Use the **same model variant** for inference as for training (e.g. 
`sat-12l-sm` and `sat-12l` have different configs; an adapter trained on one cannot be loaded on the other).\n\nAdjust the dataset name, language and model in the above to your needs.\n\n## Reproducing the paper\n\n`configs\u002F` contains the configs for the runs from the paper for base and sm models as well as LoRA modules. Launch training for each of them like this:\n\n```\npython3 wtpsplit\u002Ftrain\u002Ftrain.py configs\u002F\u003Cconfig_name>.json\npython3 wtpsplit\u002Ftrain\u002Ftrain_sm.py configs\u002F\u003Cconfig_name>.json\npython3 wtpsplit\u002Ftrain\u002Ftrain_lora.py configs\u002F\u003Cconfig_name>.json\n```\n\nIn addition:\n\n- `wtpsplit\u002Fdata_acquisition` contains the code for obtaining evaluation data and raw text from the mC4 corpus.\n- `wtpsplit\u002Fevaluation` contains the code for:\n  - evaluation (i.e. sentence segmentation results) via `intrinsic.py`.\n  - short-sequence evaluation (i.e. sentence segmentation results for pairs\u002Fk-mers of sentences) via `intrinsic_pairwise.py`.\n  - LLM baseline evaluation (`llm_sentence.py`), legal baseline evaluation (`legal_baselines.py`)\n  - baseline (PySBD, nltk, etc.) 
evaluation results in `intrinsic_baselines.py` and `intrinsic_baselines_multi.py`\n  - Raw results in JSON format are also in `evaluation_results\u002F`\n  - Statistical significance testing code and results are in `stat_tests\u002F`\n  - punctuation annotation experiments in `punct_annotation.py` and `punct_annotation_wtp.py` (WtP only)\n  - extrinsic evaluation on Machine Translation in `extrinsic.py` (WtP only)\n\nMake sure to install the packages from `requirements.txt` beforehand.\n\n## Supported Languages\n\n\u003Cdetails>\n  \u003Csummary>Table with supported languages\u003C\u002Fsummary>\n\n| iso | Name            |\n| :-- | :-------------- |\n| af  | Afrikaans       |\n| am  | Amharic         |\n| ar  | Arabic          |\n| az  | Azerbaijani     |\n| be  | Belarusian      |\n| bg  | Bulgarian       |\n| bn  | Bengali         |\n| ca  | Catalan         |\n| ceb | Cebuano         |\n| cs  | Czech           |\n| cy  | Welsh           |\n| da  | Danish          |\n| de  | German          |\n| el  | Greek           |\n| en  | English         |\n| eo  | Esperanto       |\n| es  | Spanish         |\n| et  | Estonian        |\n| eu  | Basque          |\n| fa  | Persian         |\n| fi  | Finnish         |\n| fr  | French          |\n| fy  | Western Frisian |\n| ga  | Irish           |\n| gd  | Scottish Gaelic |\n| gl  | Galician        |\n| gu  | Gujarati        |\n| ha  | Hausa           |\n| he  | Hebrew          |\n| hi  | Hindi           |\n| hu  | Hungarian       |\n| hy  | Armenian        |\n| id  | Indonesian      |\n| ig  | Igbo            |\n| is  | Icelandic       |\n| it  | Italian         |\n| ja  | Japanese        |\n| jv  | Javanese        |\n| ka  | Georgian        |\n| kk  | Kazakh          |\n| km  | Central Khmer   |\n| kn  | Kannada         |\n| ko  | Korean          |\n| ku  | Kurdish         |\n| ky  | Kirghiz         |\n| la  | Latin           |\n| lt  | Lithuanian      |\n| lv  | Latvian         |\n| mg  | Malagasy        |\n| mk  | Macedonian      
|\n| ml  | Malayalam       |\n| mn  | Mongolian       |\n| mr  | Marathi         |\n| ms  | Malay           |\n| mt  | Maltese         |\n| my  | Burmese         |\n| ne  | Nepali          |\n| nl  | Dutch           |\n| no  | Norwegian       |\n| pa  | Panjabi         |\n| pl  | Polish          |\n| ps  | Pushto          |\n| pt  | Portuguese      |\n| ro  | Romanian        |\n| ru  | Russian         |\n| si  | Sinhala         |\n| sk  | Slovak          |\n| sl  | Slovenian       |\n| sq  | Albanian        |\n| sr  | Serbian         |\n| sv  | Swedish         |\n| ta  | Tamil           |\n| te  | Telugu          |\n| tg  | Tajik           |\n| th  | Thai            |\n| tr  | Turkish         |\n| uk  | Ukrainian       |\n| ur  | Urdu            |\n| uz  | Uzbek           |\n| vi  | Vietnamese      |\n| xh  | Xhosa           |\n| yi  | Yiddish         |\n| yo  | Yoruba          |\n| zh  | Chinese         |\n| zu  | Zulu            |\n\n\u003C\u002Fdetails>\n\nFor details, please see our [Segment any Text paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678).\n\n## Community Ports\n\n- **Rust**: [wtsplit-rs](https:\u002F\u002Fgithub.com\u002F19h\u002Fwtsplit-rs) by [@19h](https:\u002F\u002Fgithub.com\u002F19h)\n\n*Note: Community ports are independently maintained and may have different feature sets or update schedules.*\n\n## Citations\n\nFor the `SaT` models, please kindly cite our paper:\n\n```\n@inproceedings{frohmann-etal-2024-segment,\n    title = \"Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation\",\n    author = \"Frohmann, Markus  and\n      Sterner, Igor  and\n      Vuli{\\'c}, Ivan  and\n      Minixhofer, Benjamin  and\n      Schedl, Markus\",\n    editor = \"Al-Onaizan, Yaser  and\n      Bansal, Mohit  and\n      Chen, Yun-Nung\",\n    booktitle = \"Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing\",\n    month = nov,\n    year = \"2024\",\n    address = \"Miami, 
Florida, USA\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Faclanthology.org\u002F2024.emnlp-main.665\",\n    pages = \"11908--11941\"\n}\n\n```\n\nFor the library and the WtP models, please cite:\n\n```\n@inproceedings{minixhofer-etal-2023-wheres,\n    title = \"Where{'}s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation\",\n    author = \"Minixhofer, Benjamin  and\n      Pfeiffer, Jonas  and\n      Vuli{\\'c}, Ivan\",\n    booktitle = \"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)\",\n    month = jul,\n    year = \"2023\",\n    address = \"Toronto, Canada\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Faclanthology.org\u002F2023.acl-long.398\",\n    pages = \"7215--7235\"\n}\n```\n\n## Acknowledgments\n\nThis research was funded in whole or in part by the Austrian Science Fund (FWF): P36413, P33526, and DFH-23, and by the State of Upper Austria and the Federal Ministry of Education, Science, and Research, through grants LIT-2021-YOU-215. In addition, Ivan Vulic and Benjamin Minixhofer have been supported through the Royal Society University Research Fellowship ‘Inclusive and Sustainable Language Technology for a Truly Multilingual World’ (no 221137) awarded to Ivan Vulić. This research has also been supported with Cloud TPUs from Google’s TPU Research Cloud (TRC). This work was also supported by compute credits from a Cohere For AI Research Grant, these grants are designed to support academic partners conducting research with the goal of releasing scientific artifacts and data for good projects. 
We also thank Simone Teufel for fruitful discussions.\n\n---\n\nFor any questions, please create an issue or send an email to markus.frohmann@gmail.com, and I will get back to you as soon as possible.\n","\u003Ch1 align=\"center\">wtpsplit🪓\u003C\u002Fh1>\n\u003Ch3 align=\"center\">对任意文本进行分段——稳健、高效、可适应⚡\u003C\u002Fh3>\n\n本仓库允许您将文本分割成句子或其他语义单元。它实现了以下模型：\n\n- **SaT** &mdash; [Segment Any Text: 一种稳健、高效且可适应的通用句子分割方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678)，作者为Markus Frohmann、Igor Sterner、Benjamin Minixhofer、Ivan Vulić 和 Markus Schedl（**当前最先进，推荐使用**）。\n- **WtP** &mdash; [句号在哪里？自监督多语言无标点符号依赖的句子分割](https:\u002F\u002Faclanthology.org\u002F2023.acl-long.398\u002F)，作者为Benjamin Minixhofer、Jonas Pfeiffer 和 Ivan Vulić（*旧版本，为保证结果可复现而保留*）。\n\n出于一致性考虑，沿用了“WtP”这一名称。我们的新模型 SaT 在 85 种语言上提供了更稳健、高效且可适应的句子分割功能，同时性能更高、计算成本更低。请参阅我们在 [Segment any Text 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678) 中展示的在 8 个不同语料库和 85 种语言上的**最先进**结果。\n\n![系统图](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsegment-any-text_wtpsplit_readme_8cc1c8731d00.png)\n\n## 安装\n\n```bash\npip install wtpsplit\n```\n或者，如果您需要 ONNX 支持，可以安装以下任一选项：\n```bash\npip install wtpsplit[onnx-gpu]\npip install wtpsplit[onnx-cpu]\n```\n\n## 使用方法\n\n```python\nfrom wtpsplit import SaT\n\nsat = SaT(\"sat-3l\")\n# 可选：在 GPU 上运行以获得更好的性能\n# 也支持 TPU，例如通过 sat.to(\"xla:0\")；此时需将 pad_last_batch 参数设为 True\nsat.half().to(\"cuda\")\n\nsat.split(\"这是一个测试 这是另一个测试。\")\n# 返回 [\"这是一个测试 \", \"这是另一个测试。\"]\n\n# 为了提升性能，建议不要逐个调用 sat.split 对每段文本进行处理，而是直接传入文本列表：\nsat.split([\"这是一个测试 这是另一个测试。\", \"还有一些其他的文本...\"])\n# 返回一个迭代器，依次生成每段文本对应的句子列表\n\n# 对于一般的句子分割任务，可以使用 '-sm' 模型：\nsat_sm = SaT(\"sat-3l-sm\")\nsat_sm.half().to(\"cuda\") # 可选，参考上述说明\nsat_sm.split(\"这是一个测试 这是另一个测试\")\n# 返回 [\"这是一个测试 \", \"这是另一个测试\"]\n\n# 如果需要针对特定语言或领域\u002F风格进行强适应性调整，可以使用经过微调的 LoRA 模块：\nsat_adapted = SaT(\"sat-3l\", style_or_domain=\"ud\", language=\"en\")\nsat_adapted.half().to(\"cuda\") # 可选，参考上述说明\nsat_adapted.split(\"这是一个测试 这是另一个测试。\")\n# 返回 ['这是一个测试 
', '这是另一个测试']\n```\n\n## ONNX 支持\n\n🚀 现在您可以为 `sat` 和 `sat-sm` 模型启用更快的 ONNX 推理！🚀\n\n```python\nsat = SaT(\"sat-3l-sm\", ort_providers=[\"CUDAExecutionProvider\", \"CPUExecutionProvider\"])\n```\n\n```python\n>>> from wtpsplit import SaT\n>>> texts = [\"这是一句话。这又是另一句话。\"] * 1000\n\n# PyTorch GPU\n>>> model_pytorch = SaT(\"sat-3l-sm\")\n>>> model_pytorch.half().to(\"cuda\");\n>>> %timeit list(model_pytorch.split(texts))\n# 144 ms ± 252 μs 每循环（7 次运行，每次 10 次循环的平均值 ± 标准差）\n# 已经相当快了，但...\n\n# onnxruntime GPU\n>>> model_ort = SaT(\"sat-3l-sm\", ort_providers=[\"CUDAExecutionProvider\", \"CPUExecutionProvider\"])\n>>> %timeit list(model_ort.split(texts))\n# 94.9 ms ± 165 μs 每循环（7 次运行，每次 10 次循环的平均值 ± 标准差）\n# ...这应该会快约 50%！（在 RTX 3090 上测试过）\n```\n\n如果您希望将 LoRA 与 ONNX 模型结合使用：\n\n- 运行 `scripts\u002Fexport_to_onnx_sat.py`，并将 `use_lora` 设置为 `True`，同时指定合适的 `output_dir: \u003COUTPUT_DIR>`。\n  - 如果您已有本地的 LoRA 模块，请使用 `lora_path`。\n  - 如果您希望从 Hugging Face Hub 加载 LoRA 模块，则使用 `style_or_domain` 和 `language`。\n- 加载已合并 LoRA 权重的 ONNX 模型：\n  `sat = SaT(\u003COUTPUT_DIR>, onnx_providers=[\"CUDAExecutionProvider\", \"CPUExecutionProvider\"])`\n\n## 可用模型\n\n如果您需要一个通用的句子分割模型，建议使用 `-sm` 模型（如 `sat-3l-sm`）。对于对速度敏感的应用场景，我们推荐 3 层模型（`sat-3l` 和 `sat-3l-sm`），它们在速度和性能之间取得了很好的平衡。而性能最佳的则是我们的 12 层模型：`sat-12l` 和 `sat-12l-sm`。\n\n| 模型                                                                        | 英语得分 | 多语言得分 |\n| :--------------------------------------------------------------------------- | ----------: | -----------: |\n| [sat-1l](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-1l)                        |       88.5 |        84.3 |\n| [sat-1l-sm](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-1l-sm)                  |       88.2 |        87.9 |\n| [sat-3l](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-3l)                        |       93.7 |        89.2 |\n| 
[sat-3l-lora](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-3l\u002Ftree\u002Fmain\u002Floras)   |       96.7 |        94.8 |\n| [sat-3l-sm](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-3l-sm)                  |       96.5 |        93.5 |\n| [sat-6l](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-6l)                        |       94.1 |        89.7 |\n| [sat-6l-sm](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-6l-sm)                  |       96.9 |        95.1 |\n| [sat-9l](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-9l)                        |       94.3 |        90.3 |\n| [sat-12l](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-12l)                      |       94.0 |        90.4 |\n| [sat-12l-lora](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-12l\u002Ftree\u002Fmain\u002Floras) |       97.3 |        95.9 |\n| [sat-12l-sm](https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\u002Fsat-12l-sm)                |       97.4 |        96.0 |\n\n以上分数分别为“英语”和“多语言”类别下的宏平均 F1 分数，分别基于所有可用数据集计算得出。“adapted”表示通过 LoRA 进行适配；详细信息请参阅 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678)。\n\n作为对比，以下是其他一些工具的英语得分：\n\n| 模型                                                    | 英语得分 |\n| :------------------------------------------------------- | ----------: |\n| PySBD                                                    |       69.6 |\n| SpaCy（单语句分割器）                         |       92.9 |\n| SpaCy（多语句分割器）                        |       91.5 |\n| Ersatz                                                   |       91.4 |\n| Punkt (`nltk.sent_tokenize`)                           |       92.2 |\n| [WtP (3l)](https:\u002F\u002Fhuggingface.co\u002Fbenjamin\u002Fwtp-canine-s-3l) |       93.9 |\n\n请注意，本库同样支持之前的 [`WtP`](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18893) 模型。您可以以与 `SaT` 模型基本相同的方式使用它们：\n\n```python\nfrom wtpsplit import WtP\n\nwtp = 
WtP(\"wtp-bert-mini\")\n\n# 与 SaT 模型类似的功能\nwtp.split(\"这是一个测试 这是另一个测试。\")\n```\n\n有关 WtP 的更多详细信息以及复现细节，请参阅 [WtP 文档](.\u002FREADME_WTP.md)。\n\n## 段落分割\n\n由于 SaT 模型经过训练可以预测换行概率，因此除了句子之外，它们还可以将文本分割成段落。\n\n```python\n# 返回一个段落列表，每个段落包含一个句子列表\n# 可通过 `paragraph_threshold` 参数调整段落阈值。\nsat.split(text, do_paragraph_segmentation=True)\n```\n\n## （新增！v2.2+）长度约束分割\n\n使用 `min_length` 和 `max_length` 参数控制片段长度。当下游对片段大小有限制时（例如嵌入模型的输入上限、存储限制或后续处理），这一功能非常有用。\n\n### 基本用法\n\n```python\nfrom wtpsplit import SaT\n\nsat = SaT(\"sat-3l-sm\")\n\ntext = (\n    \"起初，上帝创造了天地。 地是空虚混沌，渊面黑暗；上帝的灵运行在水面上。 上帝说：‘要有光！’于是就有了光。 上帝看着光，觉得甚好；于是将光与暗分开。 上帝称光为昼，称暗为夜。 这样，就有了晚上和早晨，这是第一天。\"\n)\n\n# 将文本按最大片段长度 120 字符进行分割\nsegments = sat.split(text, max_length=120)\nfor i, s in enumerate(segments):\n    print(f\"[{len(s):3d} 字] {s}\")\n# 示例输出（长度数值沿用英文原文示例，仅作示意；中文文本的实际长度和切分位置会有所不同）：\n# [ 55 字] 起初，上帝创造了天地。\n# [ 86 字] 地是空虚混沌，渊面黑暗。\n# [112 字] 上帝的灵运行在水面上。上帝说：‘要有光！’于是就有了光。\n# [ 86 字] 上帝看着光，觉得甚好；于是将光与暗分开。\n# [115 字] 上帝称光为昼，称暗为夜。这样，就有了晚上和早晨，这是第一天。\n\nassert \"\".join(segments) == text  # 文本被完整保留\n\n# 同时强制最小和最大长度\nsat.split(text, min_length=80, max_length=200)\n\n# 使用贪心算法以获得更快但次优的结果\nsat.split(text, max_length=120, algorithm=\"greedy\")\n```\n\n### 长度偏好先验\n\n使用先验来影响片段长度分布。可用的先验包括：\n\n| 先验 | 最佳用途 |\n|-------|----------|\n| `\"uniform\"`（默认） | 只强制最大长度，让模型自行决定 |\n| `\"gaussian\"` | 偏好接近目标长度的片段（直观） |\n| `\"lognormal\"` | 右偏偏好（对较长片段更宽容） |\n| `\"clipped_polynomial\"` | 必须非常接近目标长度 |\n\n```python\n# 高斯先验（推荐）：偏好接近目标长度的片段\nsat.split(text, max_length=100, prior_type=\"gaussian\",\n          prior_kwargs={\"target_length\": 50, \"spread\": 10})\n\n# 对数正态先验：右偏（对较长片段更宽容）\nsat.split(text, max_length=100, prior_type=\"lognormal\",\n          prior_kwargs={\"target_length\": 70, \"spread\": 25})\n\n# 截断多项式：在目标长度上下一定范围内硬性截断\nsat.split(text, max_length=100, prior_type=\"clipped_polynomial\",\n          prior_kwargs={\"target_length\": 60, \"spread\": 25})\n```\n\n### 语言感知的默认设置\n\n在 `prior_kwargs` 中传入 `lang_code`，即可根据该语言的语料库统计信息使用对应的 `target_length` 和 `spread` 默认值：\n\n```python\n# 德语的平均句长较长 → 自动使用 
target_length=90, spread=35\nsat.split(text, max_length=150, prior_type=\"gaussian\",\n          prior_kwargs={\"lang_code\": \"de\"})\n\n# 中文的平均句长较短 → 自动使用 target_length=45, spread=15\nsat.split(text, max_length=100, prior_type=\"gaussian\",\n          prior_kwargs={\"lang_code\": \"zh\"})\n```\n\n当使用 LoRA 并指定语言时，这些设置会自动应用：\n\n```python\nsat = SaT(\"sat-3l\", style_or_domain=\"ud\", language=\"de\")\nsat.split(text, max_length=150, prior_type=\"gaussian\")  # 自动使用德语默认值\n```\n\n### 工作原理\n\n维特比算法会找到全局最优的分割点，从而在以下两者之间取得平衡：\n- 模型对句子边界的预测（自然分隔的位置）\n- 您的长度偏好（通过先验体现，若已提供）\n\n**文本重建：**\n```python\n# 如果有长度约束（max_length 或 min_length）：\noriginal_text = \"\".join(segments)  # 片段可能包含换行符\n\n# 如果没有长度约束（SaT 默认且 split_on_input_newlines=True）：\noriginal_text = \"\\n\".join(segments)\n```\n\n> **注意**：使用长度约束时，片段可能会包含换行符。如果您希望去除这些换行符，可以在后处理阶段进行清理。\n\n> **注意**：当设置了 `max_length` 时，`threshold` 参数将被忽略。维特比\u002F贪心算法会直接使用原始模型概率，而不是基于阈值的过滤。\n\n有关更多信息，请参阅 [长度约束文档](.\u002Fdocs\u002FLENGTH_CONSTRAINTS.md)。\n\n## 适应性\n\nSaT 模型可以通过 LoRA 技术针对特定领域和风格进行适配。我们为 `sat-3l` 和 `sat-12l` 提供了训练好的 LoRA 模块，覆盖 81 种语言的 Universal Dependencies（通用依存）、OPUS100、Ersatz 和 TED（即 ASR 风格的演讲转录）句子风格。此外，我们还提供了适用于 6 种语言的法律文件（法律和判决）、4 对语言组合的语码转换（code-switching）以及 3 种语言的推文的 LoRA 模块。有关详细信息，请参阅我们的论文 [arXiv:2406.16678](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678)。\n\n我们还为 `sat-12-no-limited-lookahead` 提供了 16 种体裁的诗句分割模块。\n\n加载 LoRA 模块的方式如下：\n\n```python\n\n# 需要同时指定 language 和 style_or_domain\n# 可用的模块请查看 \u003Cmodel_repository>\u002Floras 文件夹\nsat_lora = SaT(\"sat-3l\", style_or_domain=\"ud\", language=\"en\")\nsat_lora.split(\"你好，这是一个测试。但现在情况不同了。现在下一个开始了，looool\")\n# 接下来是一个非常不同的领域\nsat_lora_distinct = SaT(\"sat-12l\", style_or_domain=\"code-switching\", language=\"es-en\")\nsat_lora_distinct.split(\"早上在那里，每当我讲些什么，他就会回应我。\")\n```\n\n您也可以自由调整分割阈值，较高的阈值会导致更保守的分割：\n\n```python\n\nsat.split(\"这是一个测试 这是另一个测试。\", threshold=0.4)\n# LoRA 模块同样适用；但其阈值更高\nsat_lora.split(\"你好，这是一个测试。但现在情况不同了。现在下一个开始了，looool\", threshold=0.7)\n```\n\n## 高级用法\n\n### 
获取文本中换行符或句子边界的概率：\n\n```python\n# 返回换行符概率（支持批量处理！）\nsat.predict_proba(text)\n```\n\n### 在 HuggingFace `transformers` 中加载 SaT 模型：\n\n```python\n\n# 导入库以注册自定义模型\nimport wtpsplit.models\nfrom transformers import AutoModelForTokenClassification\n\nmodel = AutoModelForTokenClassification.from_pretrained(\"segment-any-text\u002Fsat-3l-sm\") # 或者其他模型名称；请参阅 https:\u002F\u002Fhuggingface.co\u002Fsegment-any-text\n```\n\n### 通过LoRA适配您自己的语料库\n\n我们的模型可以通过LoRA高效且强大地进行适配。仅需10至100个经过分段标注的训练句子，性能即可显著提升。具体操作如下：\n\n克隆仓库并安装依赖项：\n\n```\ngit clone https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\ncd wtpsplit\npip install -r requirements.txt\npip install adapters==0.2.1 --no-dependencies\ncd ..\n```\n\n1. 按照以下格式创建数据：\n\n```python\nimport torch\n\ntorch.save(\n    {\n        \"language_code\": {\n            \"sentence\": {\n                \"dummy-dataset\": {\n                    \"meta\": {\n                        \"train_data\": [\"训练句1\", \"训练句2\"],\n                    },\n                    \"data\": [\n                        \"测试句1\",\n                        \"测试句2\",\n                    ]\n                }\n            }\n        }\n    },\n    \"dummy-dataset.pth\"\n)\n```\n\n请注意，单个句子内不应包含换行符！否则会引发错误。列表中的每个条目都应为一个完整的句子，且不得含有`\\n`字符。因此，您的语料库应已预先进行良好的分割。\n\n2. 创建或调整配置文件；通过`model_name_or_path`指定基础模型，并通过`text_path`提供训练数据的`.pth`文件：\n\n`configs\u002Flora\u002Flora_dummy_config.json`\n\n我们建议从该配置开始，必要时调整`model_name_or_path`、`output_dir`和`text_path`。您也可以根据需要进一步调整`adapter_config`及批量大小等参数，但这属于实验性内容。\n\n3. 训练LoRA：\n\n```\npython3 wtpsplit\u002Ftrain\u002Ftrain_lora.py configs\u002Flora\u002Flora_dummy_config.json\n```\n\n4. 
训练完成后，将保存的模块路径提供给SaT：\n\n```python\n\nsat_lora_adapted = SaT(\"model-used\", lora_path=\"dummy_lora_path\")\nsat_lora_adapted.split(\"一些领域特定或风格化的文本\")\n```\n\n**重要提示：** 推理时使用的模型变体必须与训练时一致（例如，`sat-12l-sm`和`sat-12l`的配置不同，基于前者训练的适配器无法加载到后者上）。\n\n请根据您的需求调整上述代码中的数据集名称、语言和模型。\n\n## 复现论文结果\n\n`configs\u002F` 目录下包含了论文中关于基础模型、sm模型以及LoRA模块的运行配置文件。您可以按如下方式分别启动训练：\n\n```\npython3 wtpsplit\u002Ftrain\u002Ftrain.py configs\u002F\u003Cconfig_name>.json\npython3 wtpsplit\u002Ftrain\u002Ftrain_sm.py configs\u002F\u003Cconfig_name>.json\npython3 wtpsplit\u002Ftrain\u002Ftrain_lora.py configs\u002F\u003Cconfig_name>.json\n```\n\n此外：\n\n- `wtpsplit\u002Fdata_acquisition` 包含从mC4语料库获取评估数据和原始文本的代码。\n- `wtpsplit\u002Fevaluation` 包含以下代码：\n  - 通过 `intrinsic.py` 进行句子分割结果的内在评估。\n  - 通过 `intrinsic_pairwise.py` 进行短序列评估（即句子对\u002Fk-mer的分割结果）。\n  - LLM基线评估（`llm_sentence.py`）、法律基线评估（`legal_baselines.py`）。\n  - 基于PySBD、nltk等工具的基线评估结果见 `intrinsic_baselines.py` 和 `intrinsic_baselines_multi.py`。\n  - 原始评估结果以JSON格式存储在 `evaluation_results\u002F` 中。\n  - 统计显著性检验的代码及结果位于 `stat_tests\u002F`。\n  - 标点符号标注实验见 `punct_annotation.py` 和 `punct_annotation_wtp.py`（仅适用于WtP）。\n  - 机器翻译方面的外在评估见 `extrinsic.py`（仅适用于WtP）。\n\n请确保提前安装 `requirements.txt` 中列出的软件包。\n\n## 支持的语言\n\n\u003Cdetails>\n  \u003Csummary>支持语言表\u003C\u002Fsummary>\n\n| iso | 名称            |\n| :-- | :-------------- |\n| af  | 南非语       |\n| am  | 阿姆哈拉语         |\n| ar  | 阿拉伯语          |\n| az  | 阿塞拜疆语     |\n| be  | 白俄罗斯语      |\n| bg  | 保加利亚语       |\n| bn  | 孟加拉语         |\n| ca  | 加泰罗尼亚语     |\n| ceb | 宿务语         |\n| cs  | 捷克语           |\n| cy  | 威尔士语           |\n| da  | 丹麦语          |\n| de  | 德语          |\n| el  | 希腊语           |\n| en  | 英语         |\n| eo  | 世界语       |\n| es  | 西班牙语         |\n| et  | 爱沙尼亚语        |\n| eu  | 巴斯克语          |\n| fa  | 波斯语         |\n| fi  | 芬兰语         |\n| fr  | 法语          |\n| fy  | 西弗里斯兰语 |\n| ga  | 爱尔兰语           |\n| gd  | 苏格兰盖尔语 |\n| gl  | 加利西亚语        |\n| gu  | 古吉拉特语        |\n| ha  | 豪萨语           
|\n| he  | 希伯来语          |\n| hi  | 印地语           |\n| hu  | 匈牙利语       |\n| hy  | 亚美尼亚语        |\n| id  | 印度尼西亚语      |\n| ig  | 伊博语            |\n| is  | 冰岛语       |\n| it  | 意大利语         |\n| ja  | 日语        |\n| jv  | 爪哇语        |\n| ka  | 格鲁吉亚语        |\n| kk  | 哈萨克语          |\n| km  | 高棉语           |\n| kn  | 卡纳达语         |\n| ko  | 韩语          |\n| ku  | 库尔德语         |\n| ky  | 吉尔吉斯语         |\n| la  | 拉丁语           |\n| lt  | 立陶宛语      |\n| lv  | 拉脱维亚语         |\n| mg  | 马达加斯加语      |\n| mk  | 马其顿语      |\n| ml  | 马拉雅拉姆语       |\n| mn  | 蒙古语       |\n| mr  | 马拉地语         |\n| ms  | 马来语           |\n| mt  | 马耳他语         |\n| my  | 缅甸语           |\n| ne  | 尼泊尔语         |\n| nl  | 荷兰语           |\n| no  | 挪威语       |\n| pa  | 旁遮普语         |\n| pl  | 波兰语          |\n| ps  | 普什图语          |\n| pt  | 葡萄牙语      |\n| ro  | 罗马尼亚语        |\n| ru  | 俄语           |\n| si  | 僧伽罗语         |\n| sk  | 斯洛伐克语       |\n| sl  | 斯洛文尼亚语       |\n| sq  | 阿尔巴尼亚语        |\n| sr  | 塞尔维亚语       |\n| sv  | 瑞典语         |\n| ta  | 泰米尔语         |\n| te  | 泰卢固语          |\n| tg  | 塔吉克语           |\n| th  | 泰语            |\n| tr  | 土耳其语         |\n| uk  | 乌克兰语       |\n| ur  | 乌尔都语         |\n| uz  | 乌兹别克语         |\n| vi  | 越南语           |\n| xh  | 科萨语           |\n| yi  | 意第绪语         |\n| yo  | 约鲁巴语          |\n| zh  | 中文         |\n| zu  | 祖鲁语            |\n\n\u003C\u002Fdetails>\n\n有关详细信息，请参阅我们的 [Segment any Text 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678)。\n\n## 社区移植版本\n\n- **Rust**: [wtsplit-rs](https:\u002F\u002Fgithub.com\u002F19h\u002Fwtsplit-rs) 由 [@19h](https:\u002F\u002Fgithub.com\u002F19h) 开发\n\n*注：社区移植版本由社区独立维护，可能具有不同的功能集或更新计划。*\n\n## 引用\n\n对于 `SaT` 模型，请引用我们的论文：\n\n```\n@inproceedings{frohmann-etal-2024-segment,\n    title = \"Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation\",\n    author = \"Frohmann, Markus and Sterner, Igor and Vuli{\\'c}, Ivan and Minixhofer, Benjamin and Schedl, Markus\",\n    editor = \"Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung\",\n    booktitle = \"Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing\",\n    month = nov,\n    year = \"2024\",\n    address = \"Miami, Florida, USA\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Faclanthology.org\u002F2024.emnlp-main.665\",\n    pages = \"11908--11941\"\n}\n```\n\n对于该库和 WtP 模型，请引用：\n\n```\n@inproceedings{minixhofer-etal-2023-wheres,\n    title = \"Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation\",\n    author = \"Minixhofer, Benjamin and Pfeiffer, Jonas and Vuli{\\'c}, Ivan\",\n    booktitle = \"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)\",\n    month = jul,\n    year = \"2023\",\n    address = \"Toronto, Canada\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Faclanthology.org\u002F2023.acl-long.398\",\n    pages = \"7215--7235\"\n}\n```\n\n## 致谢\n\n本研究全部或部分由奥地利科学基金会（FWF）资助，项目编号为 P36413、P33526 和 DFH-23；同时，也得到了上奥地利州以及联邦教育、科学与研究部通过 LIT-2021-YOU-215 资助的支持。此外，Ivan Vulić 和 Benjamin Minixhofer 还获得了英国皇家学会大学研究奖学金（University Research Fellowship）“面向真正多语世界的包容性与可持续语言技术”（编号 221137）的资助。本研究还得到了谷歌 TPU 研究云（TRC）提供的 Cloud TPU 支持。此外，本工作还获得了 Cohere For AI Research Grant 的算力资助，该资助旨在支持学术合作伙伴开展以发布有益的科研成果和数据为目标的研究。我们还要感谢 Simone Teufel 提供的富有成效的讨论。\n\n---\n\n如有任何问题，请提交 issue 或发送邮件至 markus.frohmann@gmail.com，我将尽快回复您。","# wtpsplit 快速上手指南\n\nwtpsplit 是一个强大且高效的文本分割工具，支持将任意文本稳健地分割为句子或段落。它基于最新的 **SaT** (Segment Any Text) 模型，在 85 种语言上实现了业界领先的性能，并支持通过 LoRA 进行领域和风格适配。\n\n## 环境准备\n\n*   **系统要求**：支持 Linux、macOS 和 Windows。\n*   **Python 版本**：需要 Python 3.9 及以上（v2.2.0 起的最低支持版本）。\n*   **硬件加速（可选）**：\n    *   **GPU**：如需加速推理，请确保已安装 NVIDIA CUDA 驱动及对应的 PyTorch GPU 版本。\n    *   **ONNX Runtime**：若需极致推理速度，可安装 ONNX Runtime 版本。\n*   **依赖管理**：建议使用 `pip` 或 `conda` 管理环境。\n\n> **提示**：国内开发者如遇下载慢的问题，建议在安装命令中指定清华或阿里镜像源。\n\n## 安装步骤\n\n### 1. 基础安装\n安装标准版本（基于 PyTorch）：\n```bash\npip install wtpsplit -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 2. 
启用 ONNX 加速（推荐用于生产环境）\n如果需要更快的推理速度（支持 CPU 或 GPU），请选择以下其一安装：\n\n**GPU 加速版：**\n```bash\npip install \"wtpsplit[onnx-gpu]\" -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n**CPU 加速版：**\n```bash\npip install \"wtpsplit[onnx-cpu]\" -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 基本使用\n\n### 1. 最简单的句子分割\n加载预训练模型并进行分割。默认推荐使用 `-sm` 后缀的模型以获得速度与精度的最佳平衡。\n\n```python\nfrom wtpsplit import SaT\n\n# 初始化模型 (推荐使用 sat-3l-sm)\nsat = SaT(\"sat-3l-sm\")\n\n# 可选：启用 GPU 加速 (如果可用)\n# sat.half().to(\"cuda\")\n\ntext = \"This is a test. This is another test.\"\nsentences = sat.split(text)\n\nprint(sentences)\n# 输出: ['This is a test. ', 'This is another test.']\n```\n\n### 2. 批量处理（高性能模式）\n为了获得最佳性能，请一次性传入文本列表，而不是循环调用。\n\n```python\ntexts = [\n    \"First sentence. Second sentence.\",\n    \"Another text block here.\"\n]\n\n# 返回一个迭代器，每个元素是对应输入文本的句子列表\nresults = sat.split(texts)\n\nfor res in results:\n    print(res)\n```\n\n### 3. 进阶功能速览\n\n**段落分割**\n利用模型预测换行符概率的能力，直接将文本分割为段落。\n```python\n# 返回段落列表，每个段落包含句子列表\nparagraphs = sat.split(text, do_paragraph_segmentation=True)\n```\n\n**长度约束分割 (v2.2+)**\n控制分割后的片段长度（例如限制最大字符数），适用于嵌入模型输入或存储限制场景。\n```python\n# 限制每个片段最大长度为 120 字符\nsegments = sat.split(text, max_length=120)\n```\n\n**领域\u002F语言适配 (LoRA)**\n加载针对特定语言或风格微调的 LoRA 模块，以获得更高精度。\n```python\n# 加载针对英语通用依赖树库 (UD) 适配的模型\nsat_adapted = SaT(\"sat-3l\", style_or_domain=\"ud\", language=\"en\")\nsat_adapted.split(\"Complex sentence structure...\")\n```\n\n### 模型选择建议\n*   **通用场景\u002F速度敏感**：使用 `sat-3l-sm` 或 `sat-1l-sm`。\n*   **高精度需求**：使用 `sat-12l-sm` 或带 LoRA 适配的模型（如 `sat-3l-lora`）。\n*   **多语言支持**：所有 `sat` 系列模型均原生支持 85 种语言。","某跨国舆情分析团队需要每日处理来自 85 种语言的社交媒体原始数据，将其切分为独立句子以进行情感打分和关键词提取。\n\n### 没有 wtpsplit 时\n- **多语言支持薄弱**：传统规则或旧模型难以应对非英语语种（如泰语、阿拉伯语），导致大量句子切分错误或完全失效。\n- **标点依赖严重**：面对社交媒体中普遍存在的缺失标点、滥用换行或表情符号分隔的文本，现有工具无法准确识别语义边界。\n- **处理效率低下**：在大规模数据流中，重型模型推理速度慢，且缺乏 GPU\u002FONNX 加速选项，造成数据积压和实时性差。\n- **领域适应性差**：通用模型难以理解特定行业术语或网络俚语，导致长句被错误截断或短句被强行合并，影响下游分析精度。\n\n### 使用 
wtpsplit 后\n- **全球语言覆盖**：利用 SaT 模型原生支持 85 种语言的特性，统一处理流程，显著提升了小语种数据的切分准确率。\n- **无标点鲁棒分割**：基于自监督学习架构，wtpsplit 能精准识别无语义标点处的句子边界，完美适配嘈杂的社交文本。\n- **极致推理性能**：通过启用 ONNX GPU 加速，处理千条文本的速度提升约 50%，轻松满足实时舆情监控的低延迟需求。\n- **灵活领域适配**：借助 LoRA 模块，团队可快速加载针对“金融”或“电商”风格微调的权重，使切分结果更贴合业务语境。\n\nwtpsplit 凭借其对多语言、无标点文本的鲁棒分割能力及高效的推理速度，成为了构建高质量多语言 NLP 流水线的关键基石。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsegment-any-text_wtpsplit_9b821e86.png","segment-any-text","Segment any Text","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fsegment-any-text_a98c763b.png","",null,"https:\u002F\u002Fgithub.com\u002Fsegment-any-text",[79,83,87],{"name":80,"color":81,"percentage":82},"Python","#3572A5",78,{"name":84,"color":85,"percentage":86},"TeX","#3D6117",21.9,{"name":88,"color":89,"percentage":90},"Shell","#89e051",0.1,1266,82,"2026-04-05T15:25:29","MIT","未说明","非必需。支持 NVIDIA GPU (通过 'cuda' 或 ONNX 的 'CUDAExecutionProvider') 和 TPU ('xla:0')。具体型号和显存大小未说明，但示例测试环境为 RTX 3090。",{"notes":98,"python":95,"dependencies":99},"该工具核心依赖 PyTorch。若需启用更快的 ONNX 推理，需安装 'wtpsplit[onnx-gpu]' 或 'wtpsplit[onnx-cpu]'。模型文件托管在 Hugging Face，首次运行会自动下载。支持通过 LoRA 模块进行领域和语言适配。提供多种模型尺寸（1l 到 12l），层数越多性能越好但计算成本越高，用户可根据需求选择。",[100,101],"torch","onnxruntime (可选，用于 ONNX 加速)",[15,35,14],[104,105,106,107,108,109,110,111],"sentence-boundary-detection","python","deep-learning","machine-learning","pretrained-models","sentence-segmentation","sentence-segmenter","natural-language-processing","2026-03-27T02:49:30.150509","2026-04-11T18:31:39.334674",[115,120,125,130,135,140],{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},18073,"支持哪些语言？如何请求或训练新语言模型？","目前支持包括瑞典语、挪威语、法语、土耳其语、简体中文、俄语、乌克兰语、印尼语等在内的 80 多种语言。维护者暂时不再主动训练新模型，但欢迎用户自行训练。您可以参考官方提供的训练笔记本自行训练模型：https:\u002F\u002Fgithub.com\u002Fbminixhofer\u002Fnnsplit\u002Fblob\u002Fmain\u002Ftrain\u002Ftrain.ipynb。如果有特定语言需求，可以在相关 Issue 
中留言，以便维护者在未来优先处理。","https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fissues\u002F11",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},18074,"模型是否依赖大小写来识别句子边界？如果文本没有大写（如自动生成的字幕）该怎么办？","早期的 WtP 模型确实严重依赖大写来检测句子边界，如果首字母小写，性能会显著下降。为了解决这个问题，项目引入了新的 SaT 模型（特别是 `-sm` 版本），通过在训练数据中加入大小写混乱等噪声进行了优化，能更好地处理不规则的大小写和标点符号。如果您处理的是 YouTube 自动生成的大写缺失的字幕，建议尝试使用 SaT 模型。","https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fissues\u002F101",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},18075,"如何在 Linux 或非 macOS 平台上安装 Python wheel？","从 v0.4.9 版本开始，项目已在 CI 中构建并发布了适用于 Windows、Linux 和 macOS 的 wheel。请注意，由于构建限制，Python 包版本号可能会显示为 `x-post0` 格式（例如 0.4.9-post0），这是正常现象，旨在避免版本冲突。用户只需确保安装最新版本即可在所有主流平台上使用。","https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fissues\u002F14",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},18076,"加载 LoRA 适配器进行推理时，提示\"未激活适配器\"或使用基础模型，如何解决？","当出现\"There are adapters available but none are activated\"或\"LoRA ... 
not found, using base model\"警告时，通常是因为适配器路径配置错误或未正确加载。请检查 `lora_path` 参数是否指向包含适配器权重的正确目录。此外，确保您的依赖库版本（如 transformers, accelerate, adapters）与项目要求兼容。如果修改了代码以适应不同版本的库，请确保这些修改不影响适配器的加载逻辑。建议对照官方文档重新检查训练和加载步骤。","https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fissues\u002F168",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},18077,"在哪里可以找到除英语和德语以外的其他语言预训练模型？","除了 README 中列出的英语和德语模型外，其他语言的模型信息通常汇总在语言愿望清单（Language wishlist）相关的 Issue 中（如 Issue #11）。如果找不到所需的预训练模型，您可以使用官方提供的训练脚本自行训练。训练指南位于：https:\u002F\u002Fgithub.com\u002Fbminixhofer\u002Fnnsplit\u002Fblob\u002Fmain\u002Ftrain\u002Ftrain.ipynb。社区也欢迎用户贡献自己训练的模型。","https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fissues\u002F5",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},18078,"在 Android 上运行 ONNX 模型时遇到\"DecodeError: Error parsing message\"错误怎么办？","该错误可能与模型输入数据类型有关。模型内部使用 uint8 输入直接馈送到嵌入层。虽然理论上 int8 输入在某些值（如 -10）下可能报错，但在实际转换中，ONNX 模型内部通常会进行适当的偏移处理（例如 0 对应 -127，255 对应 128 等），从而能够正确解析。如果在 Android 上遇到解析错误，请检查 ONNX 模型的输入类型定义是否与预期一致，或者尝试重新导出模型以确保兼容性。","https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fissues\u002F26",[146,151,156,161,166,171,176,181,186,191,196,201,206,211,216,221,226,231,236,241],{"id":147,"version":148,"summary_zh":149,"released_at":150},108538,"2.2.0","## 变更内容\n\n### 新功能\n* **长度约束分段** (#164，由 @harikesavan 提供)：通过 `min_length` 和 `max_length` 参数控制分段长度。支持使用维特比算法（最优）或贪心算法，并可配置先验分布（`uniform`、`gaussian`、`lognormal`、`clipped_polynomial`），同时提供语言感知的默认设置。此功能适用于嵌入流水线、存储限制，或任何需要固定大小块的下游任务。详情请参阅 [docs\u002FLENGTH_CONSTRAINTS.md](.\u002Fdocs\u002FLENGTH_CONSTRAINTS.md)。\n\n### 问题修复与改进\n* **兼容 Transformers ≥5** (#172，由 @markus583 提供)：全面支持 `transformers` v5，同时保持与 v4 的向后兼容性。此外，移除了 `adapters` 库作为硬依赖——现在无需安装该库即可合并 LoRA 权重。\n* **自动检测 sm 模型上的 LoRA `num_labels`** (#170，由 @markus583 提供)：修复了将 `num_labels > 1` 训练的 LoRA Adapter 加载到 `-sm` 模型上时出现的形状不匹配错误（#168）。\n\n### 其他\n* 最低支持的 Python 版本：3.9\n* 将 Python 3.13 添加到 CI 矩阵中\n* CI 现在会运行新的长度约束分段测试\n\n## 
新贡献者\n* @harikesavan 在 #164 中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fcompare\u002F2.1.7...2.2.0","2026-02-26T12:51:28",{"id":152,"version":153,"summary_zh":154,"released_at":155},108539,"2.1.7","- 在某些 Python 版本中抑制上游依赖的烦人警告\n- 增加不合并 LoRA 权重的选项（出于效率考虑，仍默认合并）\n\n**完整更新日志**：https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fcompare\u002F2.1.6...2.1.7","2025-11-19T08:41:18",{"id":157,"version":158,"summary_zh":159,"released_at":160},108540,"2.1.6","## 变更内容\n* 由 @kevinhu 在 https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fpull\u002F157 中提升了后处理效率\n\n## 新贡献者\n* @kevinhu 在 https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fpull\u002F157 中完成了首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fcompare\u002F2.1.5...2.1.6","2025-06-23T03:49:37",{"id":162,"version":163,"summary_zh":164,"released_at":165},108541,"2.1.5","## 更改日志\n* 通过对分词器使用 `is None` 来避免不必要的 `__len__` 检查，从而带来*显著*的性能提升（https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fpull\u002F150）\n* 将 ONNX Runtime 的默认安装从 CPU 改为支持 GPU 和 CPU 的灵活安装（https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fpull\u002F152）\n* 允许使用预先下载的分词器，以便在离线状态下使用 SaT（https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fissues\u002F151）\n* 在设置 ONNX 模型对象时添加检查（https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fissues\u002F149）","2025-04-01T13:35:27",{"id":167,"version":168,"summary_zh":169,"released_at":170},108542,"2.1.4","- 由 @lsorber 引入可选的帽子权重机制\n- 澄清 LoRA 的适配方式\n- 澄清 `treat_newline_as_space` 参数：已重命名为 `split_on_input_newlines`。`treat_newline_as_space` 参数将在未来的版本中弃用。","2025-01-25T16:43:18",{"id":172,"version":173,"summary_zh":174,"released_at":175},108543,"2.1.2","- 修复 #142：当字符串仅由换行符、空白字符组成，或为空字符串时，出现 AssertionError。","2024-12-14T11:06:27",{"id":177,"version":178,"summary_zh":179,"released_at":180},108544,"2.1.1","- 修改 SaT.split 
中对换行符的默认行为。\n  - 现在，虽然模型会忽略换行符，但它们仍会被用作简单的后处理步骤来实现文本分割。\n- 修复了 LoRA 训练中的一些小 bug。\n- 更新了高级用法的 README 文件。","2024-10-27T14:19:33",{"id":182,"version":183,"summary_zh":184,"released_at":185},108545,"2.1.0","- 为 SaT 模型添加了 ONNX 支持。\n  - 包括导出脚本和更新的 README 文件。\n  - 因此，在 GPU 上的推理速度提升了 50%。","2024-09-24T21:37:49",{"id":187,"version":188,"summary_zh":189,"released_at":190},108546,"2.0.8","- 修复将短序列拆分为单个字符的问题 (#127) ","2024-09-09T10:49:52",{"id":192,"version":193,"summary_zh":194,"released_at":195},108547,"2.0.7","- 允许 numpy>=2.0\n- 修复适配代码\n- 添加一些注释","2024-09-02T13:26:28",{"id":197,"version":198,"summary_zh":199,"released_at":200},108548,"2.0.5","- Fixes potential CUDA device error when the input has exactly 511 tokens (#121).","2024-07-08T07:41:49",{"id":202,"version":203,"summary_zh":204,"released_at":205},108549,"2.0.4","- Fix a speed issue with SaT (https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit\u002Fissues\u002F118). Now it is (as expected) ~6x faster than WtP.","2024-07-01T09:32:54",{"id":207,"version":208,"summary_zh":209,"released_at":210},108550,"2.0.3","Implement SaT (https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.16678) and switch the default models to SaT🚀\r\n\r\nThe previous WtP models are still available but SaT is strictly better in accuracy and speed. 
See the updated README for details: https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit.\r\n\r\nSaT was implemented and developed by @markus583 @igorsterner.","2024-06-26T08:05:08",{"id":212,"version":213,"summary_zh":214,"released_at":215},108551,"1.3.0","- Fix a bug affecting some hash embeddings of the `canine-*` models which reduced accuracy (please upgrade to this version!).\r\n- Add a guide on adapting to your custom data: https:\u002F\u002Fgithub.com\u002Fbminixhofer\u002Fwtpsplit#advanced-usage.","2024-01-22T15:30:25",{"id":217,"version":218,"summary_zh":219,"released_at":220},108552,"1.2.3","- fix error with text where length is not a multiple of 4 and shorter than 512 characters in `canine-s-*` models (#98).","2023-07-18T13:47:50",{"id":222,"version":223,"summary_zh":224,"released_at":225},108553,"1.2.2","- add `strip_whitespace` flag.\r\n- fix bug with some zero-length sentences being returned if there is lots of trailing whitespace.","2023-07-14T15:55:06",{"id":227,"version":228,"summary_zh":229,"released_at":230},108554,"1.2.1","- fix argument propagation from model wrapper (#95 #97)","2023-07-11T18:19:08",{"id":232,"version":233,"summary_zh":234,"released_at":235},108555,"1.2.0","- Speed up pre- & postprocessing via better vectorization (#94).\r\n- Proper onnxruntime support for the `wtp-bert-*` models, although onnx models are currently not much faster (or even slower) than PyTorch models for some reason. 
Will continue to look into that.\r\n- Adds missing `pandas` requirement (fixing #92).\r\n- Lower bounds on `transformers` and other requirements to make sure all the functionality we need is there.\r\n- Removes `torch` from requirements since users will want to install it themselves depending on their hardware setup, and it's not required anymore when using only the onnx models.","2023-07-07T09:43:04",{"id":237,"version":238,"summary_zh":239,"released_at":240},108556,"1.1.0","- Added missing `get_threshold` function\r\n- `wtp.split` adapted to some style now also allows changing the  threshold via `wtp.split(..., threshold=threshold)`. Was previously overwritten by the default.","2023-06-17T09:58:18",{"id":242,"version":243,"summary_zh":244,"released_at":245},108557,"1.0.1","A major revamp of this library, now called `wtpsplit`! \r\n\r\nSee the Readme for details.","2023-05-31T11:32:41"]