[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-alasdairforsythe--tokenmonster":3,"tool-alasdairforsythe--tokenmonster":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",147882,2,"2026-04-09T11:32:47",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108111,"2026-04-08T11:23:26",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 
that is highly token-efficient, making it an ideal bridge between local files and AI analysis pipelines. It also provides an MCP (Model Context Protocol) server that integrates seamlessly with LLM applications such as Claude Desktop.\n\nThe tool is especially suited to developers, data scientists, and AI researchers, particularly those building retrieval-augmented generation (RAG) systems, running batch text analysis, or wanting an AI assistant to \"read\" local files directly. The output is reasonably human-readable too, but its core strength lies in serving machines…",93400,"2026-04-06T19:52:38",[52,14],"Plugin",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch is an open-source educational project based on PyTorch that walks you through building a ChatGPT-style large language model (LLM) from scratch, step by step. It is the official code repository for the book of the same name, and it offers a complete hands-on path covering model development, pretraining, and fine-tuning.\n\nThe project tackles the black-box learning problem in the large-model field: many developers can call ready-made models yet struggle to understand the internal architecture and training mechanics. By hand-writing every line of core code, users gain a thorough grasp of the Transformer architecture, attention mechanisms, and other key principles, and come to understand how large models actually \"think\". The project also includes code for loading large pretrained weights for fine-tuning, extending the theory into practical application.\n\nLLMs-from-scratch is ideal for AI developers, researchers, and computer-science students who want to dig into the underlying principles. For engineers who are not content with just calling an API and want to explore how models are built, it is an excellent learning resource. Its distinctive strength is a step-by-step teaching design: it breaks a complex engineering effort into clear stages, paired with detailed diagrams and examples, making it feasible to build a small but fully functional model. Whether you want to solidify your theoretical foundations or prepare for developing larger models in the future…",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":75,"owner_location":75,"owner_email":75,"owner_twitter":75,"owner_website":75,"owner_url":76,"languages":77,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":32,"env_os":98,"env_gpu":99,"env_ram":100,"env_deps":101,"category_tags":106,"github_topics":107,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":117,"updated_at":118,"faqs":119,"releases":149},5877,"alasdairforsythe\u002Ftokenmonster","tokenmonster","Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript","TokenMonster is an ungreedy subword tokenizer and vocabulary trainer designed for Python, Go, and JavaScript. It targets the waste of compute caused by large, inefficient vocabularies in large language models, helping models run faster and cheaper and generate longer streams of text.\n\nWith its distinctive ungreedy tokenization algorithm, TokenMonster explores multiple branch paths at once to pick the best combination of tokens. Compared with traditional methods, it can represent text with about 37.5% fewer tokens at the same vocabulary size, or shrink the vocabulary by 75% or more without losing performance. That means developers can train smarter models with fewer resources, or significantly extend a model's context window. It can also import the vocabularies of existing mainstream models (such as GPT-2 and LLaMa), so users keep their prior training investment while enjoying faster inference.\n\nTokenMonster is well suited to AI researchers, large-model developers, and engineering teams that need to optimize inference efficiency. It provides more than 400 pretrained vocabularies ready to use, and lets users quickly train custom vocabularies for specific datasets on an ordinary desktop. Built-in optimization modes and friendly handling of special characters and HTML tags make it a practical tool for improving a language model's overall efficiency.","# TokenMonster\n\n**UPDATE:** [Benchmark results from pretraining 16 language models on different tokenizers.](.\u002Fbenchmark\u002Fpretrain.md)\n\nTokenMonster is an ungreedy subword tokenizer and vocabulary generator, enabling language models to run faster, cheaper, smarter and generate longer streams of text.\n\n\u003Cimg width=\"661\" alt=\"tokenmonster\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Falasdairforsythe_tokenmonster_readme_d9a0d7c38ee5.png\">\n\nLarge and sub-optimal vocabularies lead to the waste of computational and memory resources in language models. By switching to TokenMonster, you can potentially achieve the same or better performance with a vocabulary that is [less than a quarter of the size](https:\u002F\u002Falasdair.com\u002Ftokenmonster\u002Fbenchmark.html?a=tiktoken%20cl100k_base&b=englishcode-24000-clean-v1&c=englishcode-16000-unfiltered-v1).\n\nTokenMonster can train and generate an optimal vocabulary on a 1 GB dataset within 24 hours on a typical desktop. 
442 [pretrained vocabularies](#pretrained-vocabularies) are provided, as well as tools to train your own vocabularies & implementations in Go, Python & Javascript for tokenization and detokenization using the pretrained or your own vocabularies.\n\nYou can [test TokenMonster in your browser here](https:\u002F\u002Falasdair.com\u002Ftokenmonster\u002F), tokenizing live in native Javascript.\n\nTokenMonster is a novel approach to tokenization with broad-ranging use potential, but its primary motivation is to improve the training, inference and context-length of large language models. By using a more optimal vocabulary and ungreedy tokenization algorithm, text can be represented with [37.5% fewer tokens at the same vocabulary size](https:\u002F\u002Falasdair.com\u002Ftokenmonster\u002Fbenchmark.html?a=gpt2%20tokenmonster&b=tiktoken%20p50k_base&c=englishcode-50256-clean-v1) compared to other modern tokenizing methods, increasing the speed of inference, training and the length of text. And\u002For the vocabulary size can be [reduced by 75% or more](https:\u002F\u002Falasdair.com\u002Ftokenmonster\u002Fbenchmark.html?a=tiktoken%20cl100k_base&b=tiktoken%20p50k_base&c=englishcode-24000-clean-v1), freeing resources that can be used to make the model smarter and faster.\n\nYou can also import existing vocabularies from other tokenizers, allowing you to take advantage of TokenMonster's fast, ungreedy tokenization whilst still using the existing vocabulary your model was trained for. TokenMonster vocabularies for GPT2 Tokenizer and LLaMa Tokenizer are included.\n\n## Features\n- Outperforms other tokenization algorithms in every area ([benchmark](.\u002Fbenchmark))\n- Selects the optimal vocabulary for a given dataset\n- 5 [optimization modes](#optimization-modes) to choose from: `unfiltered`, `clean`, `balanced`, `consistent`, `strict`\n- Ungreedy: follows up to 6 parallel branches at a time\n- Fast: follows 6 branches faster than other algorithms can follow 1 ([benchmark](.\u002Fbenchmark))\n- Utilizes [capcode](#capcode) marker tokens to encode uppercasing and forward delete\n- Successfully identifies words, subwords, common phrases and figures of speech by itself\n- Works with HTML tags, sequential spaces, tabs, etc. without wasting context\n- Can be trained on any language\n- Achieves up to 7 chr\u002Ftoken (depending on vocabulary size & optimization mode)\n- Vocabularies can be modified and resized after training\n- Full support for \"special\" and \"single-byte\" tokens\n- Import and export vocabularies to and from human-readable YAML format\n- 422 pretrained vocabularies ready for use\n\n## Table of Contents\n\n* Usage [Go](.\u002Fgo\u002F) | [Python](.\u002Fpython\u002F) | [Javascript](.\u002Fjavascript\u002F) | [Training](.\u002Ftraining\u002F)\n* [Benchmark](.\u002Fbenchmark)\n* [Pretrained Vocabularies](#pretrained-vocabularies)\n* [Optimization Modes](#optimization-modes)\n* [Vocabulary Selection Guidance](#vocabulary-selection-guidance)\n* [Capcode](#capcode)\n* [Normalization](#normalization)\n* [How does it work and how is it different from BPE?](#how-does-it-work-and-how-is-it-different-from-bpe)\n* [The Ungreedy Tokenization Algorithm](#the-ungreedy-tokenization-algorithm)\n* [Datasets](#datasets)\n* [Support & Consultation](#support--consultation)\n\n## Pretrained Vocabularies\n\n442 vocabularies are planned or have already been built. 
Download them from [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Falasdairforsythe\u002Ftokenmonster), or in the Python library you can simply specify them by name and they'll be downloaded automatically. (Note: the pretrained vocabularies are still being trained, check [here](https:\u002F\u002Fhuggingface.co\u002Falasdairforsythe\u002Ftokenmonster\u002Ftree\u002Fmain\u002Fvocabs) to see which are currently available.)\n\n- Choose a dataset from: `code` `english` `englishcode` `fiction`\n- Choose a vocab size from: `1024` `2048` `4096` `8000` `16000` `24000` `32000` `40000` `50256` `65536` `100256`\n- Choose an [optimization mode](#optimization-modes) from: `unfiltered` `clean` `balanced` `consistent` `strict`\n- For a [capcode](#capcode) disabled vocabulary add: `nocapcode`\n- Finally add the version number: `v1`\n\nExamples: `fiction-24000-strict-v1` `code-4096-clean-nocapcode-v1`\n\nUsage:\n```python\nimport tokenmonster\nvocab = tokenmonster.load(\"englishcode-32000-consistent-v1\")\ntokens = vocab.tokenize(\"This is a test.\")\n```\nThere are 2 additional pre-built vocabularies: `gpt2` and `llama`. These are imports of GPT2 Tokenizer and LLaMa Tokenizer from Hugging Face Transformers into TokenMonster. The tokens and IDs are identical, however they do not always tokenize the text in exactly the same way. For example, LLaMa Tokenizer on Hugging Face tokenizes \" decoded\" as ` dec` `oded`, whilst TokenMonster tokenizes [correctly] to ` decode` `d`. TokenMonster trained vocabularies are massively more efficient, so only use `gpt2` and `llama` if you have to. The scripts used to import them into TokenMonster are [here](.\u002Fyaml_guide).\n```python\nvocab = tokenmonster.load(\"gpt2\")\n```\n\n## Optimization Modes\n\nAll the optimization modes are lossless. The stricter the optimization mode (higher number), the more tokens will be used to tokenize the same text, but it'll be much easier for the language model to learn because the grammar is simpler. Less strict (lower number), more text can be represented with fewer tokens, but the language model will have to learn a more complicated grammar.\n\n`0 unfiltered` allows the training process to freely determine the tokens. `clean` is preferred in almost every case, because `unfiltered` tends to result in overfitting, especially for code as it results in tokens for things like `\n\t\t\t\tif (`. Use `unfiltered` for tokenizing language or data that does not use spaces as word boundaries.\n\n`1 clean` introduces filters to avoid overfitting. It forces the vocabulary to begin words with a space, and limits the way in which whitespace can be combined with other characters.\n\n`2 balanced` prioritizes whole words and attempts to dissuade the vocabulary from doing things that are difficult to learn.\n\n`3 consistent` is a looser version of `strict`. It aims to limit the number of different tokens that can represent the same word or phrase, and doesn't allow for open-close delimiters to be combined with words or each other. Numbers also become limited to fewer variants.\n\n`4 strict` aims to have only 1 token per word, no matter how it is encoded. For example `However`, ` however,` and `HOWEVER!` will all use the same ` however` token, in combination with other tokens that indicate its spacing and capitalization.\n\n
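The snippet below is a minimal sanity check of that behaviour (an illustrative sketch, not from the official docs: it assumes the `tokenmonster` Python package, network access for the vocabulary download, and that `english-24000-strict-v1` follows the naming scheme above):\n```python\nimport tokenmonster\n\n# Case and punctuation variants of one word under a `strict` vocabulary.\nvocab = tokenmonster.load(\"english-24000-strict-v1\")\nfor text in (\"However\", \" however,\", \"HOWEVER!\"):\n    tokens = vocab.tokenize(text)\n    print(text, \"->\", len(tokens), \"tokens\")\n    # All optimization modes are lossless, so the round-trip must hold:\n    assert vocab.detokenize(tokens) == text\n```\n\n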
## Vocabulary Selection Guidance\n\nView the [TokenMonster Vocabulary Comparison](https:\u002F\u002Falasdair.com\u002Ftokenmonster\u002Fline.html) to see a line chart of the relationship between vocab size, optimization mode and characters\u002Ftoken. From this chart I can state the rule of thumb that **every doubling of vocabulary size increases the characters\u002Ftoken by 0.5**. This pattern starts from vocab size 4096 and is consistent up to 100256.\n\nIt's tempting to use large vocabularies, which has been the norm, but you can see on the [TokenMonster Tester](https:\u002F\u002Falasdair.com\u002Ftokenmonster\u002F) and [Interactive Benchmark](https:\u002F\u002Falasdair.com\u002Ftokenmonster\u002Fbenchmark.html) that reducing the vocabulary by 50 - 75% can often result in only a relatively minor increase in the number of tokens required to tokenize the same text. Even the very general `englishcode` vocabularies, which are for all intents and purposes multi-lingual, do very well at vocab size `24000`. Story or article writing models can go as low as `4096` vocabulary size [and still tokenize at 4 characters per token](https:\u002F\u002Falasdair.com\u002Ftokenmonster\u002Fbenchmark.html?a=fiction-8000-balanced-v1&b=fiction-4096-balanced-v1&c=fiction-2048-balanced-v1).\n\nTokenMonster works well with small vocabularies because it's using an optimal selection process. In most cases it's simply not necessary to use vocabulary sizes greater than `32000`, unless it's a multi-lingual vocabulary. More is not better. Using a vocabulary that is excessively large can lead to inefficient usage of embeddings, not to mention an over-complicated grammar. The embeddings for all those unneeded tokens occupy memory and computational resources that could be used more efficiently.\n\nIn my opinion, the 100K vocabulary size is excessive and wasteful, unless your aim is to support at least three languages in the same vocabulary. With a 100K size, you have \"spare\" tokens. By \"spare\", I mean that the vocabulary starts assigning tokens to lengthy, specific sequences like \"limitations under the License\" and \"#### According to\", suggesting that the vocabulary has reached its optimal size and is now just compressing frequently occurring strings.\n\nMy advice is to find the smallest vocabulary size that meets your requirements. With this, you can either be content with a smaller, faster model, or opt to augment the size of the embeddings accordingly, or find a balance between the two.\n\n
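To make that concrete, here is a small sketch (my illustration, not from the project docs; the corpus path is a placeholder and the vocabulary names assume the `{dataset}-{size}-{mode}-v1` scheme) that measures characters\u002Ftoken on your own data before committing to a size:\n```python\nimport tokenmonster\n\n# Placeholder path: substitute a representative sample of your own corpus.\nsample = open(\"my_corpus_sample.txt\", encoding=\"utf-8\").read()\n\nfor size in (4096, 16000, 32000):\n    vocab = tokenmonster.load(f\"english-{size}-clean-v1\")\n    n_tokens = len(vocab.tokenize(sample))\n    print(f\"english-{size}-clean-v1: {len(sample) \u002F n_tokens:.2f} chars\u002Ftoken\")\n```\n\n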
In regards to optimization modes, `strict` is the one to go for if your model is limited by its size or largely undertrained. If it's a small model that isn't particularly smart, and you want to get the most out of it, choose `strict` because it'll probably result in a smarter model given that the simpler grammar is quicker to learn (words, punctuation and modifiers are all separate tokens.) On the other hand, if you're training something serious with enough training data so that each token is exposed to a variety of contexts in order to learn its more complex grammar, you probably want to go for `clean` or `balanced`.\n\n`strict` performs very well with longform natural text, such as novels and articles, but it's too strict for code. `consistent` will give the best balance of consistency for tokenizing code whilst keeping the grammar simple. `balanced` and `clean` are excellent at compressing code into fewer tokens, but this comes with the trade-off of more complex grammar. That said, a smaller vocabulary implies a simpler grammar (fewer possible combinations), so it may be in your interest to aim for `balanced` with a fairly small vocabulary size, such as `16000`. All of this you can determine by playing around with [TokenMonster Tester](https:\u002F\u002Falasdair.com\u002Ftokenmonster\u002F).\n\n## Capcode\n\n[Capcode](https:\u002F\u002Fgithub.com\u002Falasdairforsythe\u002Fcapcode) is an alternative encoding for uppercase in UTF-8 text, supporting all UTF-8 characters. It's completely lossless, changing the way in which capital letters are encoded so they can share tokens with lowercase letters but without losing any information. In theory, capcode makes it easier for a model to learn the meaning of words. Additionally, capcode makes for more efficient tokenization because it frees up so many tokens that would otherwise be used for uppercase variants of already existing lowercase tokens.\n\n## Normalization\n\nTokenMonster is designed to be plug-and-play, taking care of normalization concerns for you. UTF-8 and UTF-16 vocabularies are automatically NFD normalized and encoded Little Endian regardless of architecture. When tokenizing, the exact same transformations are applied transparently, so you can pass a string to either UTF-8 or UTF-16 vocabularies, with or without capcode, and on either Little or Big Endian architecture, and it will be processed correctly.\n\nNo normalizations are applied to charset \"None\" vocabularies. If you're not sure which to choose, UTF-8 is preferred.\n\n## How does it work and how is it different from BPE?\n\nByte-Pair-Encoding starts with single byte tokens and merges frequently occurring tokens together iteratively, growing the vocabulary out of single characters. TokenMonster takes an entirely different approach, beginning with all possible tokens, and distilling the vocabulary down to the vocab size using a method inspired by chemical distillation. TokenMonster thereby does not run into the issue BPE has: once a branch is chosen, it's assumed to be beneficial, and although it can later be pruned, the alternative branch that might have performed better has already been lost.\n\nThe secret sauce that enables TokenMonster to outperform other algorithms is made from:\n1. The distillation method is an effective means of separating that which is wanted from that which is not, without losing any of the cream.\n2. The training process targets the tokenization method being used. The vocabulary is generated to be optimal for the specific tokenization algorithm and dataset, which is a necessary step for optimal tokenization.\n\nIn simplified terms it does the following:\n- Generates all possible tokens in the dataset (40 billion in 1 GB of text)\n- Deletes all tokens that have no more than 100 occurrences (4 million)\n- Generates random vocabularies of vocab_size\n- Tokenizes the dataset using the target tokenization algorithm with the random vocabulary\n- Deletes the 1% \"worst\" scoring tokens\n- Repeat hundreds of thousands of times\n- When vocab_size is reached, resurrect potential tokens\n- Keep doing this until a more optimal vocabulary cannot be found 1000 times in a row\n\n
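A toy sketch of that loop (my own illustrative pseudocode, not the project's actual Go implementation; `tokenize` and `token_score` are hypothetical stand-ins for the target algorithm and its scoring rules):\n```python\nimport random\n\ndef distill(candidates, dataset, vocab_size, tokenize, token_score):\n    # candidates: every token seen more than 100 times in the dataset\n    pool = list(candidates)\n    while len(pool) > vocab_size:\n        # Sample a trial vocabulary and tokenize with the *target* algorithm\n        trial_vocab = random.sample(pool, vocab_size)\n        result = tokenize(dataset, trial_vocab)\n        # Prune the 1% worst-scoring tokens from the pool\n        ranked = sorted(trial_vocab, key=lambda t: token_score(t, result))\n        doomed = set(ranked[: max(1, len(ranked) \u002F\u002F 100)])\n        pool = [t for t in pool if t not in doomed]\n    # The real trainer also \"resurrects\" candidate tokens once vocab_size is\n    # reached, and stops only after 1000 consecutive non-improvements.\n    return pool\n```\n\n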
TokenMonster does not need any information about the language or structure, and results in a neat list of words, subwords and common phrases. Sample:\n```\na number of \na series of \na wonderful \nability and \nable to get \nabout being \nabout their \naccount for \nacknowledge \nacquisition \naddition to \naddress the \nadvertising \naffected by \nafter being \nagainst the \n```\n\n## The Ungreedy Tokenization Algorithm\n\nTokenMonster uses an ungreedy tokenization method in which each token has up to 2 alternatives selected during training, which are subwords of the token itself. First the longest token that matches the next segment of text is selected in a greedy fashion. The alternative tokens are looked up on an index that is included in the vocabulary file. The longest token matching the following text segment is found for the original and its alternatives, giving 3 possible branches. If any of those do not end on a word boundary, a further branch is followed utilizing a forward delete token, which allows for words beginning with a space to be used as parts of other words. The 6 total branches are scored based on various rules; the optimal branch is chosen and the tokenization continues along that branch.\n\nBecause the training process targets the tokenization algorithm, the training is not only selecting for tokens but selecting for the relationship between tokens in the vocabulary.\n\n
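As a rough illustration of that branch search, here is a self-contained toy of my own (not the library's real implementation: it ignores capcode and the forward-delete branch, and scores branches purely by covered text length rather than the real rule-based scorer):\n```python\ndef ungreedy_tokenize(text, vocab, alternatives):\n    # Toy model: score a branch by how much text the head token plus its\n    # follow-up token cover, then commit to the best branch.\n    def longest(pos):\n        if pos >= len(text):\n            return \"\"\n        # try matches up to 12 characters long (arbitrary cap for the toy)\n        for end in range(min(len(text), pos + 12), pos, -1):\n            if text[pos:end] in vocab:\n                return text[pos:end]\n        return text[pos]  # fall back to a single character\n\n    out, pos = [], 0\n    while pos < len(text):\n        head = longest(pos)\n        branches = []\n        # up to 2 trained subword alternatives per token\n        for tok in [head] + alternatives.get(head, []):\n            if text.startswith(tok, pos):\n                nxt = longest(pos + len(tok))\n                branches.append((len(tok) + len(nxt), tok))\n        _, best = max(branches)\n        out.append(best)\n        pos += len(best)\n    return out\n\nvocab = {\"a\", \" a\", \" series\", \" serie\", \" s\", \"s\", \" of\"}\nalts = {\" series\": [\" serie\", \" s\"]}\nprint(ungreedy_tokenize(\"a series of\", vocab, alts))  # ['a', ' series', ' of']\n```\n\n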
## Datasets\n\nThe datasets used for generating the pretrained vocabularies are all available on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Falasdairforsythe\u002Ftext-english-code-fiction-nonfiction). The sources and scripts used to generate these datasets are included in the training directory.\n\nThe training data mostly came from Red Pajamas [1B Token Sample](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftogethercomputer\u002FRedPajama-Data-1T-Sample). However, to reduce formal English and emphasize other languages, informal writing and code, c4_sample & cc_sample were cropped to 100MB, and [Reddit conversations](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FSophieTr\u002Freddit_clean) data were added (also cropped to 100MB.)\n\nAdditionally, equally weighted code samples of 2MB per language (code_2mb) and 10MB per language (code_10mb) were added for 30 different programming languages to ensure all programming languages have representation. The source of this is [codeparrot\u002Fgithub-code](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fcodeparrot\u002Fgithub-code). To ensure a range of coding styles, I allowed only 1 file per GitHub repository, and per file a maximum of 200 lines selected from the middle of the file.\n\nGiven the evolving nature of writing styles, I felt that book_sample.txt, which consists of out-of-copyright books, was not a good representation of contemporary fiction. To better represent a more modern style, I curated fiction.txt and fiction_100mb.txt by throwing together a few other datasets and cleaning them up.\n\nNote: fiction_100mb.txt is a subset of fiction.txt, and code_2mb.txt is a subset of code_10mb.txt.\n\n### english\n\n| Filename                 | Filesize (bytes) |\n|--------------------------|-----------|\n| arxiv_sample.txt         | 88,925,569  |\n| book_sample.txt          | 108,069,616 |\n| c4_sample.txt            | 100,560,318 |\n| cc_2023-06_sample.txt    | 100,852,231 |\n| fiction_100mb.txt        | 94,235,489  |\n| stackexchange_sample.txt | 71,940,138  |\n| wikipedia_sample.txt     | 79,181,873  |\n| reddit.txt               | 100,027,565 |\n|                          | **743,792,799** |\n\n### englishcode\n\n| Filename                 | Filesize (bytes) |\n|--------------------------|-----------|\n| arxiv_sample.txt         | 88,925,569  |\n| book_sample.txt          | 108,069,616 |\n| c4_sample.txt            | 100,560,318 |\n| cc_2023-06_sample.txt    | 100,852,231 |\n| code_2mb.txt             | 62,895,904  |\n| fiction_100mb.txt        | 94,235,489  |\n| github_sample.txt        | 191,123,094 |\n| stackexchange_sample.txt | 71,940,138  |\n| wikipedia_sample.txt     | 79,181,873  |\n| reddit.txt               | 100,027,565 |\n|                          | **997,811,797** |\n\n### fiction\n\n| Filename                 | Filesize (bytes) |\n|--------------------------|-----------|\n| book_sample.txt          | 108,069,616 |\n| fiction.txt              | 357,119,086  |\n| reddit.txt               | 100,027,565 |\n|                          | **565,216,267** |\n\n### code\n\n| Filename                 | Filesize (bytes) |\n|--------------------------|-----------|\n| code_10mb.txt            | 314,006,799 |\n| github_sample.txt        | 191,123,094 |\n| stackexchange_sample.txt | 71,940,138  |\n|                          | **577,070,031** |\n\n\nThe following programming and markup languages are represented in both \"englishcode\" and \"code\" vocabularies:\n\n| Language     |      |      |      |      |\n|--------------|--------------|--------------|--------------|--------------|\n| Assembly     | Batchfile    | C            | C#           | C++          |\n| CMake        | CSS          | Dockerfile   | FORTRAN      | Go           |\n| Haskell      | HTML         | Java         | JavaScript   | Julia        |\n| Lua          | Makefile     | Markdown     | PHP          | Perl         |\n| PowerShell   | Python       | Ruby         | Rust         | SQL          |\n| Scala        | Shell        | TypeScript   | TeX          | Visual Basic |\n\n## Support & Consultation\n\nUse the \"Discussions\" tab for free support on how to use TokenMonster. 
You can also hire me for a paid consultation on how to get the best out of TokenMonster, or to generate a vocabulary for you according to your specific requirements.\n","
","# TokenMonster Quick Start Guide\n\nTokenMonster is an ungreedy subword tokenizer and vocabulary generation tool. Compared with traditional methods, it can significantly reduce token counts (about 37.5% fewer at the same vocabulary size) or shrink the vocabulary to a quarter of its size while maintaining performance, improving the training speed, inference efficiency, and context length of large language models.\n\n## Prerequisites\n\n*   **Operating system**: Linux, macOS, Windows (any typical desktop environment)\n*   **Language**: Python (this guide focuses on Python; Go and JavaScript are also supported)\n*   **Dependencies**:\n    *   Python 3.x\n    *   the `tokenmonster` Python library\n    *   a network connection (pretrained vocabularies are downloaded automatically; on restricted networks, configure a proxy or download them manually)\n\n## Installation\n\nInstall the official library with pip:\n\n```bash\npip install tokenmonster\n```\n\n> **Note**: the first time you load a pretrained vocabulary, the library downloads the files from Hugging Face. If Hugging Face is slow to reach from your network, consider configuring a mirror or proxy.\n\n## Basic Usage\n\n### 1. Load a pretrained vocabulary and tokenize\n\nTokenMonster provides pretrained vocabularies for different datasets (code, English, fiction, and more) and optimization modes. The simplest usage:\n\n```python\nimport tokenmonster\n\n# Load a pretrained vocabulary\n# Naming scheme: {dataset}-{vocab size}-{optimization mode}-v1\n# Example: englishcode-32000-consistent-v1\nvocab = tokenmonster.load(\"englishcode-32000-consistent-v1\")\n\n# Tokenize\ntext = \"This is a test.\"\ntokens = vocab.tokenize(text)\n\nprint(tokens)\n# Example output: [154, 23, 64, 892, 12] (actual IDs depend on the vocabulary)\n\n# Detokenize\noriginal_text = vocab.detokenize(tokens)\nprint(original_text)\n```\n\n### 2. Use an existing model's vocabulary (GPT-2 \u002F LLaMa)\n\nIf you need compatibility with an already-trained model (such as GPT-2 or LLaMa), TokenMonster can import those vocabularies while applying its more efficient ungreedy tokenization:\n\n```python\nimport tokenmonster\n\n# Load the GPT-2 vocabulary\nvocab = tokenmonster.load(\"gpt2\")\ntokens = vocab.tokenize(\"Hello world\")\n\n# Load the LLaMa vocabulary\n# vocab = tokenmonster.load(\"llama\")\n```\n\n
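Before moving on, a quick comparison sketch (my own illustration; it assumes network access and uses `englishcode-50256-clean-v1`, a TokenMonster-trained vocabulary of roughly GPT-2's size) shows why the imported vocabularies are only a compatibility fallback:\n\n```python\nimport tokenmonster\n\n# Same text, imported GPT-2 vocabulary vs. a TokenMonster-trained one.\ntext = \"The quick brown fox jumps over the lazy dog.\"\nfor name in (\"gpt2\", \"englishcode-50256-clean-v1\"):\n    vocab = tokenmonster.load(name)\n    print(name, \"->\", len(vocab.tokenize(text)), \"tokens\")\n```\n\n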
### 3. Choose a vocabulary strategy\n\nPick the vocabulary name parameters that fit your needs:\n\n*   **Dataset type**: `code`, `english`, `englishcode`, `fiction`\n*   **Vocab size**: `4096`, `8000`, `16000`, `24000`, `32000`, and so on\n*   **Optimization mode**:\n    *   `clean` (recommended): prevents overfitting; the most general-purpose choice.\n    *   `strict`: simplest grammar; suits small models or longform text generation, aiming for one token per word.\n    *   `balanced`: balances whole words against subwords.\n    *   `consistent`: suits code; keeps tokenization consistent.\n    *   `unfiltered`: only for special language data that does not use spaces as word boundaries.\n\n**Example: load a smaller vocabulary optimized for code**\n```python\n# Size 4096, clean mode, capcode disabled (if needed)\nvocab = tokenmonster.load(\"code-4096-clean-nocapcode-v1\")\n```\n\nBy shrinking the vocabulary and using TokenMonster's efficient algorithm, you can sharply reduce the model's VRAM footprint and speed up inference without giving up much compression.","A startup team is building a vertical large language model for long-document analysis and needs to process large volumes of legal contract text within limited VRAM.\n\n### Without tokenmonster\n- **Limited context length**: traditional tokenizers (such as BPE) produce too many tokens, so long contracts get truncated and the model cannot capture the full logic of the clauses.\n- **High training cost**: covering legal terminology forces a huge vocabulary (over 100K entries), significantly increasing VRAM usage and training time.\n- **Slow inference**: greedy matching is inefficient on complex sentences, so answering user queries about contract details in real time has high latency.\n- **Formatting wastes resources**: indentation, line breaks, and HTML tags in contracts get split into many meaningless standalone tokens, eating into the precious context window.\n\n### With tokenmonster\n- **Context capacity multiplied**: with the ungreedy algorithm and multi-branch matching, token counts drop by 37.5% at the same vocabulary size, so the model can read a far longer contract in one pass.\n- **Better resource utilization**: the vocabulary shrinks to under a quarter of its original size at equal or better performance, greatly reducing VRAM needs and training cost.\n- **Noticeably faster responses**: faster tokenization and decoding shorten each inference pass, so users get near-real-time analysis of legal clauses.\n- **Structure preserved intact**: spaces, tabs, and tags are recognized and encoded efficiently, so contract layout is never lost and no context budget is wasted.\n\nWith better vocabulary construction and an ungreedy tokenization strategy, tokenmonster let the team deploy a high-quality legal model with longer context and higher speed on low-cost hardware.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Falasdairforsythe_tokenmonster_d9a0d7c3.png","alasdairforsythe","Alasdair","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Falasdairforsythe_120ec183.png",null,"https:\u002F\u002Fgithub.com\u002Falasdairforsythe",[78,82,86,90],{"name":79,"color":80,"percentage":81},"Go","#00ADD8",74.9,{"name":83,"color":84,"percentage":85},"Python","#3572A5",14.5,{"name":87,"color":88,"percentage":89},"JavaScript","#f1e05a",7.9,{"name":91,"color":92,"percentage":93},"HTML","#e34c26",2.7,621,21,"2026-03-28T01:43:03","MIT","Not specified (Go, Python, and Javascript implementations are provided, which usually implies cross-platform)","Not specified (mainly for vocabulary generation and inference rather than deep-learning training; usually CPU-only)","Not specified (the README mentions handling a 1 GB dataset on a typical desktop)",{"notes":102,"python":103,"dependencies":104},"This tool is a subword tokenizer and vocabulary generator, not a large language model itself. It runs on a typical desktop; training a vocabulary on a 1 GB dataset takes about 24 hours. Pretrained vocabularies can be downloaded automatically via Hugging Face. Import of the GPT2 and LLaMa vocabularies is supported. Its main advantage is achieving higher compression and faster inference with a much smaller vocabulary (reductions of 75% or more are possible).","Not specified",[105],"Not specified (native Go, Python, and Javascript implementations; no specific third-party library dependencies listed)",[35,14],[108,109,110,111,112,113,114,115,116],"tokenisation","tokenization","tokenize","tokenizer","tokenizing","vocabulary","vocabulary-builder","text-tokenization","vocabulary-generator","2026-03-27T02:49:30.150509","2026-04-09T21:35:28.549661",[120,125,130,134,139,144],{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},26684,"How do I troubleshoot the vague \"Cannot open or save vocabulary file\" error?","The error message is vague because the Python wrapper cannot capture the specific error code from the underlying Go subprocess.\nSuggestions:\n1. Use the native Go implementation directly: bypass the Python wrapper and run the original Go command-line tools (such as `tokenmonsterserver` or the other executables); they report more detailed, specific errors (such as a concrete permission denial or format parsing failure).\n2. Check the file format: confirm the vocabulary file is in TokenMonster's supported `.vocab` or YAML format rather than some other unsupported text format.","https:\u002F\u002Fgithub.com\u002Falasdairforsythe\u002Ftokenmonster\u002Fissues\u002F21",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},26679,"How do I tokenize text with only a few character types, such as biological sequences (DNA, RNA, proteins)?","Keep the following in mind for this kind of data:\n1. Parameter consistency: the `-capcode` flag must be used (or omitted) uniformly across all steps (getalltokens, trainvocab, exporttokens). For biological sequences it is best to omit `-capcode` entirely.\n2. Parameter synchronization: `-max-token-length` must be set in both the `getalltokens` and `trainvocab` commands, not just one of them.\n3. Adjust the minimum occurrence count: the `min-occur` default is tuned for a 1 GB dataset; smaller biological-sequence datasets need a suitable value set manually.\n4. 
Carefully check the parameters of every command you run and make sure the values suit your particular data type.","https:\u002F\u002Fgithub.com\u002Falasdairforsythe\u002Ftokenmonster\u002Fissues\u002F8",{"id":131,"question_zh":132,"answer_zh":133,"source_url":124},26680,"Which vocabulary file formats does TokenMonster support? Why does loading RWKV's txt-format vocabulary fail?","TokenMonster does not support plain-text vocabulary files. Only two formats are supported:\n1. TokenMonster's native `.vocab` binary format.\n2. YAML files (see the example: https:\u002F\u002Fgithub.com\u002Falasdairforsythe\u002Ftokenmonster\u002Fblob\u002Fmain\u002Fyaml_guide\u002Fexample.yaml).\n\nSolution: write a small script to convert the existing TXT vocabulary to YAML, then import it into a supported format with the `exportvocab` executable or the Go\u002FPython libraries.",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},26681,"Why does generated text contain an unusually high proportion of all-caps? Is this related to Capcode?","Yes, this is usually caused by the `capcode` feature. Capcode uses \"B\" (begin) and \"E\" (end) markers to delimit runs of 3 or more uppercase words, and the model can learn incorrect case-switching behaviour from them.\n\nSolutions:\n1. Short term: train with a vocabulary that has `capcode` disabled.\n2. Long term: wait for the updated release; the maintainer is reworking the capcode implementation to drop the B\u002FE markers in favour of a single per-word modifier, which removes this side effect.","https:\u002F\u002Fgithub.com\u002Falasdairforsythe\u002Ftokenmonster\u002Fissues\u002F6",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},26682,"What do the latest TokenMonster releases improve in performance and tokenization strategy?","The latest releases (1.0 and follow-ups) bring these main improvements:\n1. Space-placement strategy: all vocabularies now prefer splitting the space in front of the word (space+word), which yields at least a 10% performance gain and better consistency on English data.\n2. Large speed gains: tokenization is 15x faster in the Go implementation and 60x faster in Python. The lookup logic was optimized to approach the theoretical limit of looping over the text.\n3. Vocabulary efficiency: the unified space strategy stops token IDs being wasted on different combinations of spaces, tabs, and newlines, which lets the \"balanced\" vocabularies outperform the \"compressed\" ones on English.","https:\u002F\u002Fgithub.com\u002Falasdairforsythe\u002Ftokenmonster\u002Fissues\u002F10",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},26683,"How do I handle very large datasets (250GB+) with limited resources?","The exact commands were truncated in the discussion, but based on the issue description and standard best practice, the suggested workflow is:\n1. Extract in chunks: memory limits rule out a single `getalltokens` run, so split the dataset into smaller chunks, run `getalltokens` on each, and merge the resulting token files with `mergetokens`.\n2. Training strategy: for `trainvocab`, if a single run would take too long or exceed your resources, chunking or sampling is likewise an option, provided the statistics stay representative. The maintainer generally recommends training on the merged token list so the global frequency statistics remain accurate.","https:\u002F\u002Fgithub.com\u002Falasdairforsythe\u002Ftokenmonster\u002Fissues\u002F40",[]]