[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-karpathy--minbpe":3,"similar-karpathy--minbpe":67},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":8,"readme_en":9,"readme_zh":10,"quickstart_zh":11,"use_case_zh":12,"hero_image_url":13,"owner_login":14,"owner_name":15,"owner_avatar_url":16,"owner_bio":17,"owner_company":18,"owner_location":19,"owner_email":20,"owner_twitter":18,"owner_website":21,"owner_url":22,"languages":23,"stars":28,"forks":29,"last_commit_at":30,"license":31,"difficulty_score":32,"env_os":33,"env_gpu":34,"env_ram":35,"env_deps":36,"category_tags":39,"github_topics":18,"view_count":41,"oss_zip_url":18,"oss_zip_packed_at":18,"status":42,"created_at":43,"updated_at":44,"faqs":45,"releases":66},1738,"karpathy\u002Fminbpe","minbpe","Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.","minbpe是一个轻量级的字节对编码（BPE）分词器实现，专为大语言模型（LLM）设计。它用简洁的Python代码直接处理UTF-8字节流，避免传统分词器的复杂依赖。仓库包含基础版、正则版和GPT-4兼容版三种分词器，其中RegexTokenizer能精确复现GPT-4的分词效果，确保跨类别边界不合并。支持训练、编码、解码三大核心功能，每个文件都附带详细示例，适合研究人员和开发者学习BPE原理或快速定制分词器。代码注释清晰，适用于需要自定义分词逻辑的项目，同时保持代码的可读性和可维护性。","# minbpe\n\nMinimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is \"byte-level\" because it runs on UTF-8 encoded strings.\n\nThis algorithm was popularized for LLMs by the [GPT-2 paper](https:\u002F\u002Fd4mucfpksywv.cloudfront.net\u002Fbetter-language-models\u002Flanguage_models_are_unsupervised_multitask_learners.pdf) and the associated GPT-2 [code release](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgpt-2) from OpenAI. [Sennrich et al. 2015](https:\u002F\u002Farxiv.org\u002Fabs\u002F1508.07909) is cited as the original reference for the use of BPE in NLP applications. Today, all modern LLMs (e.g. 
GPT, Llama, Mistral) use this algorithm to train their tokenizers.\n\nThere are two Tokenizers in this repository, both of which can perform the 3 primary functions of a Tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, 3) decode from tokens to text. The files of the repo are as follows:\n\n1. [minbpe\u002Fbase.py](minbpe\u002Fbase.py): Implements the `Tokenizer` class, which is the base class. It contains the `train`, `encode`, and `decode` stubs, save\u002Fload functionality, and there are also a few common utility functions. This class is not meant to be used directly, but rather to be inherited from.\n2. [minbpe\u002Fbasic.py](minbpe\u002Fbasic.py): Implements the `BasicTokenizer`, the simplest implementation of the BPE algorithm that runs directly on text.\n3. [minbpe\u002Fregex.py](minbpe\u002Fregex.py): Implements the `RegexTokenizer` that further splits the input text by a regex pattern, which is a preprocessing stage that splits up the input text by categories (think: letters, numbers, punctuation) before tokenization. This ensures that no merges will happen across category boundaries. This was introduced in the GPT-2 paper and continues to be in use as of GPT-4. This class also handles special tokens, if any.\n4. [minbpe\u002Fgpt4.py](minbpe\u002Fgpt4.py): Implements the `GPT4Tokenizer`. This class is a light wrapper around the `RegexTokenizer` (3, above) that exactly reproduces the tokenization of GPT-4 in the [tiktoken](https:\u002F\u002Fgithub.com\u002Fopenai\u002Ftiktoken) library. The wrapping handles some details around recovering the exact merges in the tokenizer, and the handling of some unfortunate (and likely historical?) 1-byte token permutations.\n\nFinally, the script [train.py](train.py) trains the two major tokenizers on the input text [tests\u002Ftaylorswift.txt](tests\u002Ftaylorswift.txt) (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. 
This script runs in about 25 seconds on my (M1) MacBook.\n\nAll of the files above are very short and thoroughly commented, and also contain a usage example on the bottom of the file.\n\n## quick start\n\nAs the simplest example, we can reproduce the [Wikipedia article on BPE](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FByte_pair_encoding) as follows:\n\n```python\nfrom minbpe import BasicTokenizer\ntokenizer = BasicTokenizer()\ntext = \"aaabdaaabac\"\ntokenizer.train(text, 256 + 3) # 256 are the byte tokens, then do 3 merges\nprint(tokenizer.encode(text))\n# [258, 100, 258, 97, 99]\nprint(tokenizer.decode([258, 100, 258, 97, 99]))\n# aaabdaaabac\ntokenizer.save(\"toy\")\n# writes two files: toy.model (for loading) and toy.vocab (for viewing)\n```\n\nAccording to Wikipedia, running bpe on the input string: \"aaabdaaabac\" for 3 merges results in the string: \"XdXac\" where  X=ZY, Y=ab, and Z=aa. The tricky thing to note is that minbpe always allocates the 256 individual bytes as tokens, and then merges bytes as needed from there. So for us a=97, b=98, c=99, d=100 (their [ASCII](https:\u002F\u002Fwww.asciitable.com) values). Then when (a,a) is merged to Z, Z will become 256. Likewise Y will become 257 and X 258. So we start with the 256 bytes, and do 3 merges to get to the result above, with the expected output of [258, 100, 258, 97, 99].\n\n## inference: GPT-4 comparison\n\nWe can verify that the `RegexTokenizer` has feature parity with the GPT-4 tokenizer from [tiktoken](https:\u002F\u002Fgithub.com\u002Fopenai\u002Ftiktoken) as follows:\n\n```python\ntext = \"hello123!!!? (안녕하세요!) 
😉\"\n\n# tiktoken\nimport tiktoken\nenc = tiktoken.get_encoding(\"cl100k_base\")\nprint(enc.encode(text))\n# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]\n\n# ours\nfrom minbpe import GPT4Tokenizer\ntokenizer = GPT4Tokenizer()\nprint(tokenizer.encode(text))\n# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]\n```\n\n(you'll have to `pip install tiktoken` to run). Under the hood, the `GPT4Tokenizer` is just a light wrapper around `RegexTokenizer`, passing in the merges and the special tokens of GPT-4. We can also ensure the special tokens are handled correctly:\n\n```python\ntext = \"\u003C|endoftext|>hello world\"\n\n# tiktoken\nimport tiktoken\nenc = tiktoken.get_encoding(\"cl100k_base\")\nprint(enc.encode(text, allowed_special=\"all\"))\n# [100257, 15339, 1917]\n\n# ours\nfrom minbpe import GPT4Tokenizer\ntokenizer = GPT4Tokenizer()\nprint(tokenizer.encode(text, allowed_special=\"all\"))\n# [100257, 15339, 1917]\n```\n\nNote that just like tiktoken, we have to explicitly declare our intent to use and parse special tokens in the call to encode. Otherwise this can become a major footgun, unintentionally tokenizing attacker-controlled data (e.g. user prompts) with special tokens. The `allowed_special` parameter can be set to \"all\", \"none\", or a list of special tokens to allow.\n\n## training\n\nUnlike tiktoken, this code allows you to train your own tokenizer. In principle and to my knowledge, if you train the `RegexTokenizer` on a large dataset with a vocabulary size of 100K, you would reproduce the GPT-4 tokenizer.\n\nThere are two paths you can follow. First, you can decide that you don't want the complexity of splitting and preprocessing text with regex patterns, and you also don't care for special tokens. In that case, reach for the `BasicTokenizer`. 
You can train it, and then encode and decode for example as follows:\n\n```python\nfrom minbpe import BasicTokenizer\ntokenizer = BasicTokenizer()\ntokenizer.train(very_long_training_string, vocab_size=4096)\ntokenizer.encode(\"hello world\") # string -> tokens\ntokenizer.decode([1000, 2000, 3000]) # tokens -> string\ntokenizer.save(\"mymodel\") # writes mymodel.model and mymodel.vocab\ntokenizer.load(\"mymodel.model\") # loads the model back, the vocab is just for vis\n```\n\nIf you instead want to follow along with what OpenAI did for their text tokenizer, it's a good idea to adopt their approach of using a regex pattern to split the text by categories. The GPT-4 pattern is the default with the `RegexTokenizer`, so you'd simply do something like:\n\n```python\nfrom minbpe import RegexTokenizer\ntokenizer = RegexTokenizer()\ntokenizer.train(very_long_training_string, vocab_size=32768)\ntokenizer.encode(\"hello world\") # string -> tokens\ntokenizer.decode([1000, 2000, 3000]) # tokens -> string\ntokenizer.save(\"tok32k\") # writes tok32k.model and tok32k.vocab\ntokenizer.load(\"tok32k.model\") # loads the model back from disk\n```\n\nWhere, of course, you'd want to change around the vocabulary size depending on the size of your dataset.\n\n**Special tokens**. Finally, you might wish to add special tokens to your tokenizer. Register these using the `register_special_tokens` function. For example if you train with vocab_size of 32768, then the first 256 tokens are raw byte tokens, the next 32768-256 are merge tokens, and after those you can add the special tokens. The last \"real\" merge token will have id of 32767 (vocab_size - 1), so your first special token should come right after that, with an id of exactly 32768. 
So:\n\n```python\nfrom minbpe import RegexTokenizer\ntokenizer = RegexTokenizer()\ntokenizer.train(very_long_training_string, vocab_size=32768)\ntokenizer.register_special_tokens({\"\u003C|endoftext|>\": 32768})\ntokenizer.encode(\"\u003C|endoftext|>hello world\", allowed_special=\"all\")\n```\n\nYou can of course add more tokens after that as well, as you like. Finally, I'd like to stress that I tried hard to keep the code itself clean, readable and hackable. You should not feel scared to read the code and understand how it works. The tests are also a nice place to look for more usage examples. That reminds me:\n\n## tests\n\nWe use the pytest library for tests. All of them are located in the `tests\u002F` directory. First `pip install pytest` if you haven't already, then:\n\n```bash\n$ pytest -v .\n```\n\nto run the tests. (-v is verbose, slightly prettier).\n\n## community extensions\n\n* [gnp\u002Fminbpe-rs](https:\u002F\u002Fgithub.com\u002Fgnp\u002Fminbpe-rs): A Rust implementation of `minbpe` providing (near) one-to-one correspondence with the Python version\n\n## exercise\n\nFor those trying to study BPE, here is the advised progression exercise for how you can build your own minbpe step by step. See [exercise.md](exercise.md).\n\n## lecture\n\nI built the code in this repository in this [YouTube video](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=zduSFxRajkE). You can also find this lecture in text form in [lecture.md](lecture.md).\n\n## todos\n\n- write a more optimized Python version that could run over large files and big vocabs\n- write an even more optimized C or Rust version (think through)\n- rename GPT4Tokenizer to GPTTokenizer and support GPT-2\u002FGPT-3\u002FGPT-3.5 as well?\n- write a LlamaTokenizer similar to GPT4Tokenizer (i.e. 
attempt sentencepiece equivalent)\n\n## License\n\nMIT\n","# minbpe\n\n用于大型语言模型分词中常用的字节对编码（BPE）算法的极简、整洁代码。BPE算法之所以称为“字节级”，是因为它直接作用于UTF-8编码的字符串。\n\n这一算法由OpenAI的[GPT-2论文](https:\u002F\u002Fd4mucfpksywv.cloudfront.net\u002Fbetter-language-models\u002Flanguage_models_are_unsupervised_multitask_learners.pdf)及其配套的GPT-2[代码发布](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgpt-2)推广至大型语言模型领域。[Sennrich等人，2015年](https:\u002F\u002Farxiv.org\u002Fabs\u002F1508.07909)被引用为在自然语言处理应用中使用BPE的原始参考文献。如今，所有现代大型语言模型（例如GPT、Llama、Mistral）都采用该算法来训练其分词器。\n\n本仓库包含两种分词器，它们都能实现分词器的三大基本功能：1）基于给定文本训练分词器词汇表和合并规则；2）将文本编码为标记；3）将标记解码回文本。仓库中的文件如下：\n\n1. [minbpe\u002Fbase.py](minbpe\u002Fbase.py)：实现了`Tokenizer`类，这是基础类。它包含了`train`、`encode`和`decode`的空实现，以及保存\u002F加载功能，还有一些常用工具函数。这个类并不打算直接使用，而是作为其他类的基类来继承。\n2. [minbpe\u002Fbasic.py](minbpe\u002Fbasic.py)：实现了`BasicTokenizer`，这是最简单的BPE算法实现，直接作用于文本。\n3. [minbpe\u002Fregex.py](minbpe\u002Fregex.py)：实现了`RegexTokenizer`，它进一步通过正则表达式模式分割输入文本，这一步属于预处理阶段，先按类别（如字母、数字、标点符号）对输入文本进行拆分，然后再进行分词。这样可以确保不会跨类别边界发生合并。这一方法最初见于GPT-2论文，至今仍在GPT-4中沿用。该类还负责处理特殊标记（如有）。\n4. 
[minbpe\u002Fgpt4.py](minbpe\u002Fgpt4.py)：实现了`GPT4Tokenizer`。这个类是对`RegexTokenizer`（上文第3项）的轻量封装，完全复现了[tiktoken](https:\u002F\u002Fgithub.com\u002Fopenai\u002Ftiktoken)库中GPT-4的分词方式。封装处理了一些细节问题，比如如何精确恢复分词器中的合并规则，以及一些不幸的（可能是历史遗留的）单字节标记排列问题。\n\n最后，脚本[train.py](train.py)会基于输入文本[tests\u002Ftaylorswift.txt](tests\u002Ftaylorswift.txt)（这是她的维基百科词条，也是个梗）训练这两种主要分词器，并将词汇表保存到磁盘以便可视化。在我的(M1)MacBook上，这个脚本大约运行25秒。\n\n以上所有文件都非常简短且注释详尽，每份文件底部还附有使用示例。\n\n## 快速入门\n\n作为最简单的例子，我们可以重现[维基百科关于BPE的文章](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FByte_pair_encoding)，如下所示：\n\n```python\nfrom minbpe import BasicTokenizer\ntokenizer = BasicTokenizer()\ntext = \"aaabdaaabac\"\ntokenizer.train(text, 256 + 3) # 256是字节标记，然后做3次合并\nprint(tokenizer.encode(text))\n# [258, 100, 258, 97, 99]\nprint(tokenizer.decode([258, 100, 258, 97, 99]))\n# aaabdaaabac\ntokenizer.save(\"toy\")\n# 写入两个文件：toy.model（用于加载）和toy.vocab（用于查看）\n```\n\n根据维基百科，对输入字符串“aaabdaaabac”进行3次合并后，得到的结果是“XdXac”，其中X=ZY，Y=ab，Z=aa。需要注意的是，minbpe总是先分配256个单独的字节作为标记，然后根据需要再进行字节合并。因此，在这里a=97，b=98，c=99，d=100（它们的[ASCII](https:\u002F\u002Fwww.asciitable.com)值）。当(a,a)合并为Z时，Z会变成256。同样，Y会变成257，X会变成258。所以我们从256个字节开始，经过3次合并得到上述结果，预期输出为[258, 100, 258, 97, 99]。\n\n## 推理：与GPT-4对比\n\n我们可以验证`RegexTokenizer`与[tiktoken](https:\u002F\u002Fgithub.com\u002Fopenai\u002Ftiktoken)中的GPT-4分词器功能一致，如下所示：\n\n```python\ntext = \"hello123!!!? (안녕하세요!) 
😉\"\n\n# tiktoken\nimport tiktoken\nenc = tiktoken.get_encoding(\"cl100k_base\")\nprint(enc.encode(text))\n# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]\n\n# 我们的\nfrom minbpe import GPT4Tokenizer\ntokenizer = GPT4Tokenizer()\nprint(tokenizer.encode(text))\n# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]\n```\n\n（你需要先`pip install tiktoken`才能运行）。在底层，`GPT4Tokenizer`只是对`RegexTokenizer`的轻量封装，传入了GPT-4的合并规则和特殊标记。我们还可以确保特殊标记被正确处理：\n\n```python\ntext = \"\u003C|endoftext|>hello world\"\n\n# tiktoken\nimport tiktoken\nenc = tiktoken.get_encoding(\"cl100k_base\")\nprint(enc.encode(text, allowed_special=\"all\"))\n# [100257, 15339, 1917]\n\n# 我们的\nfrom minbpe import GPT4Tokenizer\ntokenizer = GPT4Tokenizer()\nprint(tokenizer.encode(text, allowed_special=\"all\"))\n# [100257, 15339, 1917]\n```\n\n注意，和tiktoken一样，我们必须在调用encode时显式声明要使用和解析特殊标记。否则可能会成为重大隐患，无意中用特殊标记对攻击者控制的数据（例如用户提示）进行分词。`allowed_special`参数可以设置为“all”、“none”，或者允许使用的特殊标记列表。\n\n## 训练\n\n与tiktoken不同，这段代码允许你训练自己的分词器。原则上，据我所知，如果你用一个包含10万个词汇的大数据集来训练`RegexTokenizer`，就能重现GPT-4的分词器。\n\n有两种路径可供选择。首先，如果你不想处理正则表达式模式对文本进行拆分和预处理的复杂性，也不在意特殊标记，那么可以选用`BasicTokenizer`。你可以训练它，然后像下面这样进行编码和解码：\n\n```python\nfrom minbpe import BasicTokenizer\ntokenizer = BasicTokenizer()\ntokenizer.train(very_long_training_string, vocab_size=4096)\ntokenizer.encode(\"hello world\") # 字符串 -> tokens\ntokenizer.decode([1000, 2000, 3000]) # tokens -> 字符串\ntokenizer.save(\"mymodel\") # 保存模型文件mymodel.model和词汇表mymodel.vocab\ntokenizer.load(\"mymodel.model\") # 从磁盘加载模型，词汇表仅用于可视化\n```\n\n如果你希望模仿OpenAI文本分词器的做法，采用正则表达式模式按类别拆分文本是个不错的选择。GPT-4的模式是`RegexTokenizer`的默认设置，因此你可以简单地这样做：\n\n```python\nfrom minbpe import RegexTokenizer\ntokenizer = RegexTokenizer()\ntokenizer.train(very_long_training_string, vocab_size=32768)\ntokenizer.encode(\"hello world\") # 字符串 -> tokens\ntokenizer.decode([1000, 2000, 3000]) # tokens -> 字符串\ntokenizer.save(\"tok32k\") # 
保存模型文件tok32k.model和词汇表tok32k.vocab\ntokenizer.load(\"tok32k.model\") # 从磁盘加载模型\n```\n\n当然，你需要根据数据集的大小调整词汇表的大小。\n\n**特殊标记**。最后，你可能希望在分词器中添加一些特殊标记。使用`register_special_tokens`函数注册这些标记。例如，如果你用vocab_size=32768训练，前256个标记是原始字节标记，接下来的32768-256个是合并标记，之后你可以添加特殊标记。最后一个“真正”的合并标记的ID是32767（vocab_size - 1），所以你的第一个特殊标记应该紧接其后，ID正好是32768。因此：\n\n```python\nfrom minbpe import RegexTokenizer\ntokenizer = RegexTokenizer()\ntokenizer.train(very_long_training_string, vocab_size=32768)\ntokenizer.register_special_tokens({\"\u003C|endoftext|>\": 32768})\ntokenizer.encode(\"\u003C|endoftext|>hello world\", allowed_special=\"all\")\n```\n\n当然，你还可以根据需要继续添加更多标记。最后，我想强调一下，我尽力让代码本身保持干净、易读且可扩展。你不应该害怕阅读代码并理解它的运作原理。测试用例也是寻找更多使用示例的好地方。说到这里：\n\n## 测试\n\n我们使用pytest库进行测试。所有测试都位于`tests\u002F`目录下。如果你还没有安装pytest，先执行`pip install pytest`，然后：\n\n```bash\n$ pytest -v .\n```\n\n运行测试。（-v表示详细输出，看起来更美观）\n\n## 社区扩展\n\n* [gnp\u002Fminbpe-rs](https:\u002F\u002Fgithub.com\u002Fgnp\u002Fminbpe-rs)：`minbpe`的Rust实现，与Python版本几乎一一对应\n\n## 练习\n\n对于那些想学习BPE的人，这里提供了一个循序渐进的练习，教你如何一步步构建自己的minbpe。请参阅[exercise.md](exercise.md)。\n\n## 讲座\n\n我在本仓库中的代码是在这个[YouTube视频](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=zduSFxRajkE)中讲解的。你也可以在[lecture.md](lecture.md)中找到这篇讲座的文字版。\n\n## 待办事项\n\n- 编写一个更优化的Python版本，能够处理大文件和大规模词汇表\n- 编写一个更优化的C或Rust版本（仔细思考）\n- 将GPT4Tokenizer更名为GPTTokenizer，并支持GPT-2\u002FGPT-3\u002FGPT-3.5？\n- 编写一个类似于GPT4Tokenizer的LlamaTokenizer（即尝试实现sentencepiece的等效功能）\n\n## 许可证\n\nMIT","# minbpe 快速上手指南\n\n## 环境准备\n- Python 3.7+\n- pip\n- git（通常已预装）\n\n## 安装步骤\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fminbpe -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 基本使用\n```python\nfrom minbpe import BasicTokenizer\ntokenizer = BasicTokenizer()\ntext = \"aaabdaaabac\"\ntokenizer.train(text, 256 + 3)  # 256字节令牌 + 3次合并\nprint(tokenizer.encode(text))  # [258, 100, 258, 97, 99]\nprint(tokenizer.decode([258, 100, 258, 97, 99]))  # aaabdaaabac\ntokenizer.save(\"toy\")  # 生成 toy.model 和 
toy.vocab\n```","某AI工程师在开发专注于中文技术文档的轻量级语言模型，需处理中英文混合文本并自定义分词规则。\n\n### 没有 minbpe 时\n- 依赖Hugging Face tokenizers库，安装需编译C++扩展，环境配置复杂且易出错\n- 修改BPE合并逻辑时需深入复杂源码，调试困难，难以快速验证效果\n- 生成的词汇表文件格式不直观，难以快速检查关键术语的合并情况\n- 处理中英文混合文本时，正则表达式配置繁琐，试错成本高\n\n### 使用 minbpe 后\n- 仅需导入minbpe的RegexTokenizer，无需额外依赖，代码仅需20行即可完成训练\n- 通过正则表达式精确分割中文、英文和标点，例如用`r'[\\u4e00-\\u9fff]|[a-zA-Z0-9]+'`，快速验证分割效果\n- 训练仅需几秒，可快速迭代调整合并次数，即时查看效果\n- 生成的vocab文件清晰展示每个token的来源，轻松确认“AI”等术语是否被正确合并\n\nminbpe让开发者能快速构建和调试自定义分词器，尤其适合需要灵活调整BPE策略的轻量级模型开发场景。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fkarpathy_minbpe_4d47bc13.png","karpathy","Andrej","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fkarpathy_75f033eb.jpg","I like to train Deep Neural Nets on large datasets.",null,"Stanford","andrej.karpathy@gmail.com","https:\u002F\u002Ftwitter.com\u002Fkarpathy","https:\u002F\u002Fgithub.com\u002Fkarpathy",[24],{"name":25,"color":26,"percentage":27},"Python","#3572A5",100,10409,1029,"2026-04-04T22:48:12","MIT",1,"Linux, macOS","不需要","未说明",{"notes":37,"python":35,"dependencies":38},"纯 Python 实现，无外部依赖，适合学习 BPE 算法",[],[40],"语言模型",2,"ready","2026-03-27T02:49:30.150509","2026-04-06T05:32:15.135559",[46,51,56,61],{"id":47,"question_zh":48,"answer_zh":49,"source_url":50},8848,"如何为多语言配置 tokenizer？","regex.py 是特定于 GPTs 的，不是通用 tokenization。你可以在自己的应用中修改 regex 以支持多语言。参考维护者回复：'Totally agree with you by the way. OpenAI does this, so we match it here. 
You can (and might well want to) change the regex in your own applications.'","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fminbpe\u002Fissues\u002F31",{"id":52,"question_zh":53,"answer_zh":54,"source_url":55},8849,"如何处理多个输入文件和特殊令牌？","在训练时，不要在文本中包含 `\u003C|endoftext|>` 字符串，而是直接在文档之间插入 token ids 作为整数。参考维护者回复：'When you're training you don't care about special tokens usually, to create training data you'd directly insert the token ids as integers in between documents instead of changing the text to contain \u003C|endoftext|> or something like that.'","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fminbpe\u002Fissues\u002F47",{"id":57,"question_zh":58,"answer_zh":59,"source_url":60},8850,"如何保存和加载 tokenizer？","保存原始合并对和排名，而不是使用 base64 编码。具体实现时，可以将合并对和排名写入文本文件，例如每个行存储合并对和排名。参考维护者回复：'tiktoken only saves the \"merged\" pairs, which I don't love. I think it makes more sense to save the raw pairs and the ranks.'","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fminbpe\u002Fissues\u002F2",{"id":62,"question_zh":63,"answer_zh":64,"source_url":65},8851,"如何学习 minbpe 相关内容？","使用 Karpathy 的 Zero To Hero 网站（https:\u002F\u002Fkarpathy.ai\u002Fzero-to-hero.html），它包含多个讲座和评论中的练习，以及 Discord 社区进行讨论。参考维护者回复：'Zero To Hero is already several \"lectures\" in, and most have exercises in the comments, there is a Discord community all learning together which is a bit like the discussion boards.'","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fminbpe\u002Fissues\u002F19",[],[68,78,86,100,109,117],{"id":69,"name":70,"github_repo":71,"description_zh":72,"stars":73,"difficulty_score":41,"last_commit_at":74,"category_tags":75,"status":42},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 
消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,"2026-04-05T11:33:21",[76,77,40],"开发框架","Agent",{"id":79,"name":80,"github_repo":81,"description_zh":82,"stars":83,"difficulty_score":41,"last_commit_at":84,"category_tags":85,"status":42},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[76,40],{"id":87,"name":88,"github_repo":89,"description_zh":90,"stars":91,"difficulty_score":41,"last_commit_at":92,"category_tags":93,"status":42},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[94,95,96,97,77,98,40,76,99],"图像","数据工具","视频","插件","其他","音频",{"id":101,"name":102,"github_repo":103,"description_zh":104,"stars":105,"difficulty_score":106,"last_commit_at":107,"category_tags":108,"status":42},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 
技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[77,94,76,40,98],{"id":110,"name":111,"github_repo":112,"description_zh":113,"stars":114,"difficulty_score":106,"last_commit_at":115,"category_tags":116,"status":42},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[40,94,76,98],{"id":118,"name":119,"github_repo":120,"description_zh":121,"stars":122,"difficulty_score":106,"last_commit_at":123,"category_tags":124,"status":42},2181,"OpenHands","OpenHands\u002FOpenHands","OpenHands 是一个专注于 AI 驱动开发的开源平台，旨在让智能体（Agent）像人类开发者一样理解、编写和调试代码。它解决了传统编程中重复性劳动多、环境配置复杂以及人机协作效率低等痛点，通过自动化流程显著提升开发速度。\n\n无论是希望提升编码效率的软件工程师、探索智能体技术的研究人员，还是需要快速原型验证的技术团队，都能从中受益。OpenHands 提供了灵活多样的使用方式：既可以通过命令行（CLI）或本地图形界面在个人电脑上轻松上手，体验类似 Devin 的流畅交互；也能利用其强大的 Python SDK 自定义智能体逻辑，甚至在云端大规模部署上千个智能体并行工作。\n\n其核心技术亮点在于模块化的软件智能体 SDK，这不仅构成了平台的引擎，还支持高度可组合的开发模式。此外，OpenHands 在 SWE-bench 基准测试中取得了 77.6% 的优异成绩，证明了其解决真实世界软件工程问题的能力。平台还具备完善的企业级功能，支持与 Slack、Jira 等工具集成，并提供细粒度的权限管理，适合从个人开发者到大型企业的各类用户场景。",70612,"2026-04-05T11:12:22",[40,77,76,97]]