[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-life4--textdistance":3,"similar-life4--textdistance":137},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":8,"readme_en":9,"readme_zh":10,"quickstart_zh":11,"use_case_zh":12,"hero_image_url":13,"owner_login":14,"owner_name":15,"owner_avatar_url":16,"owner_bio":17,"owner_company":18,"owner_location":18,"owner_email":18,"owner_twitter":18,"owner_website":19,"owner_url":20,"languages":21,"stars":34,"forks":35,"last_commit_at":36,"license":37,"difficulty_score":38,"env_os":39,"env_gpu":40,"env_ram":39,"env_deps":41,"category_tags":51,"github_topics":53,"view_count":66,"oss_zip_url":18,"oss_zip_packed_at":18,"status":67,"created_at":68,"updated_at":69,"faqs":70,"releases":71},9106,"life4\u002Ftextdistance","textdistance","📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.","TextDistance 是一款专为 Python 开发者打造的文本序列距离计算库，旨在帮助用户量化两个或多个字符串之间的相似度与差异。在数据清洗、模糊匹配、拼写纠错及生物信息学分析等场景中，如何准确衡量文本间的“距离”往往是个难题，而 TextDistance 通过提供统一的调用接口，轻松解决了这一痛点。\n\n该库最大的亮点在于其丰富的算法储备，内置了超过 30 种经典算法，涵盖基于编辑距离（如莱文斯坦、Jaro-Winkler）、基于令牌（如 Jaccard、余弦相似度）以及基于序列比对（如 Smith-Waterman）等多种策略。它不仅支持纯 Python 实现，确保环境部署简单无忧，还可选配 NumPy 以提升大规模计算时的运行速度。此外，TextDistance 突破了传统库仅能对比两个序列的限制，支持同时比较多个序列，且部分算法在同一个类中提供了多种实现方式，灵活性极高。\n\n无论是需要快速原型验证的软件工程师，还是从事自然语言处理或基因序列研究的数据科学家，TextDistance 都能成为你手中高效、可靠的得力助手，让复杂的文本对比工作变得简单直观。","# TextDistance\n\n![TextDistance logo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Flife4_textdistance_readme_211908648851.png)\n\n[![Build Status](https:\u002F\u002Ftravis-ci.org\u002Flife4\u002Ftextdistance.svg?branch=master)](https:\u002F\u002Ftravis-ci.org\u002Flife4\u002Ftextdistance) [![PyPI version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Ftextdistance.svg)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Ftextdistance) 
[![Status](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fstatus\u002Ftextdistance.svg)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Ftextdistance) [![License](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fl\u002Ftextdistance.svg)](LICENSE)\n\n**TextDistance** -- Python library for computing the distance between two or more sequences using many algorithms.\n\nFeatures:\n\n- 30+ algorithms\n- Pure Python implementation\n- Simple usage\n- Comparing more than two sequences\n- Some algorithms have more than one implementation in one class.\n- Optional NumPy usage for maximum speed.\n\n## Algorithms\n\n### Edit based\n\n| Algorithm                                                                                 | Class                | Functions              |\n|-------------------------------------------------------------------------------------------|----------------------|------------------------|\n| [Hamming](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHamming_distance)                                 | `Hamming`            | `hamming`              |\n| [MLIPNS](http:\u002F\u002Fwww.sial.iias.spb.su\u002Ffiles\u002F386-386-1-PB.pdf)                              | `MLIPNS`             | `mlipns`               |\n| [Levenshtein](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLevenshtein_distance)                         | `Levenshtein`        | `levenshtein`          |\n| [Damerau-Levenshtein](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDamerau%E2%80%93Levenshtein_distance) | `DamerauLevenshtein` | `damerau_levenshtein`  |\n| [Jaro-Winkler](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FJaro%E2%80%93Winkler_distance)               | `JaroWinkler`        | `jaro_winkler`, `jaro` |\n| [Strcmp95](http:\u002F\u002Fcpansearch.perl.org\u002Fsrc\u002FSCW\u002FText-JaroWinkler-0.1\u002Fstrcmp95.c)            | `StrCmp95`           | `strcmp95`             |\n| 
[Needleman-Wunsch](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNeedleman%E2%80%93Wunsch_algorithm)      | `NeedlemanWunsch`    | `needleman_wunsch`     |\n| [Gotoh](http:\u002F\u002Fbioinfo.ict.ac.cn\u002F~dbu\u002FAlgorithmCourses\u002FLectures\u002FLOA\u002FLec6-Sequence-Alignment-Affine-Gaps-Gotoh1982.pdf) | `Gotoh`              | `gotoh`                |\n| [Smith-Waterman](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSmith%E2%80%93Waterman_algorithm)          | `SmithWaterman`      | `smith_waterman`       |\n\n### Token based\n\n| Algorithm                                                                                 | Class                | Functions     |\n|-------------------------------------------------------------------------------------------|----------------------|---------------|\n| [Jaccard index](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FJaccard_index)                              | `Jaccard`            | `jaccard`     |\n| [Sørensen–Dice coefficient](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FS%C3%B8rensen%E2%80%93Dice_coefficient) | `Sorensen`   | `sorensen`, `sorensen_dice`, `dice` |\n| [Tversky index](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTversky_index)                              | `Tversky`            | `tversky`    |\n| [Overlap coefficient](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FOverlap_coefficient)                  | `Overlap`            | `overlap`    |\n| [Tanimoto distance](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FJaccard_index#Tanimoto_similarity_and_distance) | `Tanimoto`   | `tanimoto`   |\n| [Cosine similarity](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCosine_similarity)                      | `Cosine`             | `cosine`     |\n| [Monge-Elkan](https:\u002F\u002Fwww.academia.edu\u002F200314\u002FGeneralized_Monge-Elkan_Method_for_Approximate_Text_String_Comparison) | `MongeElkan` | `monge_elkan` |\n| [Bag 
distance](https:\u002F\u002Fgithub.com\u002FYomguithereal\u002Ftalisman\u002Fblob\u002Fmaster\u002Fsrc\u002Fmetrics\u002Fbag.js) | `Bag`        | `bag`        |\n\n### Sequence based\n\n| Algorithm | Class | Functions |\n|-----------|-------|-----------|\n| [longest common subsequence similarity](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLongest_common_subsequence_problem)          | `LCSSeq` | `lcsseq` |\n| [longest common substring similarity](https:\u002F\u002Fdocs.python.org\u002F2\u002Flibrary\u002Fdifflib.html#difflib.SequenceMatcher)      | `LCSStr` | `lcsstr` |\n| [Ratcliff-Obershelp similarity](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FGestalt_Pattern_Matching) | `RatcliffObershelp` | `ratcliff_obershelp` |\n\n### Compression based\n\n[Normalized compression distance](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNormalized_compression_distance#Normalized_compression_distance) with different compression algorithms.\n\nClassic compression algorithms:\n\n| Algorithm                                                                  | Class       | Function     |\n|----------------------------------------------------------------------------|-------------|--------------|\n| [Arithmetic coding](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FArithmetic_coding)       | `ArithNCD`  | `arith_ncd`  |\n| [RLE](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FRun-length_encoding)                   | `RLENCD`    | `rle_ncd`    |\n| [BWT RLE](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBurrows%E2%80%93Wheeler_transform) | `BWTRLENCD` | `bwtrle_ncd` |\n\nNormal compression algorithms:\n\n| Algorithm                                                                  | Class        | Function      |\n|----------------------------------------------------------------------------|--------------|---------------|\n| Square Root                                                                | `SqrtNCD`    | `sqrt_ncd`    |\n| 
[Entropy](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FEntropy_(information_theory))      | `EntropyNCD` | `entropy_ncd` |\n\nWork-in-progress algorithms that compare two strings as arrays of bits:\n\n| Algorithm                                  | Class     | Function   |\n|--------------------------------------------|-----------|------------|\n| [BZ2](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBzip2) | `BZ2NCD`  | `bz2_ncd`  |\n| [LZMA](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLZMA) | `LZMANCD` | `lzma_ncd` |\n| [ZLib](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FZlib) | `ZLIBNCD` | `zlib_ncd` |\n\nSee the [blog post](https:\u002F\u002Farticles.life4web.ru\u002Fother\u002Fncd\u002F) for more details about NCD.\n\n### Phonetic\n\n| Algorithm                                                                    | Class    | Functions |\n|------------------------------------------------------------------------------|----------|-----------|\n| [MRA](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMatch_rating_approach)                   | `MRA`    | `mra`     |\n| [Editex](https:\u002F\u002Fanhaidgroup.github.io\u002Fpy_stringmatching\u002Fv0.3.x\u002FEditex.html) | `Editex` | `editex`  |\n\n### Simple\n\n| Algorithm           | Class      | Functions  |\n|---------------------|------------|------------|\n| Prefix similarity   | `Prefix`   | `prefix`   |\n| Postfix similarity  | `Postfix`  | `postfix`  |\n| Length distance     | `Length`   | `length`   |\n| Identity similarity | `Identity` | `identity` |\n| Matrix similarity   | `Matrix`   | `matrix`   |\n\n## Installation\n\n### Stable\n\nPure Python implementation only:\n\n```bash\npip install textdistance\n```\n\nWith extra libraries for maximum speed:\n\n```bash\npip install \"textdistance[extras]\"\n```\n\nWith all libraries (required for [benchmarking](#benchmarks) and [testing](#running-tests)):\n\n```bash\npip install \"textdistance[benchmark]\"\n```\n\nWith algorithm specific 
extras:\n\n```bash\npip install \"textdistance[Hamming]\"\n```\n\nAlgorithms with available extras: `DamerauLevenshtein`, `Hamming`, `Jaro`, `JaroWinkler`, `Levenshtein`.\n\n### Dev\n\nVia pip:\n\n```bash\npip install -e git+https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance.git#egg=textdistance\n```\n\nOr clone the repo and install it with some extras:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance.git\ncd textdistance\npip install -e \".[benchmark]\"\n```\n\n## Usage\n\nAll algorithms have two interfaces:\n\n1. Class with algorithm-specific params for customizing.\n1. Class instance with default params for quick and simple usage.\n\nAll algorithms share some common methods:\n\n1. `.distance(*sequences)` -- calculate the distance between sequences.\n1. `.similarity(*sequences)` -- calculate the similarity of sequences.\n1. `.maximum(*sequences)` -- maximum possible value for distance and similarity. For any sequences: `distance + similarity == maximum`.\n1. `.normalized_distance(*sequences)` -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal and 1 means totally different.\n1. `.normalized_similarity(*sequences)` -- normalized similarity of sequences. The return value is a float between 0 and 1, where 0 means totally different and 1 means equal.\n\nMost common init arguments:\n\n1. `qval` -- q-value for splitting sequences into q-grams. Possible values:\n    - 1 (default) -- compare sequences by chars.\n    - 2 or more -- transform sequences to q-grams.\n    - None -- split sequences by words.\n1. 
`as_set` -- for token-based algorithms:\n    - True -- `t` and `ttt` are equal.\n    - False (default) -- `t` and `ttt` are different.\n\n## Examples\n\nFor example, [Hamming distance](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHamming_distance):\n\n```python\nimport textdistance\n\ntextdistance.hamming('test', 'text')\n# 1\n\ntextdistance.hamming.distance('test', 'text')\n# 1\n\ntextdistance.hamming.similarity('test', 'text')\n# 3\n\ntextdistance.hamming.normalized_distance('test', 'text')\n# 0.25\n\ntextdistance.hamming.normalized_similarity('test', 'text')\n# 0.75\n\ntextdistance.Hamming(qval=2).distance('test', 'text')\n# 2\n\n```\n\nAll other algorithms have the same interface.\n\n## Articles\n\nA few articles with examples of how to use textdistance in the real world:\n\n- [Guide to Fuzzy Matching with Python](http:\u002F\u002Ftheautomatic.net\u002F2019\u002F11\u002F13\u002Fguide-to-fuzzy-matching-with-python\u002F)\n- [String similarity — the basic know your algorithms guide!](https:\u002F\u002Fitnext.io\u002Fstring-similarity-the-basic-know-your-algorithms-guide-3de3d7346227)\n- [Normalized compression distance](https:\u002F\u002Farticles.life4web.ru\u002Fother\u002Fncd\u002F)\n\n## Extra libraries\n\nFor the main algorithms, textdistance tries to call known external libraries (fastest first) if they are available (installed on your system) and applicable (the external implementation can compare this type of sequence). [Install](#installation) textdistance with extras for this feature.\n\nYou can disable this by passing the `external=False` argument on init:\n\n```python3\nimport textdistance\nhamming = textdistance.Hamming(external=False)\nhamming('text', 'testit')\n# 3\n```\n\nSupported libraries:\n\n1. [jellyfish](https:\u002F\u002Fgithub.com\u002Fjamesturk\u002Fjellyfish)\n1. [py_stringmatching](https:\u002F\u002Fgithub.com\u002Fanhaidgroup\u002Fpy_stringmatching)\n1. [pylev](https:\u002F\u002Fgithub.com\u002Ftoastdriven\u002Fpylev)\n1. 
[Levenshtein](https:\u002F\u002Fgithub.com\u002Fmaxbachmann\u002FLevenshtein)\n1. [pyxDamerauLevenshtein](https:\u002F\u002Fgithub.com\u002Fgfairchild\u002FpyxDamerauLevenshtein)\n\nAlgorithms:\n\n1. DamerauLevenshtein\n1. Hamming\n1. Jaro\n1. JaroWinkler\n1. Levenshtein\n\n## Benchmarks\n\nWithout extras installation:\n\n| algorithm          | library               |    time |\n|--------------------|-----------------------|---------|\n| DamerauLevenshtein | rapidfuzz             | 0.00312 |\n| DamerauLevenshtein | jellyfish             | 0.00591 |\n| DamerauLevenshtein | pyxdameraulevenshtein | 0.03335 |\n| DamerauLevenshtein | **textdistance**      | 0.83524 |\n| Hamming            | Levenshtein           | 0.00038 |\n| Hamming            | rapidfuzz             | 0.00044 |\n| Hamming            | jellyfish             | 0.00091 |\n| Hamming            | **textdistance**      | 0.03531 |\n| Jaro               | rapidfuzz             | 0.00092 |\n| Jaro               | jellyfish             | 0.00191 |\n| Jaro               | **textdistance**      | 0.07365 |\n| JaroWinkler        | rapidfuzz             | 0.00094 |\n| JaroWinkler        | jellyfish             | 0.00195 |\n| JaroWinkler        | **textdistance**      | 0.07501 |\n| Levenshtein        | rapidfuzz             | 0.00099 |\n| Levenshtein        | Levenshtein           | 0.00122 |\n| Levenshtein        | jellyfish             | 0.00254 |\n| Levenshtein        | pylev                 | 0.15688 |\n| Levenshtein        | **textdistance**      | 0.53902 |\n\nTotal: 24 libs.\n\nYeah, so slow. 
Use TextDistance in production only with extras.\n\nTextDistance uses benchmark results to optimize algorithms and tries to call the fastest external library first (if possible).\n\nYou can run the benchmark manually on your system:\n\n```bash\npip install \"textdistance[benchmark]\"\npython3 -m textdistance.benchmark\n```\n\nTextDistance shows a benchmark results table for your system and saves library priorities into the `libraries.json` file in TextDistance's folder. textdistance then uses this file to call the fastest algorithm implementation. A default [libraries.json](textdistance\u002Flibraries.json) is already included in the package.\n\n## Running tests\n\nAll you need is [task](https:\u002F\u002Ftaskfile.dev\u002F). See [Taskfile.yml](.\u002FTaskfile.yml) for the list of available commands. For example, to run tests that include third-party libraries, execute `task pytest-external:run`.\n\n## Contributing\n\nPRs are welcome!\n\n- Found a bug? Fix it!\n- Want to add more algorithms? Sure! Just implement them with the same interface as the other algorithms in the lib and add some tests.\n- Can make something faster? Great! Just avoid external dependencies and remember that everything should work not only with strings.\n- Something else that you think is good? Do it! Just make sure that CI passes and everything from the README is still applicable (interface, features, and so on).\n- Have no time to code? Tell your friends and subscribers about `textdistance`. 
More users, more contributions, more amazing features.\n\nThank you :heart:\n","# 文本距离\n\n![TextDistance logo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Flife4_textdistance_readme_211908648851.png)\n\n[![构建状态](https:\u002F\u002Ftravis-ci.org\u002Flife4\u002Ftextdistance.svg?branch=master)](https:\u002F\u002Ftravis-ci.org\u002Flife4\u002Ftextdistance) [![PyPI版本](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Ftextdistance.svg)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Ftextdistance) [![状态](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fstatus\u002Ftextdistance.svg)](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Ftextdistance) [![许可证](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fl\u002Ftextdistance.svg)](LICENSE)\n\n**TextDistance** —— 一个用于通过多种算法比较两个或多个序列之间距离的 Python 库。\n\n特性：\n\n- 30 多种算法\n- 纯 Python 实现\n- 使用简单\n- 支持多于两个序列的比较\n- 部分算法在一个类中提供多种实现\n- 可选使用 NumPy 以获得最大速度。\n\n## 算法\n\n### 基于编辑距离\n\n| 算法                                                                                 | 类                | 函数              |\n|-------------------------------------------------------------------------------------------|----------------------|------------------------|\n| [汉明距离](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHamming_distance)                                 | `Hamming`            | `hamming`              |\n| [MLIPNS](http:\u002F\u002Fwww.sial.iias.spb.su\u002Ffiles\u002F386-386-1-PB.pdf)                              | `MLIPNS`             | `mlipns`               |\n| [莱文斯坦距离](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLevenshtein_distance)                         | `Levenshtein`        | `levenshtein`          |\n| [达默劳-莱文斯坦距离](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDamerau%E2%80%93Levenshtein_distance) | `DamerauLevenshtein` | `damerau_levenshtein`  |\n| [贾罗-温克勒距离](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FJaro%E2%80%93Winkler_distance)               | `JaroWinkler`        | `jaro_winkler`, `jaro` |\n| 
[Strcmp95](http:\u002F\u002Fcpansearch.perl.org\u002Fsrc\u002FSCW\u002FText-JaroWinkler-0.1\u002Fstrcmp95.c)            | `StrCmp95`           | `strcmp95`             |\n| [尼德尔曼-温施算法](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNeedleman%E2%80%93Wunsch_algorithm)      | `NeedlemanWunsch`    | `needleman_wunsch`     |\n| [后藤算法](http:\u002F\u002Fbioinfo.ict.ac.cn\u002F~dbu\u002FAlgorithmCourses\u002FLectures\u002FLOA\u002FLec6-Sequence-Alignment-Affine-Gaps-Gotoh1982.pdf) | `Gotoh`              | `gotoh`                |\n| [史密斯-沃特曼算法](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSmith%E2%80%93Waterman_algorithm)          | `SmithWaterman`      | `smith_waterman`       |\n\n### 基于标记\n\n| 算法                                                                                 | 类                | 函数     |\n|-------------------------------------------------------------------------------------------|----------------------|---------------|\n| [雅卡尔指数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FJaccard_index)                              | `Jaccard`            | `jaccard`     |\n| [索伦森-戴斯系数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FS%C3%B8rensen%E2%80%93Dice_coefficient) | `Sorensen`   | `sorensen`, `sorensen_dice`, `dice` |\n| [特弗斯基指数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTversky_index)                              | `Tversky`            | `tversky`    |\n| [重叠系数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FOverlap_coefficient)                  | `Overlap`            | `overlap`    |\n| [塔尼莫托距离](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FJaccard_index#Tanimoto_similarity_and_distance) | `Tanimoto`   | `tanimoto`   |\n| [余弦相似度](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCosine_similarity)                      | `Cosine`             | `cosine`     |\n| [蒙格-埃尔坎](https:\u002F\u002Fwww.academia.edu\u002F200314\u002FGeneralized_Monge-Elkan_Method_for_Approximate_Text_String_Comparison) | `MongeElkan` | `monge_elkan` |\n| 
[袋距离](https:\u002F\u002Fgithub.com\u002FYomguithereal\u002Ftalisman\u002Fblob\u002Fmaster\u002Fsrc\u002Fmetrics\u002Fbag.js) | `Bag`        | `bag`        |\n\n### 基于序列\n\n| 算法 | 类 | 函数 |\n|-----------|-------|-----------|\n| [最长公共子序列相似度](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLongest_common_subsequence_problem)          | `LCSSeq` | `lcsseq` |\n| [最长公共子串相似度](https:\u002F\u002Fdocs.python.org\u002F2\u002Flibrary\u002Fdifflib.html#difflib.SequenceMatcher)      | `LCSStr` | `lcsstr` |\n| [拉特克利夫-奥伯斯赫普相似度](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FGestalt_Pattern_Matching) | `RatcliffObershelp` | `ratcliff_obershelp` |\n\n### 基于压缩\n\n[归一化压缩距离](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNormalized_compression_distance#Normalized_compression_distance) 使用不同的压缩算法。\n\n经典压缩算法：\n\n| 算法                                                                  | 类       | 函数     |\n|----------------------------------------------------------------------------|-------------|--------------|\n| [算术编码](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FArithmetic_coding)       | `ArithNCD`  | `arith_ncd`  |\n| [游程编码](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FRun-length_encoding)                   | `RLENCD`    | `rle_ncd`    |\n| [BWT 游程编码](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBurrows%E2%80%93Wheeler_transform) | `BWTRLENCD` | `bwtrle_ncd` |\n\n普通压缩算法：\n\n| 算法                                                                  | 类        | 函数      |\n|----------------------------------------------------------------------------|--------------|---------------|\n| 平方根                                                                | `SqrtNCD`    | `sqrt_ncd`    |\n| [熵](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FEntropy_(information_theory))      | `EntropyNCD` | `entropy_ncd` |\n\n正在进行中的算法，将两个字符串视为位数组进行比较：\n\n| 算法                                  | 类     | 函数   |\n|--------------------------------------------|-----------|------------|\n| 
[BZ2](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBzip2) | `BZ2NCD`  | `bz2_ncd`  |\n| [LZMA](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLZMA) | `LZMANCD` | `lzma_ncd` |\n| [ZLib](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FZlib) | `ZLIBNCD` | `zlib_ncd` |\n\n更多关于 NCD 的详细信息，请参阅 [博客文章](https:\u002F\u002Farticles.life4web.ru\u002Fother\u002Fncd\u002F)。\n\n### 音韵学\n\n| 算法                                                                    | 类    | 函数 |\n|------------------------------------------------------------------------------|----------|-----------|\n| [MRA](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMatch_rating_approach)                   | `MRA`    | `mra`     |\n| [Editex](https:\u002F\u002Fanhaidgroup.github.io\u002Fpy_stringmatching\u002Fv0.3.x\u002FEditex.html) | `Editex` | `editex`  |\n\n### 简单\n\n| 算法           | 类      | 函数  |\n|---------------------|------------|------------|\n| 前缀相似度   | `Prefix`   | `prefix`   |\n| 后缀相似度  | `Postfix`  | `postfix`  |\n| 长度距离     | `Length`   | `length`   |\n| 身份相似度  | `Identity` | `identity` |\n| 矩阵相似度   | `Matrix`   | `matrix`   |\n\n## 安装\n\n### 稳定版\n\n仅使用纯 Python 实现：\n\n```bash\npip install textdistance\n```\n\n若需借助额外库以获得最佳性能：\n\n```bash\npip install \"textdistance[extras]\"\n```\n\n若需所有依赖库（用于[基准测试](#benchmarks)和[运行测试](#running-tests)）：\n\n```bash\npip install \"textdistance[benchmark]\"\n```\n\n若仅需特定算法的扩展包：\n\n```bash\npip install \"textdistance[Hamming]\"\n```\n\n支持扩展的算法包括：`DamerauLevenshtein`、`Hamming`、`Jaro`、`JaroWinkler`、`Levenshtein`。\n\n### 开发版\n\n通过 pip 安装：\n\n```bash\npip install -e git+https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance.git#egg=textdistance\n```\n\n或者克隆仓库并安装包含部分扩展的版本：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance.git\npip install -e \".[benchmark]\"\n```\n\n## 使用方法\n\n所有算法均提供两种接口：\n\n1. 带有算法特有参数的类，用于自定义配置。\n1. 具有默认参数的类实例，适合快速简便的使用场景。\n\n所有算法都具有一些通用方法：\n\n1. `.distance(*sequences)` -- 计算序列之间的距离。\n1. 
`.similarity(*sequences)` -- 计算序列之间的相似度。\n1. `.maximum(*sequences)` -- 距离与相似度的最大可能值。对于任意序列，均有 `distance + similarity == maximum`。\n1. `.normalized_distance(*sequences)` -- 归一化后的序列间距离。返回值为 0 到 1 之间的浮点数，其中 0 表示完全相同，1 表示完全不同。\n1. `.normalized_similarity(*sequences)` -- 归一化后的序列间相似度。返回值为 0 到 1 之间的浮点数，其中 0 表示完全不同，1 表示完全相同。\n\n最常见的初始化参数：\n\n1. `qval` -- 用于将序列分割成 q-gram 的 q 值。可选值：\n    - 1（默认）-- 按字符比较序列。\n    - 2 或更高 -- 将序列转换为 q-gram。\n    - None -- 按单词分割序列。\n1. `as_set` -- 适用于基于标记的算法：\n    - True -- `t` 和 `ttt` 视为相等。\n    - False（默认）-- `t` 和 `ttt` 视为不同。\n\n## 示例\n\n以 [汉明距离](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHamming_distance) 为例：\n\n```python\nimport textdistance\n\ntextdistance.hamming('test', 'text')\n# 1\n\ntextdistance.hamming.distance('test', 'text')\n# 1\n\ntextdistance.hamming.similarity('test', 'text')\n# 3\n\ntextdistance.hamming.normalized_distance('test', 'text')\n# 0.25\n\ntextdistance.hamming.normalized_similarity('test', 'text')\n# 0.75\n\ntextdistance.Hamming(qval=2).distance('test', 'text')\n# 2\n\n```\n\n其他算法也具有相同的接口。\n\n## 文章\n\n以下是一些介绍如何在实际中使用 textdistance 的文章：\n\n- [Python 模糊匹配指南](http:\u002F\u002Ftheautomatic.net\u002F2019\u002F11\u002F13\u002Fguide-to-fuzzy-matching-with-python\u002F)\n- [字符串相似度——基础算法指南！](https:\u002F\u002Fitnext.io\u002Fstring-similarity-the-basic-know-your-algorithms-guide-3de3d7346227)\n- [归一化压缩距离](https:\u002F\u002Farticles.life4web.ru\u002Fother\u002Fncd\u002F)\n\n## 额外库\n\n对于主要算法，textdistance 会优先尝试调用已知的外部库（按速度从快到慢排序），前提是这些库已安装在您的系统中且适用（即该实现可以处理当前类型的序列）。请通过安装带扩展的 textdistance 来启用此功能。\n\n您也可以通过在初始化时传入 `external=False` 参数来禁用此功能：\n\n```python3\nimport textdistance\nhamming = textdistance.Hamming(external=False)\nhamming('text', 'testit')\n# 3\n```\n\n支持的库包括：\n\n1. [jellyfish](https:\u002F\u002Fgithub.com\u002Fjamesturk\u002Fjellyfish)\n1. [py_stringmatching](https:\u002F\u002Fgithub.com\u002Fanhaidgroup\u002Fpy_stringmatching)\n1. [pylev](https:\u002F\u002Fgithub.com\u002Ftoastdriven\u002Fpylev)\n1. 
[Levenshtein](https:\u002F\u002Fgithub.com\u002Fmaxbachmann\u002FLevenshtein)\n1. [pyxDamerauLevenshtein](https:\u002F\u002Fgithub.com\u002Fgfairchild\u002FpyxDamerauLevenshtein)\n\n支持的算法：\n\n1. DamerauLevenshtein\n1. Hamming\n1. Jaro\n1. JaroWinkler\n1. Levenshtein\n\n## 基准测试\n\n未安装扩展时的结果如下：\n\n| 算法          | 库               | 时间 |\n|--------------------|-----------------------|---------|\n| DamerauLevenshtein | rapidfuzz             | 0.00312 |\n| DamerauLevenshtein | jellyfish             | 0.00591 |\n| DamerauLevenshtein | pyxdameraulevenshtein | 0.03335 |\n| DamerauLevenshtein | **textdistance**      | 0.83524 |\n| Hamming            | Levenshtein           | 0.00038 |\n| Hamming            | rapidfuzz             | 0.00044 |\n| Hamming            | jellyfish             | 0.00091 |\n| Hamming            | **textdistance**      | 0.03531 |\n| Jaro               | rapidfuzz             | 0.00092 |\n| Jaro               | jellyfish             | 0.00191 |\n| Jaro               | **textdistance**      | 0.07365 |\n| JaroWinkler        | rapidfuzz             | 0.00094 |\n| JaroWinkler        | jellyfish             | 0.00195 |\n| JaroWinkler        | **textdistance**      | 0.07501 |\n| Levenshtein        | rapidfuzz             | 0.00099 |\n| Levenshtein        | Levenshtein           | 0.00122 |\n| Levenshtein        | jellyfish             | 0.00254 |\n| Levenshtein        | pylev                 | 0.15688 |\n| Levenshtein        | **textdistance**      | 0.53902 |\n\n总计：24 个库。\n\n确实很慢。请仅在安装了扩展的情况下将 TextDistance 用于生产环境。\n\nTextdistance 会根据基准测试结果优化算法，并优先调用最快的外部库（如果可行）。\n\n您也可以在本地手动运行基准测试：\n\n```bash\npip install textdistance[benchmark]\npython3 -m textdistance.benchmark\n```\n\nTextdistance 会显示针对您系统的基准测试结果表，并将库的优先级保存到 Textdistance 文件夹中的 `libraries.json` 文件中。此文件将用于选择最快的算法实现。默认的 [libraries.json](textdistance\u002Flibraries.json) 已包含在软件包中。\n\n## 运行测试\n\n您只需要 [task](https:\u002F\u002Ftaskfile.dev\u002F) 工具即可。有关可用命令列表，请参阅 
[Taskfile.yml](.\u002FTaskfile.yml)。例如，要运行包含第三方库的测试，可以执行 `task pytest-external:run`。\n\n## 贡献\n\n欢迎提交 PR！\n\n- 发现了 bug？请修复它！\n- 想添加更多算法？当然可以！只需确保新算法与库中现有算法采用相同的接口，并编写相应的测试用例。\n- 能让某些功能更快吗？太好了！请尽量避免引入外部依赖，并记住所有功能不仅应适用于字符串。\n- 还有其他您认为有益的想法吗？尽管去做吧！只要确保 CI 测试通过，并且 README 中的所有内容仍然适用（接口、特性等）。\n- 没时间写代码？那就告诉您的朋友和关注者关于 `textdistance` 吧！用户越多，贡献越多，就越能带来更多令人惊叹的功能。\n\n感谢您的支持 :heart:","# TextDistance 快速上手指南\n\nTextDistance 是一个功能强大的 Python 库，提供了 30 多种算法来计算两个或多个序列（字符串、列表等）之间的距离和相似度。它支持编辑距离、基于令牌、基于序列、压缩距离及语音算法等多种类型，且接口统一，使用简单。\n\n## 环境准备\n\n- **操作系统**：Windows, macOS, Linux\n- **Python 版本**：Python 3.6+\n- **前置依赖**：无强制外部依赖（纯 Python 实现），但为了获得最佳性能，建议安装额外的加速库。\n\n## 安装步骤\n\n### 1. 基础安装（纯 Python 实现）\n适用于快速体验或对性能要求不高的场景：\n```bash\npip install textdistance\n```\n*国内用户推荐使用清华源加速：*\n```bash\npip install textdistance -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 2. 高性能安装（推荐）\n安装额外依赖库（如 `rapidfuzz`, `jellyfish` 等），可显著提升计算速度（生产环境必备）：\n```bash\npip install \"textdistance[extras]\"\n```\n*国内用户推荐使用清华源加速：*\n```bash\npip install \"textdistance[extras]\" -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 3. 完整安装（含基准测试工具）\n如果你需要进行性能基准测试或开发贡献：\n```bash\npip install \"textdistance[benchmark]\"\n```\n\n## 基本使用\n\nTextDistance 的所有算法都提供统一的接口，既可以直接调用快捷函数，也可以实例化类以自定义参数。\n\n### 1. 最简用法（快捷函数）\n直接导入库并调用算法名称作为函数，默认参数即可满足大多数需求。\n\n```python\nimport textdistance\n\n# 计算汉明距离 (Hamming Distance)\ndist = textdistance.hamming('test', 'text')\nprint(dist) \n# 输出: 1\n\n# 计算相似度\nsim = textdistance.hamming.similarity('test', 'text')\nprint(sim) \n# 输出: 3\n\n# 计算归一化距离 (0.0 表示完全相同，1.0 表示完全不同)\nnorm_dist = textdistance.hamming.normalized_distance('test', 'text')\nprint(norm_dist) \n# 输出: 0.25\n\n# 计算归一化相似度 (0.0 表示完全不同，1.0 表示完全相同)\nnorm_sim = textdistance.hamming.normalized_similarity('test', 'text')\nprint(norm_sim) \n# 输出: 0.75\n```\n\n### 2. 
进阶用法（类实例化与参数定制）\n通过实例化类，可以调整算法参数，例如设置 q-gram 大小或将输入按单词分割。\n\n```python\nimport textdistance\n\n# 初始化汉明距离算法，设置 qval=2 (将字符串分为双字符组进行比较)\nhamming_algo = textdistance.Hamming(qval=2)\n\nresult = hamming_algo.distance('test', 'text')\nprint(result) \n# 输出: 2 (因为 'te'=='te', 'es'!='ex', 'st'!='xt')\n\n# 禁用外部加速库，强制使用纯 Python 实现\nhamming_pure = textdistance.Hamming(external=False)\nresult_pure = hamming_pure('text', 'testit')\nprint(result_pure)\n```\n\n### 常用通用方法\n所有算法类实例均支持以下方法：\n- `.distance(*sequences)`: 计算距离。\n- `.similarity(*sequences)`: 计算相似度。\n- `.normalized_distance(*sequences)`: 返回 0~1 之间的归一化距离。\n- `.normalized_similarity(*sequences)`: 返回 0~1 之间的归一化相似度。\n- `.maximum(*sequences)`: 返回该序列对可能的最大距离\u002F相似度值。","某电商公司的数据团队正在清洗来自不同渠道的用户评论，需要将新收到的评论与数据库中的历史差评进行匹配，以自动识别重复投诉或恶意刷单行为。\n\n### 没有 textdistance 时\n- 开发人员需手动查找并分别安装多个独立的算法库（如专门计算 Levenshtein 或 Jaro-Winkler 的包），导致依赖管理混乱且环境臃肿。\n- 面对“拼写错误”、“词序颠倒”或“部分重合”等不同噪声场景，难以快速切换算法验证哪种相似度计算最准确，试错成本极高。\n- 代码中充斥着各异的函数调用方式，缺乏统一接口，使得批量对比成千上万条评论的逻辑复杂且难以维护。\n- 纯 Python 实现的处理速度在大数据量下成为瓶颈，而引入 NumPy 加速往往需要重写大量底层逻辑。\n\n### 使用 textdistance 后\n- 仅需安装一个库即可调用 30 多种算法（从编辑距离到余弦相似度），极大简化了项目依赖和环境配置。\n- 通过统一的接口轻松切换 Hamming、Levenshtein 或 Monge-Elkan 等算法，迅速找到最适合中文评论模糊匹配的策略。\n- 利用其支持多序列对比的特性，用简洁的代码一次性将新评论与历史库进行高效比对，显著提升了开发效率。\n- 按需启用可选的 NumPy 后端支持，在不修改业务逻辑的前提下，将大规模文本距离计算的吞吐量提升数倍。\n\ntextdistance 通过提供统一、多样且高效的算法接口，让复杂的文本相似度分析变得像调用标准数学函数一样简单快捷。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Flife4_textdistance_21190864.png","life4","Life4","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Flife4_d8479791.png","Big and cool projects by @orsinium. 
See also @orsinium-labs.",null,"https:\u002F\u002Forsinium.dev\u002F","https:\u002F\u002Fgithub.com\u002Flife4",[22,26,30],{"name":23,"color":24,"percentage":25},"Python","#3572A5",98.2,{"name":27,"color":28,"percentage":29},"Jupyter Notebook","#DA5B0B",1.6,{"name":31,"color":32,"percentage":33},"Shell","#89e051",0.2,3526,257,"2026-04-13T15:14:29","MIT",1,"未说明","不需要 GPU",{"notes":42,"python":39,"dependencies":43},"该库为纯 Python 实现，无需特殊硬件。默认安装仅包含纯 Python 代码，速度较慢；建议通过 'pip install textdistance[extras]' 安装可选依赖以调用外部 C\u002FC++ 库获得最大速度。部分算法（如基于压缩的距离）依赖系统自带的压缩工具（如 bz2, lzma, zlib）。",[44,45,46,47,48,49,50],"numpy (可选，用于加速)","jellyfish (可选，外部加速库)","py_stringmatching (可选，外部加速库)","pylev (可选，外部加速库)","Levenshtein (可选，外部加速库)","pyxDamerauLevenshtein (可选，外部加速库)","rapidfuzz (基准测试中提到的高性能替代库)",[52],"图像",[54,55,56,6,57,58,59,60,61,62,63,64,65],"distance","algorithm","python","hamming-distance","levenshtein-distance","damerau-levenshtein","damerau-levenshtein-distance","algorithms","distance-calculation","jellyfish","diff","levenshtein",2,"ready","2026-03-27T02:49:30.150509","2026-04-18T22:35:29.052180",[],[72,77,82,87,92,97,102,107,112,117,122,127,132],{"id":73,"version":74,"summary_zh":75,"released_at":76},324426,"4.6.3","## 变更内容\n* 由 @ChristofKaufmann 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F93 中修复了拼写错误并添加了缺失的导入语句。\n\n## 新贡献者\n* @ChristofKaufmann 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F93 中完成了他们的首次贡献。\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fcompare\u002F4.6.2...4.6.3","2024-07-16T09:36:19",{"id":78,"version":79,"summary_zh":80,"released_at":81},324427,"4.6.2","## 变更内容\n* Levenstein：由 @KKGBiz 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F92 中确保返回类型为 int\n\n## 新贡献者\n* @KKGBiz 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F92 
中完成了他们的首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fcompare\u002F4.6.1...4.6.2","2024-04-24T11:36:32",{"id":83,"version":84,"summary_zh":85,"released_at":86},324428,"4.6.1","## 变更内容\n* 由 @juliangilbey 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F91 中从 libraries.json 中移除了 abydos 相关引用\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fcompare\u002F4.6.0...4.6.1","2023-12-29T09:27:05",{"id":88,"version":89,"summary_zh":90,"released_at":91},324429,"4.6.0","## 变更内容\n\n可能破坏兼容性：\n* 由 @orsinium 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F90 中移除 abydos 库\n\n其余变更：\n* 由 @musicinmybrain 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F85 中移除 README.md 文件的可执行权限位\n* 由 @musicinmybrain 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F87 中将 setup.cfg 中的 deprecated license_file 替换为 license_files\n* 由 @orsinium 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F89 中添加 GitHub Actions 工作流\n* 由 @musicinmybrain 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F86 中更新 python-Levenshtein 的 URL 和名称\n* 由 @orsinium 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F88 中进行了一些小修复\n\n## 新贡献者\n* @musicinmybrain 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F85 中完成了首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fcompare\u002F4.5.0...4.6.0","2023-09-28T08:32:07",{"id":93,"version":94,"summary_zh":95,"released_at":96},324430,"4.5.0","## 变更内容\n* 在 CI 中运行 Python 3.10 测试，由 @orsinium 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F80 中完成\n* 类型注解，由 @orsinium 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F82 中完成\n* 新增 DamerauLevenshtein... 
类，由 @juliangilbey 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F84 中完成\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fcompare\u002F4.4.0...4.5.0","2022-09-18T07:47:16",{"id":98,"version":99,"summary_zh":100,"released_at":101},324431,"4.4.0","## 变更内容\n* 由 @maxbachmann 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F83 中更新了 rapidfuzz\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fcompare\u002F4.3.0...4.4.0","2022-08-21T07:00:39",{"id":103,"version":104,"summary_zh":105,"released_at":106},324432,"4.3.0","## 变更内容\n* 确保最大归一化距离不超过 1，并…，由 @juliangilbey 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F78 中完成\n* 忽略部分比较测试中不一致的计时，由 @juliangilbey 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F79 中完成\n* 添加对 rapidfuzz 的支持，由 @maxbachmann 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F77 中完成\n\n## 新贡献者\n* @maxbachmann 在 https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fpull\u002F77 中完成了他们的首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Flife4\u002Ftextdistance\u002Fcompare\u002F4.2.2...4.3.0","2022-06-29T12:49:51",{"id":108,"version":109,"summary_zh":110,"released_at":111},324433,"v.4.2.1","#70 ","2021-01-29T09:04:46",{"id":113,"version":114,"summary_zh":115,"released_at":116},324434,"v.4.2.0","+ 停止对 Python 2 的支持。我们遵循官方的 Python 发布周期。目前 CI 已切换为在 Python 3.6 及以上版本上运行。对于 3.4 和 3.5，现有功能应该仍然可用，但建议逐步迁移，迁移工作并不复杂。\n+ 我们已将测试迁移到 pytest + hypothesis 上，这帮助我们发现了许多 bug。\n+ 一些修复：修正了 Damerau-Levenshtein 算法中的一个 bug，改进了 Smith-Waterman 算法中的归一化处理，并修复了 Soundex 对部分 Unicode 字符的支持问题。\n+ 所有类现在都接受 `external` 参数，即使它们目前不依赖任何外部库。\n+ CI 和发布流程由 [DepHell](https:\u002F\u002Fgithub.com\u002Fdephell\u002Fdephell) 提供支持。","2020-04-13T10:17:59",{"id":118,"version":119,"summary_zh":120,"released_at":121},324435,"v.4.1.5","#38 #42 #43 
#44","2019-10-03T11:38:44",{"id":123,"version":124,"summary_zh":125,"released_at":126},324436,"v4.1.0","+ [Normalized compression distance](https:\u002F\u002Farticles.life4web.ru\u002Feng\u002Fncd\u002F) algorithms that really work. The implementation [was discussed](https:\u002F\u002Fgithub.com\u002Forsinium\u002Ftextdistance\u002Fissues\u002F21) with the NCD author.\r\n+ MIT license\r\n+ All code tested on PyPy 3.5\r\n+ Strict Flake8 rules to make the source code easier to read.\r\n","2019-03-09T18:14:03",{"id":128,"version":129,"summary_zh":130,"released_at":131},324437,"3.0.0","+ optional dependencies support\r\n+ benchmarks\r\n+ faster Levenshtein and Damerau-Levenshtein\r\n+ tox powered tests\r\n","2018-03-31T08:22:21",{"id":133,"version":134,"summary_zh":135,"released_at":136},324438,"2.0.1","Global project update! The old interface has been kept, but is deprecated.\r\n\r\nThe project now contains 30+ algorithms, separated into categories, with a common intuitive interface and default parameters.","2018-02-10T10:52:39",[138,150,158,166,175,184],{"id":139,"name":140,"github_repo":141,"description_zh":142,"stars":143,"difficulty_score":144,"last_commit_at":145,"category_tags":146,"status":67},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows 
WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[147,148,52,149],"Agent","开发框架","数据工具",{"id":151,"name":152,"github_repo":153,"description_zh":154,"stars":155,"difficulty_score":144,"last_commit_at":156,"category_tags":157,"status":67},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[148,52,147],{"id":159,"name":160,"github_repo":161,"description_zh":162,"stars":163,"difficulty_score":66,"last_commit_at":164,"category_tags":165,"status":67},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[148,52,147],{"id":167,"name":168,"github_repo":169,"description_zh":170,"stars":171,"difficulty_score":66,"last_commit_at":172,"category_tags":173,"status":67},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[174,147,52,148],"插件",{"id":176,"name":177,"github_repo":178,"description_zh":179,"stars":180,"difficulty_score":144,"last_commit_at":181,"category_tags":182,"status":67},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[183,52,147,148],"语言模型",{"id":185,"name":186,"github_repo":187,"description_zh":188,"stars":189,"difficulty_score":144,"last_commit_at":190,"category_tags":191,"status":67},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[148,52,147,192],"视频"]