[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-ekzhu--datasketch":3,"tool-ekzhu--datasketch":61},[4,18,28,36,45,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":24,"last_commit_at":25,"category_tags":26,"status":17},9989,"n8n","n8n-io\u002Fn8n","n8n 是一款面向技术团队的公平代码（fair-code）工作流自动化平台，旨在让用户在享受低代码快速构建便利的同时，保留编写自定义代码的灵活性。它主要解决了传统自动化工具要么过于封闭难以扩展、要么完全依赖手写代码效率低下的痛点，帮助用户轻松连接 400 多种应用与服务，实现复杂业务流程的自动化。\n\nn8n 特别适合开发者、工程师以及具备一定技术背景的业务人员使用。其核心亮点在于“按需编码”：既可以通过直观的可视化界面拖拽节点搭建流程，也能随时插入 JavaScript 或 Python 代码、调用 npm 包来处理复杂逻辑。此外，n8n 原生集成了基于 LangChain 的 AI 能力，支持用户利用自有数据和模型构建智能体工作流。在部署方面，n8n 提供极高的自由度，支持完全自托管以保障数据隐私和控制权，也提供云端服务选项。凭借活跃的社区生态和数百个现成模板，n8n 让构建强大且可控的自动化系统变得简单高效。",184740,2,"2026-04-19T23:22:26",[16,14,13,15,27],"插件",{"id":29,"name":30,"github_repo":31,"description_zh":32,"stars":33,"difficulty_score":10,"last_commit_at":34,"category_tags":35,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 
绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":24,"last_commit_at":42,"category_tags":43,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",161147,"2026-04-19T23:31:47",[14,13,44],"语言模型",{"id":46,"name":47,"github_repo":48,"description_zh":49,"stars":50,"difficulty_score":24,"last_commit_at":51,"category_tags":52,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 
都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":24,"last_commit_at":59,"category_tags":60,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[27,13,15,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":76,"owner_twitter":72,"owner_website":78,"owner_url":79,"languages":80,"stars":85,"forks":86,"last_commit_at":87,"license":88,"difficulty_score":89,"env_os":90,"env_gpu":90,"env_ram":90,"env_deps":91,"category_tags":100,"github_topics":102,"view_count":24,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":117,"updated_at":118,"faqs":119,"releases":149},9873,"ekzhu\u002Fdatasketch","datasketch","MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW","datasketch 是一个专为处理海量数据而设计的 Python 库，它让大数据看起来更小、更易管理。面对亿级数据时，传统方法往往因计算量过大而变得缓慢甚至不可行，datasketch 通过引入概率数据结构，在几乎不牺牲准确性的前提下，实现了超高速的数据处理与搜索能力。\n\n该工具核心解决了大规模数据去重、相似度估算及基数统计的难题。它内置了 MinHash、Weighted MinHash、HyperLogLog 等经典算法，能快速估算集合间的杰卡德相似度或唯一元素数量。更独特的是，datasketch 提供了多种高效索引机制（如 LSH、LSH Forest、HNSW），支持亚线性时间的查询速度，并能无缝对接 Redis 或 Cassandra 
等分布式存储系统，轻松应对工业级规模的数据挑战。\n\ndatasketch 非常适合需要处理大规模数据集的后端开发者、数据工程师及算法研究人员。无论是构建推荐系统中的相似物品查找，还是在日志分析中进行快速去重统计，它都能提供坚实的技术支撑。只需简单的 pip 安装即可上手，是提升大数据处理效率的得力助手。","datasketch: Big Data Looks Small\n================================\n\n.. image:: https:\u002F\u002Fstatic.pepy.tech\u002Fbadge\u002Fdatasketch\u002Fmonth\n    :target: https:\u002F\u002Fpepy.tech\u002Fproject\u002Fdatasketch\n\n.. image:: https:\u002F\u002Fzenodo.org\u002Fbadge\u002FDOI\u002F10.5281\u002Fzenodo.598238.svg\n   :target: https:\u002F\u002Fzenodo.org\u002Fdoi\u002F10.5281\u002Fzenodo.598238\n\n.. image:: https:\u002F\u002Fcodecov.io\u002Fgh\u002Fekzhu\u002Fdatasketch\u002Fbranch\u002Fmaster\u002Fgraph\u002Fbadge.svg\n    :target: https:\u002F\u002Fcodecov.io\u002Fgh\u002Fekzhu\u002Fdatasketch\n\ndatasketch gives you probabilistic data structures that can process and\nsearch very large amount of data super fast, with little loss of\naccuracy.\n\nThis package contains the following data sketches:\n\n+-------------------------+-----------------------------------------------+\n| Data Sketch             | Usage                                         |\n+=========================+===============================================+\n| `MinHash`_              | estimate Jaccard similarity and cardinality   |\n+-------------------------+-----------------------------------------------+\n| `Weighted MinHash`_     | estimate weighted Jaccard similarity          |\n+-------------------------+-----------------------------------------------+\n| `HyperLogLog`_          | estimate cardinality                          |\n+-------------------------+-----------------------------------------------+\n| `HyperLogLog++`_        | estimate cardinality                          |\n+-------------------------+-----------------------------------------------+\n\nThe following indexes for data sketches are provided to support\nsub-linear query 
time:\n\n+---------------------------+-----------------------------+------------------------+\n| Index                     | For Data Sketch             | Supported Query Type   |\n+===========================+=============================+========================+\n| `MinHash LSH`_            | MinHash, Weighted MinHash   | Jaccard Threshold      |\n+---------------------------+-----------------------------+------------------------+\n| `LSHBloom`_               | MinHash, Weighted MinHash   | Jaccard Threshold      |\n+---------------------------+-----------------------------+------------------------+\n| `MinHash LSH Forest`_     | MinHash, Weighted MinHash   | Jaccard Top-K          |\n+---------------------------+-----------------------------+------------------------+\n| `MinHash LSH Ensemble`_   | MinHash                     | Containment Threshold  |\n+---------------------------+-----------------------------+------------------------+\n| `HNSW`_                   | Any                         | Custom Metric Top-K    |\n+---------------------------+-----------------------------+------------------------+\n\ndatasketch must be used with Python 3.9 or above, NumPy 1.11 or above, and Scipy.\n\nNote that `MinHash LSH`_ and `MinHash LSH Ensemble`_ also support Redis and Cassandra \nstorage layer (see `MinHash LSH at Scale`_).\n\nInstall\n-------\n\nTo install datasketch using ``pip``:\n\n.. code-block:: bash\n\n    pip install datasketch\n\nThis will also install NumPy as dependency.\n\nTo install with Redis dependency:\n\n.. code-block:: bash\n\n    pip install datasketch[redis]\n\nTo install with Cassandra dependency:\n\n.. code-block:: bash\n\n    pip install datasketch[cassandra]\n\nTo install with Bloom filter dependency:\n\n.. code-block:: bash\n\n    pip install datasketch[bloom]\n\n.. _`MinHash`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Fminhash.html\n.. 
_`Weighted MinHash`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Fweightedminhash.html\n.. _`HyperLogLog`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Fhyperloglog.html\n.. _`HyperLogLog++`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Fhyperloglog.html#hyperloglog-plusplus\n.. _`MinHash LSH`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Flsh.html\n.. _`MinHash LSH Forest`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Flshforest.html\n.. _`MinHash LSH Ensemble`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Flshensemble.html\n.. _`LSHBloom`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Flshbloom.html\n.. _`Minhash LSH at Scale`: http:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Flsh.html#minhash-lsh-at-scale\n.. _`HNSW`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Fdocumentation.html#hnsw\n\nContributing\n------------\n\nWe welcome contributions from everyone. Whether you're fixing bugs, adding features, improving documentation, or helping with tests, your contributions are valuable.\n\nDevelopment Setup\n^^^^^^^^^^^^^^^^^\n\nThe project uses `uv` for fast and reliable Python package management. Follow these steps to set up your development environment:\n\n1. **Install uv**: Follow the official installation guide at https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002Fgetting-started\u002Finstallation\u002F\n\n2. **Clone the repository**:\n\n   .. code-block:: bash\n\n       git clone https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch.git\n       cd datasketch\n\n3. **Set up the environment**:\n\n   .. code-block:: bash\n\n       # Create a virtual environment\n       # (Optional: specify Python version with --python 3.x)\n       uv venv\n       # Activate the virtual environment (optional, uv run commands work without it)\n       source .venv\u002Fbin\u002Factivate\n\n       # Install all dependencies\n       uv sync\n\n4. **Verify installation**:\n\n   .. 
code-block:: bash\n\n       # Run tests to ensure everything works\n       uv run pytest\n\n5. **Optional dependencies** (for specific development needs):\n\n   .. code-block:: bash\n\n       # For testing\n       uv sync --extra test\n\n       # For Cassandra support\n       uv sync --extra cassandra\n\n       # For Redis support\n       uv sync --extra redis\n\n       # For all extras\n       uv sync --all-extras\n\nLearn more about `uv` at https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F\n\nDevelopment Workflow\n^^^^^^^^^^^^^^^^^^^^\n\n1. **Fork the repository** on GitHub if you haven't already.\n\n2. **Create a feature branch** for your changes:\n\n   .. code-block:: bash\n\n       git checkout -b feature\u002Fyour-feature-name\n       # Or for bug fixes:\n       git checkout -b fix\u002Fissue-description\n\n3. **Make your changes** following the project's coding standards.\n\n4. **Run the tests** to ensure nothing is broken:\n\n   .. code-block:: bash\n\n       uv run pytest\n\n5. **Check code quality** with ruff:\n\n   .. code-block:: bash\n\n       # Check for issues\n       uvx ruff check .\n\n       # Auto-fix formatting issues\n       uvx ruff format .\n\n6. **Commit your changes** with a clear, descriptive commit message:\n\n   .. code-block:: bash\n\n       git commit -m \"Add feature: brief description of what was changed\"\n\n7. **Push to your fork** and create a pull request on GitHub:\n\n   .. code-block:: bash\n\n       git push origin your-branch-name\n\n8. **Respond to feedback** from maintainers and iterate on your changes.\n\nGuidelines\n^^^^^^^^^^\n\n- Follow PEP 8 style guidelines\n- Write tests for new features\n- Update documentation as needed\n- Keep commits focused and atomic\n- Be respectful in discussions\n\nFor more information, check the `GitHub issues \u003Chttps:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fissues>`_ for current priorities or areas needing help. 
You can also join the discussion on `project roadmap and priorities \u003Chttps:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fdiscussions\u002F252>`_.","datasketch：大数据看起来很小\n================================\n\n.. image:: https:\u002F\u002Fstatic.pepy.tech\u002Fbadge\u002Fdatasketch\u002Fmonth\n    :target: https:\u002F\u002Fpepy.tech\u002Fproject\u002Fdatasketch\n\n.. image:: https:\u002F\u002Fzenodo.org\u002Fbadge\u002FDOI\u002F10.5281\u002Fzenodo.598238.svg\n   :target: https:\u002F\u002Fzenodo.org\u002Fdoi\u002F10.5281\u002Fzenodo.598238\n\n.. image:: https:\u002F\u002Fcodecov.io\u002Fgh\u002Fekzhu\u002Fdatasketch\u002Fbranch\u002Fmaster\u002Fgraph\u002Fbadge.svg\n    :target: https:\u002F\u002Fcodecov.io\u002Fgh\u002Fekzhu\u002Fdatasketch\n\ndatasketch 提供了概率数据结构，能够在几乎不损失精度的情况下，以极快的速度处理和搜索海量数据。\n\n该包包含以下数据草图：\n\n+-------------------------+-----------------------------------------------+\n| 数据草图             | 用途                                         |\n+=========================+===============================================+\n| `MinHash`_              | 估计 Jaccard 相似度和基数                   |\n+-------------------------+-----------------------------------------------+\n| `Weighted MinHash`_     | 估计加权 Jaccard 相似度                      |\n+-------------------------+-----------------------------------------------+\n| `HyperLogLog`_          | 估计基数                                    |\n+-------------------------+-----------------------------------------------+\n| `HyperLogLog++`_        | 估计基数                                    |\n+-------------------------+-----------------------------------------------+\n\n为了支持亚线性查询时间，提供了以下数据草图的索引：\n\n+---------------------------+-----------------------------+------------------------+\n| 索引                     | 适用于数据草图             | 支持的查询类型         |\n+===========================+=============================+========================+\n| `MinHash LSH`_            | MinHash、Weighted MinHash   | Jaccard 阈值           
|\n+---------------------------+-----------------------------+------------------------+\n| `LSHBloom`_               | MinHash、Weighted MinHash   | Jaccard 阈值           |\n+---------------------------+-----------------------------+------------------------+\n| `MinHash LSH Forest`_     | MinHash、Weighted MinHash   | Jaccard Top-K          |\n+---------------------------+-----------------------------+------------------------+\n| `MinHash LSH Ensemble`_   | MinHash                     | 包含性阈值             |\n+---------------------------+-----------------------------+------------------------+\n| `HNSW`_                   | 任何                        | 自定义度量 Top-K       |\n+---------------------------+-----------------------------+------------------------+\n\ndatasketch 必须与 Python 3.9 或更高版本、NumPy 1.11 或更高版本以及 Scipy 一起使用。\n\n请注意，`MinHash LSH`_ 和 `MinHash LSH Ensemble`_ 还支持 Redis 和 Cassandra 存储层（参见 `MinHash LSH at Scale`_）。\n\n安装\n-------\n\n要使用 ``pip`` 安装 datasketch：\n\n.. code-block:: bash\n\n    pip install datasketch\n\n这也会将 NumPy 作为依赖项一起安装。\n\n若需安装 Redis 依赖项：\n\n.. code-block:: bash\n\n    pip install datasketch[redis]\n\n若需安装 Cassandra 依赖项：\n\n.. code-block:: bash\n\n    pip install datasketch[cassandra]\n\n若需安装 Bloom 过滤器依赖项：\n\n.. code-block:: bash\n\n    pip install datasketch[bloom]\n\n.. _`MinHash`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Fminhash.html\n.. _`Weighted MinHash`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Fweightedminhash.html\n.. _`HyperLogLog`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Fhyperloglog.html\n.. _`HyperLogLog++`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Fhyperloglog.html#hyperloglog-plusplus\n.. _`MinHash LSH`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Flsh.html\n.. _`MinHash LSH Forest`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Flshforest.html\n.. _`MinHash LSH Ensemble`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Flshensemble.html\n.. 
_`LSHBloom`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Flshbloom.html\n.. _`Minhash LSH at Scale`: http:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Flsh.html#minhash-lsh-at-scale\n.. _`HNSW`: https:\u002F\u002Fekzhu.github.io\u002Fdatasketch\u002Fdocumentation.html#hnsw\n\n贡献\n------------\n\n我们欢迎所有人的贡献。无论您是修复错误、添加功能、改进文档，还是协助测试，您的贡献都十分宝贵。\n\n开发环境设置\n^^^^^^^^^^^^^^^^^\n\n该项目使用 `uv` 进行快速可靠的 Python 包管理。请按照以下步骤设置您的开发环境：\n\n1. **安装 uv**：请按照官方安装指南 https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002Fgetting-started\u002Finstallation\u002F 进行安装。\n\n2. **克隆仓库**：\n\n   .. code-block:: bash\n\n       git clone https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch.git\n       cd datasketch\n\n3. **设置环境**：\n\n   .. code-block:: bash\n\n       # 创建虚拟环境\n       # （可选：指定 Python 版本，如 --python 3.x）\n       uv venv\n       # 激活虚拟环境（可选，uv run 命令无需激活即可运行）\n       source .venv\u002Fbin\u002Factivate\n\n       # 安装所有依赖项\n       uv sync\n\n4. **验证安装**：\n\n   .. code-block:: bash\n\n       # 运行测试以确保一切正常\n       uv run pytest\n\n5. **可选依赖项**（用于特定开发需求）：\n\n   .. code-block:: bash\n\n       # 用于测试\n       uv sync --extra test\n\n       # 用于 Cassandra 支持\n       uv sync --extra cassandra\n\n       # 用于 Redis 支持\n       uv sync --extra redis\n\n       # 安装所有额外依赖项\n       uv sync --all-extras\n\n更多关于 `uv` 的信息，请访问 https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F\n\n开发流程\n^^^^^^^^^^^^^^^^\n\n1. 如果您尚未在 GitHub 上 fork 该仓库，请先进行 fork。\n\n2. **为您的更改创建特性分支**：\n\n   .. code-block:: bash\n\n       git checkout -b feature\u002Fyour-feature-name\n       # 或者用于修复 bug：\n       git checkout -b fix\u002Fissue-description\n\n3. **按照项目的编码规范进行更改**。\n\n4. **运行测试**，以确保没有破坏现有功能：\n\n   .. code-block:: bash\n\n       uv run pytest\n\n5. **使用 ruff 检查代码质量**：\n\n   .. code-block:: bash\n\n       # 检查问题\n       uvx ruff check .\n\n       # 自动修复格式问题\n       uvx ruff format .\n\n6. **提交更改**，并附上清晰、描述性的提交信息：\n\n   .. code-block:: bash\n\n       git commit -m \"添加功能：简要说明更改内容\"\n\n7. 
**推送到您的 fork，并在 GitHub 上创建拉取请求**：\n\n   .. code-block:: bash\n\n       git push origin your-branch-name\n\n8. **响应维护者的反馈，并迭代修改您的更改**。\n\n指导原则\n^^^^^^^^^^\n\n- 遵循 PEP 8 代码风格指南\n- 为新功能编写测试\n- 根据需要更新文档\n- 保持提交内容专注且原子化\n- 在讨论中保持尊重\n\n如需更多信息，请查看 `GitHub issues \u003Chttps:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fissues>`_，了解当前的优先级或需要帮助的领域。您也可以加入关于 `项目路线图和优先级 \u003Chttps:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fdiscussions\u002F252>`_ 的讨论。","# datasketch 快速上手指南\n\ndatasketch 是一个提供概率数据结构的 Python 库，能够以极快的速度处理海量数据，同时保持极高的准确度。它适用于估算集合相似度（Jaccard）、基数统计等场景。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **Python 版本**：3.9 或更高版本\n*   **核心依赖**：\n    *   NumPy >= 1.11\n    *   Scipy\n*   **可选依赖**（根据需求安装）：\n    *   Redis（用于分布式存储）\n    *   Cassandra（用于分布式存储）\n    *   Bloom Filter（用于特定索引）\n\n## 安装步骤\n\n推荐使用 `pip` 进行安装。国内开发者建议使用清华或阿里镜像源以加速下载。\n\n### 1. 基础安装\n安装核心功能及 NumPy 依赖：\n```bash\npip install datasketch -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 2. 安装可选扩展\n如果您需要使用特定的存储后端或功能，请使用以下命令：\n\n*   **支持 Redis**：\n    ```bash\n    pip install \"datasketch[redis]\" -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n*   **支持 Cassandra**：\n    ```bash\n    pip install \"datasketch[cassandra]\" -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n*   **支持 Bloom Filter**：\n    ```bash\n    pip install \"datasketch[bloom]\" -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n## 基本使用\n\n以下示例展示如何使用 `MinHash` 估算两个集合的 Jaccard 相似度。这是 datasketch 最核心的功能之一。\n\n```python\nfrom datasketch import MinHash\n\n# 定义两个数据集（模拟文档或集合）\ndata1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for', 'estimating', 'the', 'similarity', 'between', 'datasets']\ndata2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for', 'estimating', 'the', 'similarity', 'between', 'documents']\n\n# 初始化 MinHash 对象 (num_perm 表示哈希函数的数量，通常设为 128)\nm1 = MinHash(num_perm=128)\nm2 = MinHash(num_perm=128)\n\n# 
更新数据\nfor d in data1:\n    m1.update(d.encode('utf8'))\n\nfor d in data2:\n    m2.update(d.encode('utf8'))\n\n# 估算 Jaccard 相似度\nsimilarity = m1.jaccard(m2)\n\nprint(f\"Estimated Jaccard similarity: {similarity}\")\n```\n\n**说明：**\n*   `update()` 方法用于逐个添加数据项（需编码为 bytes）。\n*   `jaccard()` 方法返回两个 sketch 之间的估算相似度（0.0 到 1.0 之间）。\n*   对于海量数据，该方法比精确计算快数个数量级，且内存占用极低。","某大型新闻聚合平台每天需处理数百万篇新文章，核心需求是实时识别并合并内容高度重复的转载稿件，以避免用户看到冗余信息。\n\n### 没有 datasketch 时\n- **计算资源爆炸**：传统两两比对算法的时间复杂度呈平方级增长，面对千万级文档库，单次全量去重任务需耗时数小时甚至导致集群内存溢出。\n- **响应严重滞后**：由于离线批处理耗时过长，新发布的文章往往在数小时后才能被标记为重复，无法实现实时的流式去重。\n- **存储成本高昂**：为了加速查询，不得不将完整的文档指纹或原始文本全部载入内存，导致服务器硬件成本居高不下。\n- **扩展性差**：随着数据量线性增加，计算时间呈指数级上升，单纯增加机器数量难以解决根本的性能瓶颈。\n\n### 使用 datasketch 后\n- **秒级完成检索**：利用 MinHash 和 LSH（局部敏感哈希）索引，将相似度搜索复杂度降至亚线性级别，亿级数据下的重复检测可在秒级内完成。\n- **实现实时流处理**：借助其极低的计算开销，系统可直接嵌入实时数据流中，新文章到达瞬间即可完成重复性判断并拦截。\n- **内存占用极低**：datasketch 将长篇文档压缩为极小的概率数据结构（Sketch），仅需原数据千分之一的内存即可维持高精度估算。\n- **弹性支撑海量数据**：结合 Redis 或 Cassandra 后端存储，轻松横向扩展以支持从百万到百亿级文档规模的增长，性能依然稳定。\n\ndatasketch 通过概率数据结构将“大数据变小”，让海量文本的实时去重从昂贵的离线批处理变成了低成本的即时操作。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fekzhu_datasketch_d83f41ff.png","ekzhu","Eric Zhu","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fekzhu_41232cf8.jpg","Building AI agent systems",null,"Seattle, WA","https:\u002F\u002Fekzhu.com","https:\u002F\u002Fgithub.com\u002Fekzhu",[81],{"name":82,"color":83,"percentage":84},"Python","#3572A5",100,2905,315,"2026-04-18T23:33:31","MIT",1,"未说明",{"notes":92,"python":93,"dependencies":94},"该工具主要用于处理大数据的概率数据结构。若需使用分布式存储功能，需额外安装 Redis 或 Cassandra 依赖。开发环境推荐使用 uv 进行包管理。","3.9+",[95,96,97,98,99],"numpy>=1.11","scipy","redis (可选)","cassandra (可选)","bloom 
(可选)",[101,14,16],"其他",[103,104,105,106,107,108,109,110,111,112,113,114,115,116],"python","lsh-forest","jaccard-similarity","hyperloglog","lsh","minhash","weighted-quantiles","top-k","search","data-sketches","data-summary","lsh-ensemble","locality-sensitive-hashing","hnsw","2026-03-27T02:49:30.150509","2026-04-20T10:24:15.886026",[120,125,130,135,140,145],{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},44340,"如何获取 LSH 中所有重复项的对（而不仅仅是查询单个 MinHash）？","如果你需要获取桶内所有记录对，可以考虑使用 [SetSimilaritySearch](https:\u002F\u002Fgithub.com\u002Fekzhu\u002FSetSimilaritySearch) 库。这种方式可以获取精确的配对结果（无误报或漏报），特别适合需要确切相似对而非概率估计的场景。","https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fissues\u002F76",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},44341,"是否支持合并多个具有相同参数的 MinHashLSH 对象？","是的，当多个 MinHashLSH 对象的初始化参数（如阈值、排列数、哈希表数量及哈希值范围）完全一致时，可以合并它们的底层哈希表。实现逻辑是：检查参数一致性，然后对哈希表进行合并（更新键和哈希值）。如果键映射不同但键不重叠，用户仍可通过检索到的键进行过滤；若键重叠且映射不同，则无法区分检索结果的正确性。该功能已在后续版本中通过代码提交实现。","https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fissues\u002F205",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},44342,"MinHashLSH 查询时能否返回具体的 Jaccard 相似度数值？","默认情况下，MinHashLSH 查询仅返回超过阈值的键，不直接返回具体的 Jaccard 相似度数值。如果需要知道具体的相似度大小，建议在查询得到候选键后，使用原始的 MinHash 对象手动计算它们之间的 Jaccard 相似度。此外，对于内存优化，可以使用 LeanMinHash 进行序列化存储，其大小取决于排列数（例如 128 个排列序列化后约为 700 字节）。","https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fissues\u002F159",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},44343,"使用 MongoDB 后端时 AsyncMinhashLSH 查询速度非常慢，如何解决？","这是因为默认情况下查询未使用 MongoDB 的索引。解决方法是在数据库中对每个 bucket 集合和 key 集合的 'key' 字段创建索引。请在 MongoDB 中对每个相关的集合运行以下命令：`db.lsh_...._bucket_#.createIndex( { key: -1 } )`。应用此修复后，查询性能可从每查询 11 秒提升至小于 1 毫秒。由于 '_id' 需要唯一性而不能直接用作业务键，因此必须显式为 'key' 字段创建索引。","https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fissues\u002F146",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},44344,"在使用 Redis 后端时，如何从 MinHashLSH 中恢复 MinHashes 以查找所有重复项？","Redis 
后端主要存储哈希带（hashbands）到 ID 的映射。若要查找所有重复项，建议采用以下策略：创建一个空字典，将哈希带（由 n 个 MinHash 值串联而成）映射到输入标识符列表。遍历每个输入对象，生成 MinHashes 并通过滑动窗口组合成哈希带，将对象 ID 添加到对应哈希带的列表中。最终，出现在同一列表中的 ID 即被视为具有相似的 Jaccard 相似度。这种方法已在大规模集群上验证有效。","https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fissues\u002F51",{"id":146,"question_zh":147,"answer_zh":148,"source_url":134},44345,"MinHashLSH 的序列化数据占用多少内存？与直接在内存中保存 Map 相比有何优势？","序列化后的大小取决于原始数据大小和使用的排列数（permutations）。例如，使用 128 个排列时，序列化后的字符串大小约为 700 字节。是否序列化取决于项目架构：对于许多场景，直接使用内存中的 Map 可能更简单；但在需要持久化或跨网络传输时，序列化（特别是使用 LeanMinHash）能显著节省空间并提供持久化能力。",[150,155,160,165,170,175,180,185,190,195,200,205,210,215,220,225,230,235,240,245],{"id":151,"version":152,"summary_zh":153,"released_at":154},351894,"v1.10.0","## 变更内容\n* @davidlowryduda 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F304 中将哈希函数更新为使用无符号 32 位整数\n* @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F308 中将 pypa\u002Fgh-action-pypi-publish 从 1.13.0 升级至 1.14.0\n* @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F306 中将 codecov\u002Fcodecov-action 从 5 升级至 6\n* @dipeshbabu 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F307 中修复了跨存储后端的缓冲 MinHashLSH 查询聚合问题\n* @ekzhu 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F312 中锁定传递性依赖，以避免 dependabot 仅更新 uv.lock 的 PR\n* @ekzhu 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F314 中将 AsyncMinHashLSH 从实验性模块移至 datasketch.aio\n* @ekzhu 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F315 中将版本从 1.9.0 升级至 1.10.0\n* @ekzhu 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F316 中添加了 Python 3.10 及以上版本的约束，并更新了 GitHub 工作流权限\n\n## 新贡献者\n* @davidlowryduda 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F304 中完成了首次贡献\n\n**完整变更日志**: 
https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.9.0...v1.10.0","2026-04-17T23:05:30",{"id":156,"version":157,"summary_zh":158,"released_at":159},351895,"v1.9.0","## 变更内容\n* 由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F294 中将 actions\u002Fcheckout 从 5 升级到 6\n* 由 @Varun0157 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F293 中使异步 `Redis` 接口与同步接口保持一致，并引入 `aioredis` 集成测试\n* 杂项：由 @bhimrazy 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F295 中将 Python 3.13 添加到 CI 工作流矩阵中\n* 由 @bhimrazy 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F296 中添加使用 pytest-cov 和 Codecov 的代码覆盖率支持\n* 杂项：由 @bhimrazy 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F298 中移除 Travis CI 配置及相关脚本\n* 杂项：由 @bhimrazy 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F280 中添加 pyright 类型检查及相应的 CI 检查\n* 由 @ekzhu 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F299 中将版本从 1.8.0 升级到 1.9.0\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.8.0...v1.9.0","2026-01-18T22:45:25",{"id":161,"version":162,"summary_zh":163,"released_at":164},351896,"v1.8.0","## 变更内容\n* 功能：实现 `MinHashLSHDeletionSession`，以加快密钥删除速度，由 @Varun0157 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F272 中完成\n* [破坏性变更] 迁移到 uv 包管理工具，并更新 CI\u002FCD 流程，由 @bhimrazy 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F271 中完成\n* 修复：修复主分支中出现的链接检查失败问题，由 @bhimrazy 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F276 中完成\n* 文档：更新环境搭建说明，并移除 `.python-version` 文件，由 @bhimrazy 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F277 中完成\n* 将 supercharge\u002Fmongodb-github-action 从 1.8.0 升级至 1.12.0，由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F279 中完成\n* 将 
actions\u002Fcheckout 从 4 升级至 5，由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F278 中完成\n* Fix: allow `prepickle` to be set to False for `Redis` storage by @Varun0157 in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F274\n* Update version to 1.7.1 by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F281\n* Fix formatting issues in README by @bhimrazy in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F283\n* Update Python version in README by @dipeshbabu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F285\n* Fix: replace `buffer` with `memoryview` for compatibility by @bhimrazy in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F289\n* Fix: introduce `Redis` integration tests and resolve key type inconsistencies between `Redis` and `Cassandra` by @Varun0157 in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F284\n* Bump supercharge\u002Fmongodb-github-action from 1.12.0 to 1.12.1 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F291\n* Feature: optional CuPy backend for MinHash.update_batch by @dipeshbabu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F286\n* Fix: use `perf_counter()` instead of the deprecated `clock()` from the time module by @dipeshbabu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F290\n* Bump version from 1.7.1 to 1.8.0 by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F292\n\n## New Contributors\n* @Varun0157 made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F272\n* @dipeshbabu made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F285\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.7.0...v1.8.0","2025-11-28T00:44:41",{"id":166,"version":167,"summary_zh":168,"released_at":169},351897,"v1.7.0","## What's Changed\n* Fix NumPy serialization for bBitMinHash by @123epsilon in 
https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F248\n* Chore: add Dependabot configuration by @bhimrazy in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F253\n* Fix (dependabot): remove versioning strategy by @bhimrazy in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F255\n* Bump actions\u002Fsetup-python from 4 to 6 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F259\n* Bump supercharge\u002Fmongodb-github-action from 1.8.0 to 1.12.0 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F260\n* Bump actions\u002Fcheckout from 3 to 5 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F262\n* Migrate the pypi.yml workflow to trusted publishing with updated actions by @Copilot in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F263\n* Add a CODEOWNERS file for the .github directory by @Copilot in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F265\n* Convert pypi.yml to a manual workflow with tag-based deployment by @Copilot in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F266\n* Chore: add PR template by @bhimrazy in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F258\n* [BREAKING] Drop official support for Python 3.7; update GitHub Actions workflows to use `uv` by @bhimrazy in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F257\n* Add LSHBloom to Datasketch by @123epsilon in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F246\n* Bump version from 1.6.5 to 1.7.0 by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F270\n\n## New Contributors\n* @bhimrazy made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F253\n* @dependabot[bot] made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F259\n* @Copilot made their first contribution in 
https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F263\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.6.5...v1.7.0","2025-11-05T05:46:51",{"id":171,"version":172,"summary_zh":173,"released_at":174},351898,"v1.6.5","## What's Changed\n* Retrieve MinHash from LSHForest by @123epsilon in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F234\n* Merge (identically configured) MinHashLSH objects by @rupeshkumaar in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F232\n* Update bBitMinHash benchmark by @123epsilon in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F238\n\n## New Contributors\n* @123epsilon made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F234\n* @rupeshkumaar made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F232\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.6.4...v1.6.5","2024-06-04T00:43:43",{"id":176,"version":177,"summary_zh":178,"released_at":179},351899,"v1.6.4","## What's Changed\n* Bug fix for HNSW by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F230\n\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.6.3...v1.6.4","2023-10-03T09:59:29",{"id":181,"version":182,"summary_zh":183,"released_at":184},351900,"v1.6.3","## What's Changed\n* Documentation update by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F224\n* HNSW: in-place removal of points by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F225\n* HNSW benchmark for Jaccard distance by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F226\n* HNSW: support soft and hard removal by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F227\n\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.6.2...v1.6.3","2023-09-12T08:10:38",{"id":186,"version":187,"summary_zh":188,"released_at":189},351901,"v1.6.2","## What's Changed\n* HNSW implemented as 
MutableMap by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F223\n\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.6.1...v1.6.2","2023-09-06T22:53:16",{"id":191,"version":192,"summary_zh":193,"released_at":194},351902,"v1.6.1","## What's Changed\n* Simplify reshape operation by @chris-ha458 in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F217\n* Implement point updates for HNSW by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F220\n* Add a dictionary interface for HNSW by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F221\n* Add documentation for HNSW by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F222\n\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.6.0...v1.6.1","2023-09-06T09:13:31",{"id":196,"version":197,"summary_zh":198,"released_at":199},351903,"v1.6.0","## What's Changed\n* Update the MinHashLSH.query docstring to detail the approximate nature of the results by @micimize in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F199\n* Fix documentation using the new template by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F202\n* Update lsh.rst by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F208\n* Benchmark ANN indexes for Jaccard distance by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F210\n* Update hashfunc.py by @chris-ha458 in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F211\n* Add HNSW index by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F218\n\n## New Contributors\n* @micimize made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F199\n* @chris-ha458 made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F211\n\n**Full Changelog**: 
https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.5.9...v1.6.0","2023-08-30T19:50:12",{"id":201,"version":202,"summary_zh":203,"released_at":204},351904,"v1.5.9","## What's Changed\r\n* Create python-publish.yml by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F191\r\n* Support numpy>=1.20.0 by @joehalliwell in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F192\r\n* Add note to documentation to address #195 by @ekzhu in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F197\r\n\r\n## New Contributors\r\n* @joehalliwell made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F192\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.5.8...v1.5.9","2023-02-19T15:27:02",{"id":206,"version":207,"summary_zh":208,"released_at":209},351905,"v1.5.8","## What's Changed\r\n* Add GitHub URL for PyPi by @andriyor in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F179\r\n* Support asyncio redis by @long2ice in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F185\r\n* Fix name construction for all values of b by @SenadI in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F190\r\n\r\n## New Contributors\r\n* @andriyor made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F179\r\n* @long2ice made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F185\r\n* @SenadI made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F190\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.5.7...v1.5.8","2022-08-21T03:02:55",{"id":211,"version":212,"summary_zh":213,"released_at":214},351906,"v1.5.7","## What's Changed\r\n* Unable to create 
multiple lsh indices each one in its own keyspace - issue #171 by @ronassa in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F172\r\n\r\n## New Contributors\r\n* @ronassa made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F172\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.5.6...v1.5.7","2022-02-04T09:05:43",{"id":216,"version":217,"summary_zh":218,"released_at":219},351907,"v1.5.6","Fixed broken packaging setup.py that missed experimental\u002Faio.","2021-12-27T20:13:28",{"id":221,"version":222,"summary_zh":223,"released_at":224},351908,"v1.5.5","## What's Changed\r\n* Adding minhash_many to WeightedMinHashGenerator. by @jroose-jv in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F165\r\n* Add query buffer by @hguhlich in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F167\r\n\r\n## New Contributors\r\n* @jroose-jv made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F165\r\n* @hguhlich made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F167\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002Fv1.5.4...v1.5.5","2021-12-16T05:59:19",{"id":226,"version":227,"summary_zh":228,"released_at":229},351909,"v1.5.4","## What's Changed\r\n* Fixes #146; MinhashLSH creates mongo index key. by @oisincar in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F148\r\n* Add `redis_buffer` configuration. 
by @QthCN in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F152\r\n* minhash: Get rid of deprecation warning by @xkubov in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F156\r\n\r\n## New Contributors\r\n* @oisincar made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F148\r\n* @QthCN made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F152\r\n* @xkubov made their first contribution in https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F156\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fcompare\u002F1.5.2...v1.5.4","2021-12-04T06:29:48",{"id":231,"version":232,"summary_zh":233,"released_at":234},351910,"1.5.2","* Performance improvement for MinHash's update method.\r\n* Make MinHash updates 4.5X faster by using the `update_batch` method for bulk updates on MinHash. See [API doc](http:\u002F\u002Fekzhu.com\u002Fdatasketch\u002Fdocumentation.html#datasketch.MinHash.update_batch).\r\n* Further performance gain from bulk generation of MinHash using `MinHash.bulk` or `MinHash.generator`. See [API doc](http:\u002F\u002Fekzhu.com\u002Fdatasketch\u002Fdocumentation.html#datasketch.MinHash.bulk) and [pull request](https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F142).\r\n* Optional compression for the MinHash LSH index by hashing the bucket key produced by `MinHashLSH._H`. See [pull request](https:\u002F\u002Fgithub.com\u002Fekzhu\u002Fdatasketch\u002Fpull\u002F143). This reduces the memory\u002Fstorage space used by the index.\r\n\r\nThank you @Sinusoidal36! \r\n\r\n","2020-12-15T20:57:53",{"id":236,"version":237,"summary_zh":238,"released_at":239},351911,"v1.5.0","* Minor bug fixes\r\n* Cassandra storage layer, thanks to @ostefano! 
Now you can specify the Cassandra config just like the Redis one.\r\n\r\n```python\r\nfrom datasketch import MinHashLSH\r\n\r\nlsh = MinHashLSH(\r\n    threshold=0.5, num_perm=128, storage_config={\r\n        'type': 'cassandra',\r\n        'cassandra': {\r\n            'seeds': ['127.0.0.1'],\r\n            'keyspace': 'lsh_test',\r\n            'replication': {\r\n                'class': 'SimpleStrategy',\r\n                'replication_factor': '1',\r\n            },\r\n            'drop_keyspace': False,\r\n            'drop_tables': False,\r\n        }\r\n    }\r\n)\r\n```","2019-11-26T00:13:31",{"id":241,"version":242,"summary_zh":243,"released_at":244},351912,"v1.4.0","MinHash and HyperLogLog now support the `hashfunc` parameter. The old parameter `hashobj` is removed.\r\n\r\n```python\r\n# Let's use MurmurHash3.\r\nimport mmh3\r\nfrom datasketch import MinHash\r\n\r\n# We need to define a new hash function that outputs an integer that\r\n# can be encoded in 32 bits.\r\ndef _hash_func(d):\r\n    # mmh3.hash returns a 32-bit integer; signed=False keeps it non-negative.\r\n    return mmh3.hash(d, signed=False)\r\n\r\n# Use this function in MinHash constructor.\r\nm = MinHash(hashfunc=_hash_func)\r\n```","2019-01-06T22:25:36",{"id":246,"version":247,"summary_zh":248,"released_at":249},351913,"v.1.3.0","Use dynamic programming to create the optimal partition, allowing the LSH Ensemble index to adapt to any set size distribution.","2018-12-27T16:02:31"]