[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-dleemiller--WordLlama":3,"tool-dleemiller--WordLlama":64},[4,18,28,36,44,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":24,"last_commit_at":25,"category_tags":26,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",143909,2,"2026-04-07T11:33:18",[14,13,27],"语言模型",{"id":29,"name":30,"github_repo":31,"description_zh":32,"stars":33,"difficulty_score":10,"last_commit_at":34,"category_tags":35,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 
的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[27,15,13,14],{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":24,"last_commit_at":42,"category_tags":43,"status":17},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[14,27],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":24,"last_commit_at":50,"category_tags":51,"status":17},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 
将是理想的起点。",85013,"2026-04-06T11:09:19",[15,16,52,53,13,54,27,14,55],"视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":17},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[13,15,14,27,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":78,"owner_website":78,"owner_url":79,"languages":80,"stars":101,"forks":102,"last_commit_at":103,"license":104,"difficulty_score":105,"env_os":106,"env_gpu":107,"env_ram":108,"env_deps":109,"category_tags":114,"github_topics":78,"view_count":24,"oss_zip_url":78,"oss_zip_packed_at":78,"status":17,"created_at":115,"updated_at":116,"faqs":117,"releases":146},5257,"dleemiller\u002FWordLlama","WordLlama","Things you can do with the token embeddings of an LLM","WordLlama 是一款轻量级、高性能的自然语言处理工具包，专为快速执行文本去重、相似度计算、文档排序、聚类分析及语义分割等任务而设计。它巧妙地复用了大型语言模型（LLM）中的词元嵌入（token embeddings），通过简单的查找与平均池化机制生成文本向量，从而在无需复杂推理依赖的情况下实现高效的语义理解。\n\n针对传统嵌入模型部署成本高、推理速度慢的痛点，WordLlama 进行了极致的 CPU 优化，显著降低了资源消耗，使其能在配置受限的环境中流畅运行。无论是需要清洗海量数据的工程师，还是追求快速原型的算法研究人员，都能利用它轻松解决文本匹配与组织难题。\n\n其技术亮点包括支持二值化嵌入以进一步加速计算、具备“套娃”式维度截断灵活性，以及原生兼容 Model2Vec 静态嵌入。此外，WordLlama 还能直接作为 Python 标准库函数（如 
sorted、max）的键值参数使用，让开发者仅需几行代码即可实现智能排序与检索。如果你正在寻找一个无需显卡、即插即用的 NLP 解决方案，WordLlama 将是理想之选。","# WordLlama 📝🦙\n\n**WordLlama** is a fast, lightweight NLP toolkit designed for tasks like fuzzy deduplication, similarity computation, ranking, clustering, and semantic text splitting. It operates with minimal inference-time dependencies and is optimized for CPU hardware, making it suitable for deployment in resource-constrained environments.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdleemiller_WordLlama_readme_70a56b630177.png\" alt=\"Word Llama\" width=\"90%\">\n\u003C\u002Fp>\n\n## News and Updates 🔥\n\n- **2025-02-01**  Callable for stdlib functions (sorted\u002Fmin\u002Fmax)\n- **2025-01-04**  We're excited to announce support for model2vec static embeddings. See also: [Model2Vec](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fmodel2vec)\n- **2024-10-04**  Added semantic splitting inference algorithm. See our [technical overview](tutorials\u002Fblog\u002Fsemantic_split\u002Fwl_semantic_blog.md).\n\n## Table of Contents\n\n- [Quick Start](#quick-start)\n- [Features](#features)\n- [What is WordLlama?](#what-is-wordllama)\n- [MTEB Results](#mteb-results)\n- [How Fast?](#how-fast-zap)\n- [Usage](#usage)\n  - [Embedding Text](#embedding-text)\n  - [Stdlib Sorted\u002FMin\u002FMax](#stdlib-sorted-min-max)\n  - [Calculating Similarity](#calculating-similarity)\n  - [Ranking Documents](#ranking-documents)\n  - [Fuzzy Deduplication](#fuzzy-deduplication)\n  - [Clustering](#clustering)\n  - [Filtering](#filtering)\n  - [Top-K Retrieval](#top-k-retrieval)\n  - [Semantic Text Splitting](#semantic-text-splitting)\n  - [Loading Model2Vec](#loading-model2vec)\n  - [Inference Class](#inference-class)\n- [Training Notes](#training-notes)\n- [Roadmap](#roadmap)\n- [Extracting Token Embeddings](#extracting-token-embeddings)\n- [Community Projects](#community-projects)\n- [Citations](#citations)\n- [License](#license)\n\n## Quick 
Start\n\nInstall WordLlama via pip:\n\n```bash\npip install wordllama\n```\n\nLoad the default 256-dimensional model:\n\n```python\nfrom wordllama import WordLlama\n\n# Load the default WordLlama model\nwl = WordLlama.load()\n\nquery = \"Machine learning methods\"\ncandidates = [\n    \"Foundations of neural science\",\n    \"Introduction to neural networks\",\n    \"Cooking delicious pasta at home\",\n    \"Introduction to philosophy: logic\",\n]\n\n# Returns a Callable[[str], float] function\nsim_key = wl.key(query)\n\n# Sort candidates, most similar first\nsorted_candidates = sorted(candidates, key=sim_key, reverse=True)\n\n# Most similar candidate\nbest_candidate = max(candidates, key=sim_key)\n\n# Print the results\nprint(\"Ranked Candidates:\")\nfor i, candidate in enumerate(sorted_candidates, 1):\n    print(f\"{i}. {candidate} (Score: {sim_key(candidate):.4f})\")\n\nprint(f\"\\nBest Match: {best_candidate} (Score: {sim_key(best_candidate):.4f})\")\n\n# Ranked Candidates:\n# 1. Introduction to neural networks (Score: 0.3414)\n# 2. Foundations of neural science (Score: 0.2115)\n# 3. Introduction to philosophy: logic (Score: 0.1067)\n# 4. 
Cooking delicious pasta at home (Score: 0.0045)\n#\n# Best Match: Introduction to neural networks (Score: 0.3414)\n```\n\n## Features\n\n- **Fast Embeddings**: Efficiently generate text embeddings using a simple token lookup with average pooling.\n- **Similarity Computation**: Calculate cosine similarity between texts.\n- **Ranking**: Rank documents based on their similarity to a query.\n- **Fuzzy Deduplication**: Remove duplicate texts based on a similarity threshold.\n- **Clustering**: Cluster documents into groups using KMeans clustering.\n- **Filtering**: Filter documents based on their similarity to a query.\n- **Top-K Retrieval**: Retrieve the top-K most similar documents to a query.\n- **Semantic Text Splitting**: Split text into semantically coherent chunks.\n- **Binary Embeddings**: Support for binary embeddings with Hamming similarity for even faster computations.\n- **Matryoshka Representations**: Truncate embedding dimensions as needed for flexibility.\n- **Low Resource Requirements**: Optimized for CPU inference with minimal dependencies.\n\n## What is WordLlama?\n\nWordLlama is a utility for natural language processing (NLP) that recycles components from large language models (LLMs) to create efficient and compact word representations, similar to GloVe, Word2Vec, or FastText.\n\nStarting by extracting the token embedding codebook from state-of-the-art LLMs (e.g., LLaMA 2, LLaMA 3 70B), WordLlama trains a small context-less model within a general-purpose embedding framework. This approach results in a lightweight model that improves on all MTEB benchmarks over traditional word models like GloVe 300d, while being substantially smaller in size (e.g., **16MB default model** at 256 dimensions).\n\nWordLlama's key features include:\n\n1. **Matryoshka Representations**: Allows for truncation of the embedding dimension as needed, providing flexibility in model size and performance.\n2. 
**Low Resource Requirements**: Utilizes a simple token lookup with average pooling, enabling fast operation on CPUs without the need for GPUs.\n3. **Binary Embeddings**: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations.\n4. **Numpy-only Inference**: Lightweight inference pipeline relying solely on NumPy, facilitating easy deployment and integration.\n\nBecause of its fast and portable size, WordLlama serves as a versatile tool for exploratory analysis and utility applications, such as LLM output evaluators or preparatory tasks in multi-hop or agentic workflows.\n\n## MTEB Results\n\nThe following table presents the performance of WordLlama models compared to other similar models.\n\n| Metric                 | WL64        | WL128        | WL256 (X)    | WL512        | WL1024        | GloVe 300d | Komninos | all-MiniLM-L6-v2 |\n|------------------------|-------------|--------------|--------------|--------------|---------------|------------|----------|------------------|\n| Clustering             | 30.27       | 32.20        | 33.25        | 33.40        | 33.62         | 27.73      | 26.57    | 42.35            |\n| Reranking              | 50.38       | 51.52        | 52.03        | 52.32        | 52.39         | 43.29      | 44.75    | 58.04            |\n| Classification         | 53.14       | 56.25        | 58.21        | 59.13        | 59.50         | 57.29      | 57.65    | 63.05            |\n| Pair Classification    | 75.80       | 77.59        | 78.22        | 78.50        | 78.60         | 70.92      | 72.94    | 82.37            |\n| STS                    | 66.24       | 67.53        | 67.91        | 68.22        | 68.27         | 61.85      | 62.46    | 78.90            |\n| CQA DupStack           | 18.76       | 22.54        | 24.12        | 24.59        | 24.83         | 15.47      | 16.79    | 41.32            |\n| SummEval               | 30.79       | 29.99     
   | 30.99        | 29.56        | 29.39         | 28.87      | 30.49    | 30.81            |\n\n**WL64** to **WL1024**: WordLlama models with embedding dimensions ranging from 64 to 1024.\n\n**Note**: The [l2_supercat](https:\u002F\u002Fhuggingface.co\u002Fdleemiller\u002Fword-llama-l2-supercat) is a LLaMA 2 vocabulary model. To train this model, we concatenated codebooks from several models, including LLaMA 2 70B and phi 3 medium, after removing additional special tokens. Because several models have used the LLaMA 2 tokenizer, their codebooks can be concatenated and trained together. The performance of the resulting model is comparable to training the LLaMA 3 70B codebook, while being 4x smaller (32k vs. 128k vocabulary).\n\n### Other Models\n\n- LLaMA 3-based: [l3_supercat](https:\u002F\u002Fhuggingface.co\u002Fdleemiller\u002Fwordllama-l3-supercat)\n- [Results](wordllama\u002FRESULTS.md)\n\n## How Fast? :zap:\n\n8k documents from the `ag_news` dataset\n- Single core performance (CPU), i9 12th gen, DDR4 3200\n- NVIDIA A4500 (GPU)\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdleemiller_WordLlama_readme_6c4540cb51ac.png\" alt=\"Word Llama\" width=\"80%\">\n\u003C\u002Fp>\n\n## Usage\n\n### Embedding Text\n\nLoad pre-trained embeddings and embed text:\n\n```python\nfrom wordllama import WordLlama\n\n# Load pre-trained embeddings (truncate dimension to 64)\nwl = WordLlama.load(trunc_dim=64)\n\n# Embed text\nembeddings = wl.embed([\"The quick brown fox jumps over the lazy dog\", \"And all that jazz\"])\nprint(embeddings.shape)  # Output: (2, 64)\n```\n\n### Stdlib Examples\n\nReturn a Callable function from `.key(query)`.\n\n```python\nquery = \"Machine learning methods\"\ncandidates = [\n    \"Foundations of neural science\",\n    \"Introduction to neural networks\",\n    \"Cooking delicious pasta at home\",\n    \"Introduction to philosophy: logic\",\n]\n\n# Returns a Callable[[str], float] function\nsim_key 
= wl.key(query)\n\n# Sort candidates, most similar first\nsorted_candidates = sorted(candidates, key=sim_key, reverse=True)\n\n# Most similar candidate\nbest_candidate = max(candidates, key=sim_key)\n\n# Print the results\nprint(\"Ranked Candidates:\")\nfor i, candidate in enumerate(sorted_candidates, 1):\n    print(f\"{i}. {candidate} (Score: {sim_key(candidate):.4f})\")\n\nprint(f\"\\nBest Match: {best_candidate} (Score: {sim_key(best_candidate):.4f})\")\n\n# Ranked Candidates:\n# 1. Introduction to neural networks (Score: 0.3414)\n# 2. Foundations of neural science (Score: 0.2115)\n# 3. Introduction to philosophy: logic (Score: 0.1067)\n# 4. Cooking delicious pasta at home (Score: 0.0045)\n#\n# Best Match: Introduction to neural networks (Score: 0.3414)\n```\n\n### Calculating Similarity\n\nCompute the similarity between two texts:\n\n```python\nsimilarity_score = wl.similarity(\"I went to the car\", \"I went to the pawn shop\")\nprint(similarity_score)  # Output: e.g., 0.0664\n```\n\n### Ranking Documents\n\nRank documents based on their similarity to a query:\n\n```python\nquery = \"I went to the car\"\ncandidates = [\"I went to the park\", \"I went to the shop\", \"I went to the truck\", \"I went to the vehicle\"]\nranked_docs = wl.rank(query, candidates, sort=True, batch_size=64)\nprint(ranked_docs)\n# Output:\n# [\n#   ('I went to the vehicle', 0.7441),\n#   ('I went to the truck', 0.2832),\n#   ('I went to the shop', 0.1973),\n#   ('I went to the park', 0.1510)\n# ]\n```\n\n### Fuzzy Deduplication\n\nRemove duplicate texts based on a similarity threshold:\n\n```python\ndeduplicated_docs = wl.deduplicate(candidates, return_indices=False, threshold=0.5)\nprint(deduplicated_docs)\n# Output:\n# ['I went to the park',\n#  'I went to the shop',\n#  'I went to the truck']\n```\n\n### Clustering\n\nCluster documents into groups using KMeans clustering:\n\n```python\nlabels, inertia = wl.cluster(candidates, k=3, max_iterations=100, tolerance=1e-4, 
n_init=3)\nprint(labels, inertia)\n# Output:\n# [2, 0, 1, 1], 0.4150\n```\n\n### Filtering\n\nFilter documents based on their similarity to a query:\n\n```python\nfiltered_docs = wl.filter(query, candidates, threshold=0.3)\nprint(filtered_docs)\n# Output:\n# ['I went to the vehicle']\n```\n\n### Top-K Retrieval\n\nRetrieve the top-K most similar documents to a query:\n\n```python\ntop_docs = wl.topk(query, candidates, k=2)\nprint(top_docs)\n# Output:\n# ['I went to the vehicle', 'I went to the truck']\n```\n\n### Semantic Text Splitting\n\nSplit text into semantic chunks:\n\n```python\nlong_text = \"Your very long text goes here... \" * 100\nchunks = wl.split(long_text, target_size=1536)\n\nprint(list(map(len, chunks)))\n# Output: [1055, 1055, 1187]\n```\n\nNote that the target size is also the maximum size. The `.split()` feature attempts to aggregate sections up to the `target_size`,\nbut preserves the order of the text, as well as sentence and, as much as possible, paragraph structure.\nIt uses wordllama embeddings to locate more natural indices to split on. As a result, the output will contain a range of chunk sizes\nup to the target size.\n\nThe recommended target size is from 512 to 2048 characters, with the default at 1536. Chunks that need to be much larger should\nprobably be batched after splitting, and will often already be aggregated from multiple semantic chunks.\n\nFor more information see: [technical overview](tutorials\u002Fblog\u002Fsemantic_split\u002Fwl_semantic_blog.md)\n\n\n### Loading Model2Vec\n\n```python\nconfigs = WordLlama.list_configs()\n# dict of config names\n\nwl = WordLlama.load_m2v(\"potion_base_8m\") # 256-dim model\nwl = WordLlama.load_m2v(\"m2v_multilingual\") # multilingual model\n```\n\nModel2Vec is a different approach to creating static embeddings, using PCA.\nNotably, they have produced multilingual and GloVe-based models, which score well in word similarity tasks.\n\nCheck them out on Hugging Face: 
[minishlab](https:\u002F\u002Fhuggingface.co\u002Fminishlab)\n\n\n### Inference Class\n\n```python\nfrom wordllama import WordLlamaInference\nfrom tokenizers import Tokenizer\n\ntokenizer = Tokenizer.from_pretrained(...)\nwl = WordLlamaInference(np_embeddings_ar, tokenizer)\n```\n\nThe inference class can be used directly with a bring-your-own static embeddings array (n_vocab, dim), rather than using the loader.\n\n\n## Training Notes\n\nBinary embedding models showed more pronounced improvement at higher dimensions, and either 512 or 1024 dimensions are recommended for binary embeddings.\n\nThe L2 Supercat model was trained using a batch size of 512 on a single A100 GPU for 12 hours.\n\n## Roadmap\n\n- **Additional Example Notebooks**:\n  - DSPy evaluators\n  - Retrieval-Augmented Generation (RAG) pipelines\n\n## Development\n\nFor local development:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama.git\ncd WordLlama\npip install uv\nuv sync --all-extras\nuv run python setup.py build_ext --inplace\nuv run pytest\n```\n\nSee the [Makefile](Makefile) for common development commands.\n\n## Extracting Token Embeddings\n\nTo extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for LLaMA models). You can then use the following snippet:\n\n```python\nfrom wordllama.extract.extract_safetensors import extract_safetensors\n\n# Extract embeddings for the specified configuration\nextract_safetensors(\"llama3_70B\", \"path\u002Fto\u002Fsaved\u002Fmodel-0001-of-00XX.safetensors\")\n```\n\n**Hint**: Embeddings are usually in the first `safetensors` file, but not always. Sometimes there is a manifest; sometimes you have to inspect and figure it out.\n\nFor training, use the scripts in the GitHub repository. 
You have to add a configuration file (copy\u002Fmodify an existing one into the folder).\n\n```bash\npip install wordllama[train]\npython train.py train --config your_new_config\n# (Training process begins)\npython train.py save --config your_new_config --checkpoint ... --outdir \u002Fpath\u002Fto\u002Fweights\u002F\n# (Saves one model per Matryoshka dimension)\n```\n\n## Community Projects\n\n- [Gradio Demo HF Space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002F1littlecoder\u002Fwordllama)\n- [CPU-ish RAG](https:\u002F\u002Fgithub.com\u002Fdinhanhx\u002Fcpu-ish-rag)\n\n## Citations\n\nIf you use WordLlama in your research or project, please consider citing it as follows:\n\n```bibtex\n@software{miller2024wordllama,\n  author = {Miller, D. Lee},\n  title = {WordLlama: Recycled Token Embeddings from Large Language Models},\n  year = {2024},\n  url = {https:\u002F\u002Fgithub.com\u002Fdleemiller\u002Fwordllama},\n  version = {0.3.9}\n}\n```\n\n## License\n\nThis project is licensed under the MIT License.\n","# WordLlama 📝🦙\n\n**WordLlama** 是一个快速、轻量级的自然语言处理工具包，专为模糊去重、相似度计算、排序、聚类以及语义文本分割等任务而设计。它在推理时依赖极少，并针对 CPU 硬件进行了优化，因此非常适合部署在资源受限的环境中。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdleemiller_WordLlama_readme_70a56b630177.png\" alt=\"Word Llama\" width=\"90%\">\n\u003C\u002Fp>\n\n## 新闻与更新 🔥\n\n- **2025-02-01**  支持标准库函数（sorted\u002Fmin\u002Fmax）的可调用接口\n- **2025-01-04**  我们很高兴地宣布支持 model2vec 静态嵌入。更多信息请参阅：[Model2Vec](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fmodel2vec)\n- **2024-10-04**  添加了语义分割推理算法。请查看我们的[技术概述](tutorials\u002Fblog\u002Fsemantic_split\u002Fwl_semantic_blog.md)。\n\n## 目录\n\n- [快速入门](#quick-start)\n- [功能特性](#features)\n- [什么是 WordLlama？](#what-is-wordllama)\n- [MTEB 测试结果](#mteb-results)\n- [速度有多快？](#how-fast-zap)\n- [使用方法](#usage)\n  - [文本嵌入](#embedding-text)\n  - [标准库排序\u002F最小值\u002F最大值](#stdlib-sorted-min-max)\n  - [计算相似度](#calculating-similarity)\n  - [文档排序](#ranking-documents)\n  - 
[模糊去重](#fuzzy-deduplication)\n  - [聚类](#clustering)\n  - [过滤](#filtering)\n  - [Top-K 检索](#top-k-retrieval)\n  - [语义文本分割](#semantic-text-splitting)\n  - [加载 Model2Vec](#loading-model2vec)\n  - [推理类](#inference-class)\n- [训练笔记](#training-notes)\n- [路线图](#roadmap)\n- [提取词元嵌入](#extracting-token-embeddings)\n- [社区项目](#community-projects)\n- [引用](#citations)\n- [许可证](#license)\n\n## 快速入门\n\n通过 pip 安装 WordLlama：\n\n```bash\npip install wordllama\n```\n\n加载默认的 256 维模型：\n\n```python\nfrom wordllama import WordLlama\n\n# 加载默认的 WordLlama 模型\nwl = WordLlama.load()\n\nquery = \"机器学习方法\"\ncandidates = [\n    \"神经科学基础\",\n    \"神经网络导论\",\n    \"在家烹饪美味意大利面\",\n    \"哲学导论：逻辑学\",\n]\n\n# 返回一个 Callable[[str], float] 函数\nsim_key = wl.key(query)\n\n# 对候选列表按相似度从高到低排序\nsorted_candidates = sorted(candidates, key=sim_key, reverse=True)\n\n# 最相似的候选\nbest_candidate = max(candidates, key=sim_key)\n\n# 打印结果\nprint(\"排序后的候选列表：\")\nfor i, candidate in enumerate(sorted_candidates, 1):\n    print(f\"{i}. {candidate} (得分: {sim_key(candidate):.4f})\")\n\nprint(f\"\\n最佳匹配：{best_candidate} (得分: {sim_key(best_candidate):.4f})\")\n\n# 排序后的候选列表：\n# 1. 神经网络导论 (得分: 0.3414)\n# 2. 神经科学基础 (得分: 0.2115)\n# 3. 哲学导论：逻辑学 (得分: 0.1067)\n# 4. 在家烹饪美味意大利面 (得分: 0.0045)\n#\n# 最佳匹配：神经网络导论 (得分: 0.3414)\n```\n\n## 功能特性\n\n- **快速嵌入**：通过简单的词元查找和平均池化，高效生成文本嵌入。\n- **相似度计算**：计算文本之间的余弦相似度。\n- **排序**：根据文档与查询的相似度对文档进行排序。\n- **模糊去重**：基于相似度阈值去除重复文本。\n- **聚类**：使用 KMeans 聚类算法将文档分组。\n- **过滤**：根据文档与查询的相似度筛选文档。\n- **Top-K 检索**：检索与查询最相似的前 K 个文档。\n- **语义文本分割**：将文本分割成语义连贯的块。\n- **二进制嵌入**：支持二进制嵌入和汉明相似度，以实现更快的计算。\n- **套娃式表示**：可根据需要截断嵌入维度，提供更大的灵活性。\n- **低资源需求**：针对 CPU 推理进行了优化，依赖项极少。\n\n## 什么是 WordLlama？\n\nWordLlama 是一种自然语言处理工具，它复用了大型语言模型（LLMs）中的组件，以创建高效且紧凑的词表示，类似于 GloVe、Word2Vec 或 FastText。\n\nWordLlama 首先从最先进的 LLMs（例如 LLaMA 2、LLaMA 3 70B）中提取词元嵌入码本，然后在一个通用的嵌入框架内训练一个小规模的无上下文模型。这种方法生成了一个轻量级模型，在所有 MTEB 基准测试上都优于传统的词模型（如 GloVe 300d），同时体积也小得多（例如，**256 维的默认模型仅 16MB**）。\n\nWordLlama 的主要特点包括：\n\n1. **套娃式表示**：允许根据需要截断嵌入维度，从而在模型大小和性能之间取得灵活平衡。\n2. 
**低资源需求**：采用简单的词元查找结合平均池化，可在 CPU 上快速运行，无需 GPU。\n3. **二进制嵌入**：使用直通估计器训练的模型可以打包成小型整数数组，以实现更快的汉明距离计算。\n4. **纯 NumPy 推理**：轻量级的推理流程完全依赖 NumPy，便于部署和集成。\n\n由于其速度快、体积小且便携，WordLlama 可作为探索性分析和实用应用程序的多功能工具，例如用于评估 LLM 输出或在多跳式、代理式工作流中执行预处理任务。\n\n## MTEB 评测结果\n\n下表展示了 WordLlama 模型与其他类似模型的性能对比。\n\n| 指标                 | WL64        | WL128        | WL256 (X)    | WL512        | WL1024        | GloVe 300d | Komninos | all-MiniLM-L6-v2 |\n|------------------------|-------------|--------------|--------------|--------------|---------------|------------|----------|------------------|\n| 聚类                 | 30.27       | 32.20        | 33.25        | 33.40        | 33.62         | 27.73      | 26.57    | 42.35            |\n| 重排序               | 50.38       | 51.52        | 52.03        | 52.32        | 52.39         | 43.29      | 44.75    | 58.04            |\n| 分类                 | 53.14       | 56.25        | 58.21        | 59.13        | 59.50         | 57.29      | 57.65    | 63.05            |\n| 对分类               | 75.80       | 77.59        | 78.22        | 78.50        | 78.60         | 70.92      | 72.94    | 82.37            |\n| STS                  | 66.24       | 67.53        | 67.91        | 68.22        | 68.27         | 61.85      | 62.46    | 78.90            |\n| CQA DupStack           | 18.76       | 22.54        | 24.12        | 24.59        | 24.83         | 15.47      | 16.79    | 41.32            |\n| SummEval               | 30.79       | 29.99        | 30.99        | 29.56        | 29.39         | 28.87      | 30.49    | 30.81            |\n\n**WL64** 至 **WL1024**：WordLlama 模型，嵌入维度从 64 到 1024 不等。\n\n**注**：[l2_supercat](https:\u002F\u002Fhuggingface.co\u002Fdleemiller\u002Fword-llama-l2-supercat) 是一个 LLaMA 2 词汇表模型。为了训练该模型，我们在移除额外的特殊标记后，将来自多个模型（包括 LLaMA 2 70B 和 phi 3 medium）的代码本拼接在一起。由于多个模型使用了 LLaMA 2 的分词器，它们的代码本可以拼接并一起训练。最终得到的模型性能与训练 LLaMA 3 70B 代码本相当，但规模仅为前者的四分之一（32k 词汇量 vs. 
128k 词汇量）。\n\n### 其他模型\n\n- 基于 LLaMA 3 的：[l3_supercat](https:\u002F\u002Fhuggingface.co\u002Fdleemiller\u002Fwordllama-l3-supercat)\n- [结果](wordllama\u002FRESULTS.md)\n\n## 运行速度如何？ :zap:\n\n`ag_news` 数据集中的 8,000 条文档\n- 单核性能（CPU），i9 第 12 代，DDR4 3200\n- NVIDIA A4500（GPU）\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdleemiller_WordLlama_readme_6c4540cb51ac.png\" alt=\"Word Llama\" width=\"80%\">\n\u003C\u002Fp>\n\n## 使用方法\n\n### 文本嵌入\n\n加载预训练的嵌入并对文本进行嵌入：\n\n```python\nfrom wordllama import WordLlama\n\n# 加载预训练的嵌入（截断维度至 64）\nwl = WordLlama.load(trunc_dim=64)\n\n# 对文本进行嵌入\nembeddings = wl.embed([\"The quick brown fox jumps over the lazy dog\", \"And all that jazz\"])\nprint(embeddings.shape)  # 输出：(2, 64)\n```\n\n### 标准库示例\n\n通过 `.key(query)` 返回一个可调用函数。\n\n```python\nquery = \"Machine learning methods\"\ncandidates = [\n    \"Foundations of neural science\",\n    \"Introduction to neural networks\",\n    \"Cooking delicious pasta at home\",\n    \"Introduction to philosophy: logic\",\n]\n\n# 返回一个 Callable[[str], float] 函数\nsim_key = wl.key(query)\n\n# 对候选者按相似度排序，最相似的排在前面\nsorted_candidates = sorted(candidates, key=sim_key, reverse=True)\n\n# 最相似的候选者\nbest_candidate = max(candidates, key=sim_key)\n\n# 打印结果\nprint(\"Ranked Candidates:\")\nfor i, candidate in enumerate(sorted_candidates, 1):\n    print(f\"{i}. {candidate} (Score: {sim_key(candidate):.4f})\")\n\nprint(f\"\\nBest Match: {best_candidate} (Score: {sim_key(best_candidate):.4f})\")\n\n# Ranked Candidates:\n# 1. Introduction to neural networks (Score: 0.3414)\n# 2. Foundations of neural science (Score: 0.2115)\n# 3. Introduction to philosophy: logic (Score: 0.1067)\n# 4. 
Cooking delicious pasta at home (Score: 0.0045)\n#\n# Best Match: Introduction to neural networks (Score: 0.3414)\n```\n\n### 计算相似度\n\n计算两段文本之间的相似度：\n\n```python\nsimilarity_score = wl.similarity(\"I went to the car\", \"I went to the pawn shop\")\nprint(similarity_score)  # 输出：例如 0.0664\n```\n\n### 文档排名\n\n根据文档与查询的相似度对文档进行排名：\n\n```python\nquery = \"I went to the car\"\ncandidates = [\"I went to the park\", \"I went to the shop\", \"I went to the truck\", \"I went to the vehicle\"]\nranked_docs = wl.rank(query, candidates, sort=True, batch_size=64)\nprint(ranked_docs)\n# 输出：\n# [\n#   ('I went to the vehicle', 0.7441),\n#   ('I went to the truck', 0.2832),\n#   ('I went to the shop', 0.1973),\n#   ('I went to the park', 0.1510)\n# ]\n```\n\n### 模糊去重\n\n根据相似度阈值去除重复文本：\n\n```python\ndeduplicated_docs = wl.deduplicate(candidates, return_indices=False, threshold=0.5)\nprint(deduplicated_docs)\n# 输出：\n# ['I went to the park',\n#  'I went to the shop',\n#  'I went to the truck']\n```\n\n### 聚类\n\n使用 KMeans 聚类算法将文档聚为若干组：\n\n```python\nlabels, inertia = wl.cluster(candidates, k=3, max_iterations=100, tolerance=1e-4, n_init=3)\nprint(labels, inertia)\n# 输出：\n# [2, 0, 1, 1], 0.4150\n```\n\n### 过滤\n\n根据文档与查询的相似度进行过滤：\n\n```python\nfiltered_docs = wl.filter(query, candidates, threshold=0.3)\nprint(filtered_docs)\n# 输出：\n# ['I went to the vehicle']\n```\n\n### Top-K 检索\n\n检索与查询最相似的前 K 个文档：\n\n```python\ntop_docs = wl.topk(query, candidates, k=2)\nprint(top_docs)\n# 输出：\n# ['I went to the vehicle', 'I went to the truck']\n```\n\n### 语义文本分割\n\n将文本分割成语义块：\n\n```python\nlong_text = \"Your very long text goes here... 
\" * 100\nchunks = wl.split(long_text, target_size=1536)\n\nprint(list(map(len, chunks)))\n# 输出：[1055, 1055, 1187]\n```\n\n请注意，目标大小也是最大大小。`.split()` 功能会尝试聚合不超过 `target_size` 的部分，同时保留文本的顺序以及句子和尽可能多的段落结构。它使用 wordllama 嵌入来找到更自然的分割点。因此，输出中会出现一系列大小不一的块，但都不会超过目标大小。\n\n推荐的目标大小范围是 512 到 2048 个字符，默认大小为 1536。如果需要更大的块，可能需要在分割后进行批量处理，并且通常会由多个语义块组合而成。\n\n更多信息请参阅：[技术概述](tutorials\u002Fblog\u002Fsemantic_split\u002Fwl_semantic_blog.md)\n\n\n### 加载 Model2Vec\n\n```python\nwl = WordLlama.list_configs()\n# 配置名称字典\n\nwl = WordLlama.load_m2v(\"potion_base_8m\") # 256 维模型\nwl = WordLlama.load_m2v(\"m2v_multilingual\") # 多语言模型\n```\n\nModel2Vec 是一种使用 PCA 创建静态嵌入的不同方法。\n值得注意的是，他们已经开发出多语言模型和基于 GloVe 的模型，在词语相似度任务中表现优异。\n\n欢迎前往 Hugging Face 查看！[minishlab](https:\u002F\u002Fhuggingface.co\u002Fminishlab)\n\n### 推理类\n\n```python\nfrom wordllama import WordLlamaInference\nfrom tokenizers import Tokenizer\n\ntokenizer = Tokenizer.from_pretrained(...)\nwl = WordLlamaInference(np_embeddings_ar, tokenizer)\n```\n\n推理类可以直接与用户自定义的静态嵌入数组（n_vocab, dim）一起使用，而无需使用加载器。\n\n\n## 训练注意事项\n\n二值化嵌入模型在较高维度下表现出了更为显著的提升，因此建议为二值化嵌入选择512或1024维。\n\nL2 Supercat 模型是在单张 A100 GPU 上以 512 的批量大小训练了 12 小时。\n\n## 路线图\n\n- **更多示例笔记本**：\n  - DSPy 评估器\n  - 检索增强生成（RAG）流水线\n\n## 开发\n\n本地开发步骤如下：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama.git\ncd WordLlama\npip install uv\nuv sync --all-extras\nuv run python setup.py build_ext --inplace\nuv run pytest\n```\n\n更多常用开发命令请参阅 [Makefile](Makefile)。\n\n## 提取 Token 嵌入\n\n要从模型中提取 Token 嵌入，请确保已同意用户协议并使用 Hugging Face CLI 登录（适用于 LLaMA 模型）。然后可以使用以下代码片段：\n\n```python\nfrom wordllama.extract.extract_safetensors import extract_safetensors\n\n# 提取指定配置的嵌入\nextract_safetensors(\"llama3_70B\", \"path\u002Fto\u002Fsaved\u002Fmodel-0001-of-00XX.safetensors\")\n```\n\n**提示**：嵌入通常位于第一个 `safetensors` 文件中，但并不总是如此。有时会有清单文件，有时则需要手动检查并确定。\n\n训练时，请使用 GitHub 仓库中的脚本。您需要添加一个配置文件（复制或修改现有配置文件并放入相应文件夹）。\n\n```bash\npip install wordllama[train]\npython train.py train --config 
your_new_config\n# （开始训练过程）\npython train.py save --config your_new_config --checkpoint ... --outdir \u002Fpath\u002Fto\u002Fweights\u002F\n# （按每个 Matryoshka 维度保存一个模型）\n```\n\n## 社区项目\n\n- [Gradio 演示 HF Space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002F1littlecoder\u002Fwordllama)\n- [CPU-ish RAG](https:\u002F\u002Fgithub.com\u002Fdinhanhx\u002Fcpu-ish-rag)\n\n## 引用\n\n如果您在研究或项目中使用 WordLlama，请考虑按照以下方式引用：\n\n```bibtex\n@software{miller2024wordllama,\n  author = {Miller, D. Lee},\n  title = {WordLlama: 从大型语言模型中回收的 Token 嵌入},\n  year = {2024},\n  url = {https:\u002F\u002Fgithub.com\u002Fdleemiller\u002Fwordllama},\n  version = {0.3.9}\n}\n```\n\n## 许可证\n\n本项目采用 MIT 许可证授权。","# WordLlama 快速上手指南\n\nWordLlama 是一个轻量级、高速的 NLP 工具包，专为模糊去重、相似度计算、排序、聚类和语义文本分割等任务设计。它仅依赖 CPU 运行，推理时无需重型框架，非常适合资源受限的环境。\n\n## 环境准备\n\n- **操作系统**：Linux, macOS, Windows\n- **Python 版本**：Python 3.8+\n- **硬件要求**：仅需 CPU（已针对 CPU 推理优化），无需 GPU\n- **核心依赖**：`numpy`（安装时会自动处理）\n\n## 安装步骤\n\n使用 pip 直接安装官方版本：\n\n```bash\npip install wordllama\n```\n\n> **提示**：国内开发者若遇到下载速度慢的问题，可使用清华或阿里镜像源加速安装：\n> ```bash\n> pip install wordllama -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 基本使用\n\n以下示例展示如何加载默认模型并对候选文本进行相似度排序。\n\n### 1. 加载模型与相似度排序\n\n```python\nfrom wordllama import WordLlama\n\n# 加载默认的 256 维 WordLlama 模型\nwl = WordLlama.load()\n\nquery = \"Machine learning methods\"\ncandidates = [\n    \"Foundations of neural science\",\n    \"Introduction to neural networks\",\n    \"Cooking delicious pasta at home\",\n    \"Introduction to philosophy: logic\",\n]\n\n# 生成一个可调用函数，用于计算与 query 的相似度\nsim_key = wl.key(query)\n\n# 按相似度降序排序\nsorted_candidates = sorted(candidates, key=sim_key, reverse=True)\n\n# 获取最相似的候选项\nbest_candidate = max(candidates, key=sim_key)\n\n# 打印结果\nprint(\"Ranked Candidates:\")\nfor i, candidate in enumerate(sorted_candidates, 1):\n    print(f\"{i}. 
{candidate} (Score: {sim_key(candidate):.4f})\")\n\nprint(f\"\\nBest Match: {best_candidate} (Score: {sim_key(best_candidate):.4f})\")\n```\n\n**输出示例：**\n```text\nRanked Candidates:\n1. Introduction to neural networks (Score: 0.3414)\n2. Foundations of neural science (Score: 0.2115)\n3. Introduction to philosophy: logic (Score: 0.1067)\n4. Cooking delicious pasta at home (Score: 0.0045)\n\nBest Match: Introduction to neural networks (Score: 0.3414)\n```\n\n### 2. 其他常用功能速览\n\n除了排序，WordLlama 还支持多种开箱即用的功能：\n\n*   **计算相似度**：`wl.similarity(\"文本 A\", \"文本 B\")`\n*   **文档排序**：`wl.rank(query, candidates)`\n*   **模糊去重**：`wl.deduplicate(candidates, threshold=0.5)`\n*   **聚类**：`wl.cluster(candidates, k=3)`\n*   **Top-K 检索**：`wl.topk(query, candidates, k=2)`\n*   **语义文本分割**：`wl.split(long_text, target_size=1536)`\n\n### 3. 自定义维度加载\n\n如需更小的模型体积或更快的速度，可截断嵌入维度（例如降至 64 维）：\n\n```python\n# 加载并截断至 64 维\nwl = WordLlama.load(trunc_dim=64)\nembeddings = wl.embed([\"The quick brown fox\", \"And all that jazz\"])\nprint(embeddings.shape)  # 输出：(2, 64)\n```","某初创公司的数据工程师需要在资源有限的 CPU 服务器上，对每日抓取的数万条新闻标题进行实时去重和语义分类，以构建高质量的训练数据集。\n\n### 没有 WordLlama 时\n- **硬件成本高昂**：传统嵌入模型依赖 GPU 推理，导致服务器采购和维护成本居高不下，难以在边缘设备或低成本云实例上部署。\n- **处理延迟严重**：面对海量文本流，复杂的模型推理速度慢，无法在数据入库前完成实时的模糊去重，导致数据库充斥着大量重复内容。\n- **集成复杂度高**：引入重型 NLP 库需要管理繁琐的依赖环境和庞大的模型文件，增加了运维负担和系统不稳定性。\n- **语义切分困难**：缺乏轻量级的语义分割工具，只能按固定字符数粗暴截断长文本，破坏了内容的逻辑连贯性，影响下游任务效果。\n\n### 使用 WordLlama 后\n- **极致轻量化部署**：WordLlama 专为 CPU 优化且依赖极少，直接利用现有低配服务器即可流畅运行，大幅降低了基础设施开支。\n- **毫秒级实时处理**：凭借简单的 Token 查找和平均池化机制，WordLlama 能快速计算相似度并执行模糊去重，确保数据流的清洁与实时性。\n- **开箱即用的便捷性**：通过简单的 pip 安装即可加载默认模型，无需配置复杂环境，轻松实现文档排序、聚类和 Top-K 检索等功能。\n- **智能语义分割**：利用内置的语义分割算法，WordLlama 能根据内容逻辑自动将长文章切分为连贯片段，显著提升了数据质量。\n\nWordLlama 通过将大模型的 Token 嵌入转化为轻量级工具，让资源受限环境也能拥有高效、精准的语义处理能力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdleemiller_WordLlama_70a56b63.png","dleemiller","Lee 
Miller","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fdleemiller_546688c9.png",null,"https:\u002F\u002Fgithub.com\u002Fdleemiller",[81,85,89,93,97],{"name":82,"color":83,"percentage":84},"Python","#3572A5",66.8,{"name":86,"color":87,"percentage":88},"Cython","#fedf5b",15.8,{"name":90,"color":91,"percentage":92},"Jupyter Notebook","#DA5B0B",11.6,{"name":94,"color":95,"percentage":96},"Shell","#89e051",4.3,{"name":98,"color":99,"percentage":100},"Makefile","#427819",1.4,1454,50,"2026-04-03T15:13:00","MIT",1,"未说明 (基于 Python\u002FNumPy，通常支持 Linux, macOS, Windows)","不需要 GPU。专为 CPU 推理优化，无需 CUDA。","未说明 (模型极小，默认模型仅 16MB，适合资源受限环境)",{"notes":110,"python":111,"dependencies":112},"该工具设计为轻量级，仅依赖 NumPy 进行推理。默认模型大小为 16MB (256 维)，支持从 64 到 1024 维的灵活截断。支持加载 Model2Vec 静态嵌入。适用于资源受限的环境，无需重型深度学习框架（如 PyTorch）即可运行推理任务。","未说明",[113],"numpy",[27,16],"2026-03-27T02:49:30.150509","2026-04-08T05:30:24.430360",[118,123,128,133,137,142],{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},23821,"为什么在 Fedora Linux 上运行时会遇到 'Illegal instruction (core dumped)' 错误？","这通常是因为某些 CPU 不支持 `popcount` 指令。维护者已在新版本中添加了 `numpy>=2` 的要求，以便使用 NumPy 新的位运算后端（`numpy.bitwise_count`）来处理汉明距离，从而兼容更多硬件。请升级您的 numpy 版本至 2.0 以上，或升级到更新的 WordLlama 版本以解决此问题。","https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fissues\u002F10",{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},23822,"wl.embed 和 wl.cluster 功能内存占用过高怎么办？","内存占用与数据集大小及收敛设置有关。建议调整以下参数以平衡性能与资源：\n1. 降低 `max_iterations`（例如设为 10），因为默认的高收敛标准会导致不必要的长时间运行和内存消耗。\n2. 增加容忍度（tolerance），从默认的 1e-4 开始调大，直到达到可接受的延迟。\n3. 
注意惯性（inertia）会随数据点数量增加而增加，因此对于大数据集，过小的容忍度意义不大。尝试多次初始化（initializations）通常比延长单次收敛时间更有效。","https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fissues\u002F17",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},23823,"WordLlama 是否支持多语言模型（如 Llama 3.1）？","是的，WordLlama 现已合并了对 `model2vec` 模型的支持，其中包含推荐使用的多语言模型。您可以参考官方文档中的 'Loading model2vec' 部分来加载和使用这些多语言嵌入模型。此外，Sentence Transformers (sbert) 的 3.2 版本也加强了对此类静态嵌入方法的支持。","https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fissues\u002F30",{"id":134,"question_zh":135,"answer_zh":136,"source_url":127},23824,"如何优化 K-Means 聚类的收敛时间和效果？","对于大规模数据（如 5 万个点），惯性值较大，默认的微小容忍度（1e-4）会导致过度迭代。优化建议如下：\n1. 将 `max_iterations` 设置为您可接受的最长延迟对应的次数。\n2. 逐步增大 `tolerance` 参数，直到获得理想的平均延迟。\n3. 相比于让单次运行长时间收敛，尝试更多的随机初始化（n_iter）通常能获得更好的聚类结果，因为 K-Means 容易陷入局部最优。",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},23825,"在使用 extract_safetensors 提取 Token 嵌入时遇到 FileNotFoundError 如何解决？","该错误通常是因为提供的模型文件路径不正确或文件不存在。请确保：\n1. `model_path` 指向的是完整的 `.safetensors` 文件路径（例如 `model-0001-of-00XX.safetensors`）。\n2. 如果模型被分割成多个分片，可能需要处理所有分片或确认提取脚本是否支持自动合并。\n3. 
检查输出目录权限及路径是否存在。注意：提取脚本可能需要特定的配置名称（如 \"llama3_70B\"）来正确解析张量键。","https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fissues\u002F37",{"id":143,"question_zh":144,"answer_zh":145,"source_url":132},23826,"WordLlama 的文本嵌入过程是如何工作的？对长句子有效吗？","WordLlama 主要通过对令牌嵌入（token embeddings）取平均来生成文本嵌入，过程中不涉及上下文计算。虽然这种方法简单，但对于较长的句子，简单平均可能无法完全表达复杂的语义上下文。不过，结合新的 `model2vec` 支持，可以使用经过蒸馏的多语言模型来获得更好的静态嵌入效果，这在一定程度上弥补了简单平均的不足。",[147,152,157],{"id":148,"version":149,"summary_zh":150,"released_at":151},145404,"v0.4.0","## 变更内容\n* 修复（语义分割器）：返回语义块而不是忽略分析结果，由 @bsmith925 在 https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fpull\u002F51 中完成\n* 功能：工具链现代化，由 @dleemiller 在 https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fpull\u002F52 中完成\n* 功能：将测试转换为 pytest 格式，由 @dleemiller 在 https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fpull\u002F53 中完成\n\n## 新贡献者\n* @bsmith925 在 https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fpull\u002F51 中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fcompare\u002Fv0.3.9...v0.4.0","2025-12-01T21:42:13",{"id":153,"version":154,"summary_zh":155,"released_at":156},145405,"v0.3.9","## 变更内容\n* 功能：标准库键函数，由 @dleemiller 在 https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fpull\u002F48 中实现\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fcompare\u002Fv0.3.8...v0.3.9","2025-02-02T02:54:47",{"id":158,"version":159,"summary_zh":160,"released_at":161},145406,"v0.3.8","## 变更内容\n* 工程优化：导出 WordLlamaInference 类，支持加载本地模型。由 @karavindhan 在 https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fpull\u002F45 中完成。\n* 添加 m2v 加载器。由 @dleemiller 在 https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fpull\u002F47 中完成。\n\n## 新贡献者\n* @karavindhan 在 https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fpull\u002F45 
中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fdleemiller\u002FWordLlama\u002Fcompare\u002Fv0.3.7...v0.3.8","2025-01-05T14:42:54"]