[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-explosion--sense2vec":3,"tool-explosion--sense2vec":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",153609,2,"2026-04-13T11:34:59",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":77,"owner_email":78,"owner_twitter":77,"owner_website":79,"owner_url":80,"languages":81,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":32,"env_os":94,"env_gpu":95,"env_ram":96,"env_deps":97,"category_tags":103,"github_topics":104,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":113,"updated_at":114,"faqs":115,"releases":143},7234,"explosion\u002Fsense2vec","sense2vec","🦆 Contextually-keyed word vectors","sense2vec 是 word2vec 的进阶版本，由 Explosion AI 开发，旨在生成更具语境感知能力的词向量。传统的词嵌入模型往往无法区分多义词在不同场景下的含义，而 sense2vec 巧妙地将词的词性（如名词、动词）和实体标签融入向量表示中。例如，它能精准区分作为“银行”的金融机构和作为“河岸”的地理概念，从而让计算机更准确地理解词语在特定上下文中的语义。\n\n这一特性有效解决了自然语言处理中常见的歧义问题，显著提升了短语相似度计算、语义搜索及实体关联分析的准确性。sense2vec 特别适合 NLP 开发者、数据科学家及人工智能研究人员使用。它不仅能作为独立库加载和查询预训练模型，还能无缝集成到流行的 spaCy 管道中，通过扩展属性直接获取文本片段的向量和相似词推荐。\n\n其技术亮点在于支持基于词性和实体标签的多词短语查询，并允许用户利用原始文本结合 fastText 或 GloVe 训练自定义向量。此外，它还提供了与 Prodigy 
标注工具配合的食谱，方便用户快速构建规则或引导命名实体识别任务。无论是想要探索海量评论数据的语义结构，还是希望为垂直领域模型注入更细腻","sense2vec 是 word2vec 的进阶版本，由 Explosion AI 开发，旨在生成更具语境感知能力的词向量。传统的词嵌入模型往往无法区分多义词在不同场景下的含义，而 sense2vec 巧妙地将词的词性（如名词、动词）和实体标签融入向量表示中。例如，它能精准区分作为“银行”的金融机构和作为“河岸”的地理概念，从而让计算机更准确地理解词语在特定上下文中的语义。\n\n这一特性有效解决了自然语言处理中常见的歧义问题，显著提升了短语相似度计算、语义搜索及实体关联分析的准确性。sense2vec 特别适合 NLP 开发者、数据科学家及人工智能研究人员使用。它不仅能作为独立库加载和查询预训练模型，还能无缝集成到流行的 spaCy 管道中，通过扩展属性直接获取文本片段的向量和相似词推荐。\n\n其技术亮点在于支持基于词性和实体标签的多词短语查询，并允许用户利用原始文本结合 fastText 或 GloVe 训练自定义向量。此外，它还提供了与 Prodigy 标注工具配合的食谱，方便用户快速构建规则或引导命名实体识别任务。无论是想要探索海量评论数据的语义结构，还是希望为垂直领域模型注入更细腻的语义理解能力，sense2vec 都是一个高效且灵活的选择。","\u003Ca href=\"https:\u002F\u002Fexplosion.ai\">\u003Cimg src=\"https:\u002F\u002Fexplosion.ai\u002Fassets\u002Fimg\u002Flogo.svg\" width=\"125\" height=\"125\" align=\"right\" \u002F>\u003C\u002Fa>\n\n# sense2vec: Contextually-keyed word vectors\n\nsense2vec ([Trask et. al](https:\u002F\u002Farxiv.org\u002Fabs\u002F1511.06388), 2015) is a nice\ntwist on [word2vec](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FWord2vec) that lets you learn\nmore interesting and detailed word vectors. This library is a simple Python\nimplementation for loading, querying and training sense2vec models. For more\ndetails, check out\n[our blog post](https:\u002F\u002Fexplosion.ai\u002Fblog\u002Fsense2vec-reloaded). 
To explore the\nsemantic similarities across all Reddit comments of 2015 and 2019, see the\n[interactive demo](https:\u002F\u002Fdemos.explosion.ai\u002Fsense2vec).\n\n🦆 **Version 2.0 (for spaCy v3) out now!**\n[Read the release notes here.](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Freleases\u002F)\n\n[![tests](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Factions\u002Fworkflows\u002Ftests.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Factions\u002Fworkflows\u002Ftests.yml)\n[![Current Release Version](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002Fexplosion\u002Fsense2vec.svg?style=flat-square&logo=github)](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Freleases)\n[![pypi Version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fsense2vec.svg?style=flat-square&logo=pypi&logoColor=white)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fsense2vec\u002F)\n[![Code style: black](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcode%20style-black-000000.svg?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fambv\u002Fblack)\n\n## ✨ Features\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fexplosion_sense2vec_readme_73d80fe6436f.jpg)\n\n- Query **vectors for multi-word phrases** based on part-of-speech tags and\n  entity labels.\n- spaCy **pipeline component** and **extension attributes**.\n- Fully **serializable** so you can easily ship your sense2vec vectors with your\n  spaCy model packages.\n- Optional **caching of nearest neighbors** for super fast \"most similar\"\n  queries.\n- **Train your own vectors** using a pretrained spaCy model, raw text and\n  [GloVe](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002FGloVe) or Word2Vec via\n  [fastText](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText)\n  ([details](#-training-your-own-sense2vec-vectors)).\n- [Prodigy](https:\u002F\u002Fprodi.gy) **annotation recipes** for 
evaluating models,\n  creating lists of similar multi-word phrases and converting them to match\n  patterns, e.g. for rule-based NER or to bootstrap NER annotation\n  ([details & examples](#-prodigy-recipes)).\n\n## 🚀 Quickstart\n\n### Standalone usage\n\n```python\nfrom sense2vec import Sense2Vec\n\ns2v = Sense2Vec().from_disk(\"\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\")\nquery = \"natural_language_processing|NOUN\"\nassert query in s2v\nvector = s2v[query]\nfreq = s2v.get_freq(query)\nmost_similar = s2v.most_similar(query, n=3)\n# [('machine_learning|NOUN', 0.8986967),\n#  ('computer_vision|NOUN', 0.8636297),\n#  ('deep_learning|NOUN', 0.8573361)]\n```\n\n### Usage as a spaCy pipeline component\n\n> ⚠️ Note that this example describes usage with\n> [spaCy v3](https:\u002F\u002Fspacy.io\u002Fusage\u002Fv3). For usage with spaCy v2, download\n> `sense2vec==1.0.3` and check out the\n> [`v1.x`](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Ftree\u002Fv1.x) branch of this\n> repo.\n\n```python\nimport spacy\n\nnlp = spacy.load(\"en_core_web_sm\")\ns2v = nlp.add_pipe(\"sense2vec\")\ns2v.from_disk(\"\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\")\n\ndoc = nlp(\"A sentence about natural language processing.\")\nassert doc[3:6].text == \"natural language processing\"\nfreq = doc[3:6]._.s2v_freq\nvector = doc[3:6]._.s2v_vec\nmost_similar = doc[3:6]._.s2v_most_similar(3)\n# [(('machine learning', 'NOUN'), 0.8986967),\n#  (('computer vision', 'NOUN'), 0.8636297),\n#  (('deep learning', 'NOUN'), 0.8573361)]\n```\n\n### Interactive demos\n\n\u003Cimg width=\"34%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fexplosion_sense2vec_readme_55a216fc50c7.png\" align=\"right\" \u002F>\n\nTo try out our pretrained vectors trained on Reddit comments, check out the\n[interactive sense2vec demo](https:\u002F\u002Fexplosion.ai\u002Fdemos\u002Fsense2vec).\n\nThis repo also includes a [Streamlit](https:\u002F\u002Fstreamlit.io) demo script for\nexploring 
vectors and the most similar phrases. After installing `streamlit`,\nyou can run the script with `streamlit run` and **one or more paths to\npretrained vectors** as **positional arguments** on the command line. For\nexample:\n\n```bash\npip install streamlit\nstreamlit run https:\u002F\u002Fraw.githubusercontent.com\u002Fexplosion\u002Fsense2vec\u002Fmaster\u002Fscripts\u002Fstreamlit_sense2vec.py \u002Fpath\u002Fto\u002Fvectors\n```\n\n### Pretrained vectors\n\nTo use the vectors, download the archive(s) and pass the extracted directory to\n`Sense2Vec.from_disk` or `Sense2VecComponent.from_disk`. The vector files are\n**attached to the GitHub release**. Large files have been split into multi-part\ndownloads.\n\n| Vectors              |   Size | Description                  | 📥 Download (zipped)                                                                                                                                                                                                                                                                                                      |\n| -------------------- | -----: | ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `s2v_reddit_2019_lg` |   4 GB | Reddit comments 2019 (01-07) | [part 1](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Freleases\u002Fdownload\u002Fv1.0.0\u002Fs2v_reddit_2019_lg.tar.gz.001), [part 2](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Freleases\u002Fdownload\u002Fv1.0.0\u002Fs2v_reddit_2019_lg.tar.gz.002), [part 3](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Freleases\u002Fdownload\u002Fv1.0.0\u002Fs2v_reddit_2019_lg.tar.gz.003) |\n| 
`s2v_reddit_2015_md` | 573 MB | Reddit comments 2015         | [part 1](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Freleases\u002Fdownload\u002Fv1.0.0\u002Fs2v_reddit_2015_md.tar.gz)                                                                                                                                                                                                                       |\n\nTo merge the multi-part archives, you can run the following:\n\n```bash\ncat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz\n```\n\n## ⏳ Installation & Setup\n\nsense2vec releases are available on pip:\n\n```bash\npip install sense2vec\n```\n\nTo use pretrained vectors, download\n[one of the vector packages](#pretrained-vectors), unpack the `.tar.gz` archive\nand point `from_disk` to the extracted data directory:\n\n```python\nfrom sense2vec import Sense2Vec\ns2v = Sense2Vec().from_disk(\"\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\")\n```\n\n## 👩‍💻 Usage\n\n### Usage with spaCy v3\n\nThe easiest way to use the library and vectors is to plug it into your spaCy\npipeline. 
The `sense2vec` package exposes a `Sense2VecComponent`, which can be\ninitialised with the shared vocab and added to your spaCy pipeline as a\n[custom pipeline component](https:\u002F\u002Fspacy.io\u002Fusage\u002Fprocessing-pipelines#custom-components).\nBy default, components are added to the _end of the pipeline_, which is the\nrecommended position for this component, since it needs access to the dependency\nparse and, if available, named entities.\n\n```python\nimport spacy\nfrom sense2vec import Sense2VecComponent\n\nnlp = spacy.load(\"en_core_web_sm\")\ns2v = nlp.add_pipe(\"sense2vec\")\ns2v.from_disk(\"\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\")\n```\n\nThe component will add several\n[extension attributes and methods](https:\u002F\u002Fspacy.io\u002Fusage\u002Fprocessing-pipelines#custom-components-attributes)\nto spaCy's `Token` and `Span` objects that let you retrieve vectors and\nfrequencies, as well as most similar terms.\n\n```python\ndoc = nlp(\"A sentence about natural language processing.\")\nassert doc[3:6].text == \"natural language processing\"\nfreq = doc[3:6]._.s2v_freq\nvector = doc[3:6]._.s2v_vec\nmost_similar = doc[3:6]._.s2v_most_similar(3)\n```\n\nFor entities, the entity labels are used as the \"sense\" (instead of the token's\npart-of-speech tag):\n\n```python\ndoc = nlp(\"A sentence about Facebook and Google.\")\nfor ent in doc.ents:\n    assert ent._.in_s2v\n    most_similar = ent._.s2v_most_similar(3)\n```\n\n#### Available attributes\n\nThe following extension attributes are exposed on the `Doc` object via the `._`\nproperty:\n\n| Name          | Attribute Type | Type | Description                                                                         |\n| ------------- | -------------- | ---- | ----------------------------------------------------------------------------------- |\n| `s2v_phrases` | property       | list | All sense2vec-compatible phrases in the given `Doc` (noun phrases, named entities). 
|\n\nThe following attributes are available via the `._` property of `Token` and\n`Span` objects – for example `token._.in_s2v`:\n\n| Name               | Attribute Type | Return Type        | Description                                                                        |\n| ------------------ | -------------- | ------------------ | ---------------------------------------------------------------------------------- |\n| `in_s2v`           | property       | bool               | Whether a key exists in the vector map.                                            |\n| `s2v_key`          | property       | unicode            | The sense2vec key of the given object, e.g. `\"duck\\|NOUN\"`.                        |\n| `s2v_vec`          | property       | `ndarray[float32]` | The vector of the given key.                                                       |\n| `s2v_freq`         | property       | int                | The frequency of the given key.                                                    |\n| `s2v_other_senses` | property       | list               | Available other senses, e.g. `\"duck\\|VERB\"` for `\"duck\\|NOUN\"`.                    |\n| `s2v_most_similar` | method         | list               | Get the `n` most similar terms. Returns a list of `((word, sense), score)` tuples. |\n| `s2v_similarity`   | method         | float              | Get the similarity to another `Token` or `Span`.                                   |\n\n> ⚠️ **A note on span attributes:** Under the hood, entities in `doc.ents` are\n> `Span` objects. This is why the pipeline component also adds attributes and\n> methods to spans and not just tokens. However, it's not recommended to use the\n> sense2vec attributes on arbitrary slices of the document, since the model\n> likely won't have a key for the respective text. 
`Span` objects also don't\n> have a part-of-speech tag, so if no entity label is present, the \"sense\"\n> defaults to the root's part-of-speech tag.\n\n#### Adding sense2vec to a trained pipeline\n\nIf you're training and packaging a spaCy pipeline and want to include a\nsense2vec component in it, you can load in the data via the\n[`[initialize]` block](https:\u002F\u002Fspacy.io\u002Fusage\u002Ftraining#config-lifecycle) of the\ntraining config:\n\n```ini\n[initialize.components]\n\n[initialize.components.sense2vec]\ndata_path = \"\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\"\n```\n\n### Standalone usage\n\nYou can also use the underlying `Sense2Vec` class directly and load in the\nvectors using the `from_disk` method. See below for the available API methods.\n\n```python\nfrom sense2vec import Sense2Vec\ns2v = Sense2Vec().from_disk(\"\u002Fpath\u002Fto\u002Freddit_vectors-1.1.0\")\nmost_similar = s2v.most_similar(\"natural_language_processing|NOUN\", n=10)\n```\n\n> ⚠️ **Important note:** To look up entries in the vectors table, the keys need\n> to follow the scheme of `phrase_text|SENSE` (note the `_` instead of spaces\n> and the `|` before the tag or label) – for example, `machine_learning|NOUN`.\n> Also note that the underlying vector table is case-sensitive.\n\n## 🎛 API\n\n### \u003Ckbd>class\u003C\u002Fkbd> `Sense2Vec`\n\nThe standalone `Sense2Vec` object that holds the vectors, strings and\nfrequencies.\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.__init__`\n\nInitialize the `Sense2Vec` object.\n\n| Argument       | Type                        | Description                                                                                                            |\n| -------------- | --------------------------- | ---------------------------------------------------------------------------------------------------------------------- |\n| `shape`        | tuple                       | The vector shape. Defaults to `(1000, 128)`.                           
                                                |\n| `strings`      | `spacy.strings.StringStore` | Optional string store. Will be created if it doesn't exist.                                                            |\n| `senses`       | list                        | Optional list of all available senses. Used in methods that generate the best sense or other senses.                   |\n| `vectors_name` | unicode                     | Optional name to assign to the `Vectors` table, to prevent clashes. Defaults to `\"sense2vec\"`.                         |\n| `overrides`    | dict                        | Optional custom functions to use, mapped to names registered via the registry, e.g. `{\"make_key\": \"custom_make_key\"}`. |\n| **RETURNS**    | `Sense2Vec`                 | The newly constructed object.                                                                                          |\n\n```python\ns2v = Sense2Vec(shape=(300, 128), senses=[\"VERB\", \"NOUN\"])\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.__len__`\n\nThe number of rows in the vectors table.\n\n| Argument    | Type | Description                              |\n| ----------- | ---- | ---------------------------------------- |\n| **RETURNS** | int  | The number of rows in the vectors table. |\n\n```python\ns2v = Sense2Vec(shape=(300, 128))\nassert len(s2v) == 300\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.__contains__`\n\nCheck if a key is in the vectors table.\n\n| Argument    | Type          | Description                      |\n| ----------- | ------------- | -------------------------------- |\n| `key`       | unicode \u002F int | The key to look up.              |\n| **RETURNS** | bool          | Whether the key is in the table. 
|\n\n```python\ns2v = Sense2Vec(shape=(10, 4))\ns2v.add(\"avocado|NOUN\", numpy.asarray([4, 2, 2, 2], dtype=numpy.float32))\nassert \"avocado|NOUN\" in s2v\nassert \"avocado|VERB\" not in s2v\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.__getitem__`\n\nRetrieve a vector for a given key. Returns `None` if the key is not in the table.\n\n| Argument    | Type            | Description           |\n| ----------- | --------------- | --------------------- |\n| `key`       | unicode \u002F int   | The key to look up.   |\n| **RETURNS** | `numpy.ndarray` | The vector or `None`. |\n\n```python\nvec = s2v[\"avocado|NOUN\"]\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.__setitem__`\n\nSet a vector for a given key. Will raise an error if the key doesn't exist. To\nadd a new entry, use `Sense2Vec.add`.\n\n| Argument | Type            | Description        |\n| -------- | --------------- | ------------------ |\n| `key`    | unicode \u002F int   | The key.           |\n| `vector` | `numpy.ndarray` | The vector to set. |\n\n```python\nvec = s2v[\"avocado|NOUN\"]\ns2v[\"avocado|NOUN\"] = vec\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.add`\n\nAdd a new vector to the table.\n\n| Argument | Type            | Description                                                  |\n| -------- | --------------- | ------------------------------------------------------------ |\n| `key`    | unicode \u002F int   | The key to add.                                              |\n| `vector` | `numpy.ndarray` | The vector to add.                                           |\n| `freq`   | int             | Optional frequency count. Used to find best matching senses. 
|\n\n```python\nvec = s2v[\"avocado|NOUN\"]\ns2v.add(\"🥑|NOUN\", vec, 1234)\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.get_freq`\n\nGet the frequency count for a given key.\n\n| Argument    | Type          | Description                                       |\n| ----------- | ------------- | ------------------------------------------------- |\n| `key`       | unicode \u002F int | The key to look up.                               |\n| `default`   | -             | Default value to return if no frequency is found. |\n| **RETURNS** | int           | The frequency count.                              |\n\n```python\nvec = s2v[\"avocado|NOUN\"]\ns2v.add(\"🥑|NOUN\", vec, 1234)\nassert s2v.get_freq(\"🥑|NOUN\") == 1234\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.set_freq`\n\nSet a frequency count for a given key.\n\n| Argument | Type          | Description                   |\n| -------- | ------------- | ----------------------------- |\n| `key`    | unicode \u002F int | The key to set the count for. |\n| `freq`   | int           | The frequency count.          |\n\n```python\ns2v.set_freq(\"avocado|NOUN\", 104294)\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.__iter__`, `Sense2Vec.items`\n\nIterate over the entries in the vectors table.\n\n| Argument   | Type  | Description                               |\n| ---------- | ----- | ----------------------------------------- |\n| **YIELDS** | tuple | String key and vector pairs in the table. |\n\n```python\nfor key, vec in s2v:\n    print(key, vec)\n\nfor key, vec in s2v.items():\n    print(key, vec)\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.keys`\n\nIterate over the keys in the table.\n\n| Argument   | Type    | Description                   |\n| ---------- | ------- | ----------------------------- |\n| **YIELDS** | unicode | The string keys in the table. 
|\n\n```python\nall_keys = list(s2v.keys())\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.values`\n\nIterate over the vectors in the table.\n\n| Argument   | Type            | Description               |\n| ---------- | --------------- | ------------------------- |\n| **YIELDS** | `numpy.ndarray` | The vectors in the table. |\n\n```python\nall_vecs = list(s2v.values())\n```\n\n#### \u003Ckbd>property\u003C\u002Fkbd> `Sense2Vec.senses`\n\nThe available senses in the table, e.g. `\"NOUN\"` or `\"VERB\"` (added at\ninitialization).\n\n| Argument    | Type | Description           |\n| ----------- | ---- | --------------------- |\n| **RETURNS** | list | The available senses. |\n\n```python\ns2v = Sense2Vec(senses=[\"VERB\", \"NOUN\"])\nassert \"VERB\" in s2v.senses\n```\n\n#### \u003Ckbd>property\u003C\u002Fkbd> `Sense2Vec.frequencies`\n\nThe frequencies of the keys in the table, in descending order.\n\n| Argument    | Type | Description                                        |\n| ----------- | ---- | -------------------------------------------------- |\n| **RETURNS** | list | The `(key, freq)` tuples by frequency, descending. |\n\n```python\nmost_frequent = s2v.frequencies[:10]\nkey, freq = s2v.frequencies[0]\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.similarity`\n\nMake a semantic similarity estimate of two keys or two sets of keys. The default\nestimate is cosine similarity using an average of vectors.\n\n| Argument    | Type                     | Description                         |\n| ----------- | ------------------------ | ----------------------------------- |\n| `keys_a`    | unicode \u002F int \u002F iterable | The string or integer key(s).       |\n| `keys_b`    | unicode \u002F int \u002F iterable | The other string or integer key(s). |\n| **RETURNS** | float                    | The similarity score.               
|\n\n```python\nkeys_a = [\"machine_learning|NOUN\", \"natural_language_processing|NOUN\"]\nkeys_b = [\"computer_vision|NOUN\", \"object_detection|NOUN\"]\nprint(s2v.similarity(keys_a, keys_b))\nassert s2v.similarity(\"machine_learning|NOUN\", \"machine_learning|NOUN\") == 1.0\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.most_similar`\n\nGet the most similar entries in the table. If more than one key is provided, the\naverage of the vectors is used. To make this method faster, see the\n[script for precomputing a cache](scripts\u002F06_precompute_cache.py) of the nearest\nneighbors.\n\n| Argument     | Type                      | Description                                             |\n| ------------ | ------------------------- | ------------------------------------------------------- |\n| `keys`       | unicode \u002F int \u002F iterable  | The string or integer key(s) to compare to.             |\n| `n`          | int                       | The number of similar keys to return. Defaults to `10`. |\n| `batch_size` | int                       | The batch size to use. Defaults to `16`.                |\n| **RETURNS**  | list                      | The `(key, score)` tuples of the most similar vectors.  |\n\n```python\nmost_similar = s2v.most_similar(\"natural_language_processing|NOUN\", n=3)\n# [('machine_learning|NOUN', 0.8986967),\n#  ('computer_vision|NOUN', 0.8636297),\n#  ('deep_learning|NOUN', 0.8573361)]\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.get_other_senses`\n\nFind other entries for the same word with a different sense, e.g. `\"duck|VERB\"`\nfor `\"duck|NOUN\"`.\n\n| Argument      | Type          | Description                                                       |\n| ------------- | ------------- | ----------------------------------------------------------------- |\n| `key`         | unicode \u002F int | The key to check.                                                 
|\n| `ignore_case` | bool          | Check for uppercase, lowercase and titlecase. Defaults to `True`. |\n| **RETURNS**   | list          | The string keys of other entries with different senses.           |\n\n```python\nother_senses = s2v.get_other_senses(\"duck|NOUN\")\n# ['duck|VERB', 'Duck|ORG', 'Duck|VERB', 'Duck|PERSON', 'Duck|ADJ']\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.get_best_sense`\n\nFind the best-matching sense for a given word based on the available senses and\nfrequency counts. Returns `None` if no match is found.\n\n| Argument      | Type    | Description                                                                                             |\n| ------------- | ------- | ------------------------------------------------------------------------------------------------------- |\n| `word`        | unicode | The word to check.                                                                                      |\n| `senses`      | list    | Optional list of senses to limit the search to. If not set \u002F empty, all senses in the vectors are used. |\n| `ignore_case` | bool    | Check for uppercase, lowercase and titlecase. Defaults to `True`.                                       |\n| **RETURNS**   | unicode | The best-matching key or None.                                                                          |\n\n```python\nassert s2v.get_best_sense(\"duck\") == \"duck|NOUN\"\nassert s2v.get_best_sense(\"duck\", [\"VERB\", \"ADJ\"]) == \"duck|VERB\"\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.to_bytes`\n\nSerialize a `Sense2Vec` object to a bytestring.\n\n| Argument    | Type  | Description                               |\n| ----------- | ----- | ----------------------------------------- |\n| `exclude`   | list  | Names of serialization fields to exclude. |\n| **RETURNS** | bytes | The serialized `Sense2Vec` object.        
|\n\n```python\ns2v_bytes = s2v.to_bytes()\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.from_bytes`\n\nLoad a `Sense2Vec` object from a bytestring.\n\n| Argument     | Type        | Description                               |\n| ------------ | ----------- | ----------------------------------------- |\n| `bytes_data` | bytes       | The data to load.                         |\n| `exclude`    | list        | Names of serialization fields to exclude. |\n| **RETURNS**  | `Sense2Vec` | The loaded object.                        |\n\n```python\ns2v_bytes = s2v.to_bytes()\nnew_s2v = Sense2Vec().from_bytes(s2v_bytes)\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.to_disk`\n\nSerialize a `Sense2Vec` object to a directory.\n\n| Argument  | Type             | Description                               |\n| --------- | ---------------- | ----------------------------------------- |\n| `path`    | unicode \u002F `Path` | The path.                                 |\n| `exclude` | list             | Names of serialization fields to exclude. |\n\n```python\ns2v.to_disk(\"\u002Fpath\u002Fto\u002Fsense2vec\")\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2Vec.from_disk`\n\nLoad a `Sense2Vec` object from a directory.\n\n| Argument    | Type             | Description                               |\n| ----------- | ---------------- | ----------------------------------------- |\n| `path`      | unicode \u002F `Path` | The path to load from                     |\n| `exclude`   | list             | Names of serialization fields to exclude. |\n| **RETURNS** | `Sense2Vec`      | The loaded object.                        
|\n\n```python\ns2v.to_disk(\"\u002Fpath\u002Fto\u002Fsense2vec\")\nnew_s2v = Sense2Vec().from_disk(\"\u002Fpath\u002Fto\u002Fsense2vec\")\n```\n\n---\n\n### \u003Ckbd>class\u003C\u002Fkbd> `Sense2VecComponent`\n\nThe pipeline component to add sense2vec to spaCy pipelines.\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2VecComponent.__init__`\n\nInitialize the pipeline component.\n\n| Argument        | Type                 | Description                                                                                                 |\n| --------------- | -------------------- | ----------------------------------------------------------------------------------------------------------- |\n| `vocab`         | `Vocab`              | The shared `Vocab`. Mostly used for the shared `StringStore`.                                               |\n| `shape`         | tuple                | The vector shape.                                                                                           |\n| `merge_phrases` | bool                 | Whether to merge sense2vec phrases into one token. Defaults to `False`.                                     |\n| `lemmatize`     | bool                 | Always look up lemmas if available in the vectors, otherwise default to original word. Defaults to `False`. |\n| `overrides`     | dict                 | Optional custom functions to use, mapped to names registered via the registry, e.g. 
`{\"make_key\": \"custom_make_key\"}`. |\n| **RETURNS**     | `Sense2VecComponent` | The newly constructed object.                                                                               |\n\n```python\ns2v = Sense2VecComponent(nlp.vocab)\n```\n\n#### \u003Ckbd>classmethod\u003C\u002Fkbd> `Sense2VecComponent.from_nlp`\n\nInitialize the component from an nlp object. Mostly used as the component\nfactory for the entry point (see setup.cfg) and to auto-register via the\n`@spacy.component` decorator.\n\n| Argument    | Type                 | Description                   |\n| ----------- | -------------------- | ----------------------------- |\n| `nlp`       | `Language`           | The `nlp` object.             |\n| `**cfg`     | -                    | Optional config parameters.   |\n| **RETURNS** | `Sense2VecComponent` | The newly constructed object. |\n\n```python\ns2v = Sense2VecComponent.from_nlp(nlp)\n```\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2VecComponent.__call__`\n\nProcess a `Doc` object with the component. Typically only called as part of the\nspaCy pipeline and not directly.\n\n| Argument    | Type  | Description              |\n| ----------- | ----- | ------------------------ |\n| `doc`       | `Doc` | The document to process. |\n| **RETURNS** | `Doc` | The processed document.  |\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2VecComponent.init_component`\n\nRegister the component-specific extension attributes. Registration happens here,\nand only once the component is added to the pipeline and used; otherwise, tokens\nwould get the attributes even if the component is only created and not added.\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2VecComponent.to_bytes`\n\nSerialize the component to a bytestring. 
Also called when the component is added\nto the pipeline and you run `nlp.to_bytes`.\n\n| Argument    | Type  | Description               |\n| ----------- | ----- | ------------------------- |\n| **RETURNS** | bytes | The serialized component. |\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2VecComponent.from_bytes`\n\nLoad a component from a bytestring. Also called when you run `nlp.from_bytes`.\n\n| Argument     | Type                 | Description        |\n| ------------ | -------------------- | ------------------ |\n| `bytes_data` | bytes                | The data to load.  |\n| **RETURNS**  | `Sense2VecComponent` | The loaded object. |\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2VecComponent.to_disk`\n\nSerialize the component to a directory. Also called when the component is added\nto the pipeline and you run `nlp.to_disk`.\n\n| Argument | Type             | Description |\n| -------- | ---------------- | ----------- |\n| `path`   | unicode \u002F `Path` | The path.   |\n\n#### \u003Ckbd>method\u003C\u002Fkbd> `Sense2VecComponent.from_disk`\n\nLoad the component from a directory. Also called when you run\n`nlp.from_disk`.\n\n| Argument    | Type                 | Description            |\n| ----------- | -------------------- | ---------------------- |\n| `path`      | unicode \u002F `Path`     | The path to load from. |\n| **RETURNS** | `Sense2VecComponent` | The loaded object.     |\n\n---\n\n### \u003Ckbd>class\u003C\u002Fkbd> `registry`\n\nFunction registry (powered by\n[`catalogue`](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fcatalogue)) for customizing the\nfunctions used to generate keys and phrases. It lets you decorate and name\ncustom functions, swap them out and serialize the custom names when you save out\nthe model. 
The following registry options are available:\n\n| Name                      | Description                                                                                                                                                                                                                                        |\n| ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | \n| `registry.make_key`       | Given a `word` and `sense`, return a string of the key, e.g. `\"word\\|sense\".`                                                                                                                                                                      |\n| `registry.split_key`      | Given a string key, return a `(word, sense)` tuple.                                                                                                                                                                                                |\n| `registry.make_spacy_key` | Given a spaCy object (`Token` or `Span`) and a boolean `prefer_ents` keyword argument (whether to prefer the entity label for single tokens), return a `(word, sense)` tuple. Used in extension attributes to generate a key for tokens and spans. |\n| `registry.get_phrases`    | Given a spaCy `Doc`, return a list of `Span` objects used for sense2vec phrases (typically noun phrases and named entities).                                                                                                                       |\n| `registry.merge_phrases`  | Given a spaCy `Doc`, get all sense2vec phrases and merge them into single tokens.                                                                                                                                                                  
|\n\nEach registry has a `register` method that can be used as a function decorator\nand takes one argument, the name of the custom function.\n\n```python\nfrom sense2vec import registry\n\n@registry.make_key.register(\"custom\")\ndef custom_make_key(word, sense):\n    return f\"{word}###{sense}\"\n\n@registry.split_key.register(\"custom\")\ndef custom_split_key(key):\n    word, sense = key.split(\"###\")\n    return word, sense\n```\n\nWhen initializing the `Sense2Vec` object, you can now pass in a dictionary of\noverrides with the names of your custom registered functions.\n\n```python\nfrom sense2vec import Sense2Vec\n\noverrides = {\"make_key\": \"custom\", \"split_key\": \"custom\"}\ns2v = Sense2Vec(overrides=overrides)\n```\n\nThis makes it easy to experiment with different strategies and to serialize\nthem as plain strings (instead of having to pass around and\u002For pickle the\nfunctions themselves).\n\n## 🚂 Training your own sense2vec vectors\n\nThe [`\u002Fscripts`](\u002Fscripts) directory contains command line utilities for\npreprocessing text and training your own vectors.\n\n### Requirements\n\nTo train your own sense2vec vectors, you'll need the following:\n\n- A **very large** source of raw text (ideally more than you'd use for word2vec,\n  since the senses make the vocabulary more sparse). We recommend at least 1\n  billion words.\n- A [pretrained spaCy model](https:\u002F\u002Fspacy.io\u002Fmodels) that assigns\n  part-of-speech tags, dependencies and named entities, and populates the\n  `doc.noun_chunks`. If the language you need doesn't provide a built-in\n  [syntax iterator for noun phrases](https:\u002F\u002Fspacy.io\u002Fusage\u002Fadding-languages#syntax-iterators),\n  you'll need to write your own. 
(The `doc.noun_chunks` and `doc.ents` are what\n  sense2vec uses to determine what's a phrase.)\n- [GloVe](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002FGloVe) or\n  [fastText](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText) installed and built.\n  You should be able to clone the repo and run `make` in the respective\n  directory.\n\n### Step-by-step process\n\nThe training process is split up into several steps to allow you to resume at\nany given point. Processing scripts are designed to operate on single files,\nmaking it easy to parallelize the work. The scripts in this repo require either\n[GloVe](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002FGloVe) or\n[fastText](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText), which you need to clone\nand build with `make`.\n\nFor fastText, the scripts require the path to the compiled binary. If\nyou're working on Windows, you can build with `cmake`, or alternatively use the\n`.exe` file from this **unofficial** repo with fastText binary builds for\nWindows: https:\u002F\u002Fgithub.com\u002Fxiamx\u002FfastText\u002Freleases.\n\n| Step   | Script                                                                                                                                       | Description                                                                                                                                                                                 |\n| ------ | -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| **1.** | [`01_parse.py`](scripts\u002F01_parse.py)                                                                                                         | Use spaCy to parse 
the raw text and output binary collections of `Doc` objects (see [`DocBin`](https:\u002F\u002Fspacy.io\u002Fapi\u002Fdocbin)).                                                               |\n| **2.** | [`02_preprocess.py`](scripts\u002F02_preprocess.py)                                                                                               | Load a collection of parsed `Doc` objects produced in the previous step and output text files in the sense2vec format (one sentence per line and merged phrases with senses).               |\n| **3.** | [`03_glove_build_counts.py`](scripts\u002F03_glove_build_counts.py)                                                                               | Use [GloVe](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002FGloVe) to build the vocabulary and counts. Skip this step if you're using Word2Vec via [FastText](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText). |\n| **4.** | [`04_glove_train_vectors.py`](scripts\u002F04_glove_train_vectors.py)\u003Cbr \u002F>[`04_fasttext_train_vectors.py`](scripts\u002F04_fasttext_train_vectors.py) | Use [GloVe](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002FGloVe) or [FastText](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText) to train vectors.                                                             |\n| **5.** | [`05_export.py`](scripts\u002F05_export.py)                                                                                                       | Load the vectors and frequencies and output a sense2vec component that can be loaded via `Sense2Vec.from_disk`.                                                                             |\n| **6.** | [`06_precompute_cache.py`](scripts\u002F06_precompute_cache.py)                                                                                   | **Optional:** Precompute nearest-neighbor queries for every entry in the vocab to make `Sense2Vec.most_similar` faster.                                   
                                  |\n\nFor more detailed documentation of the scripts, check out the source or run them\nwith `--help`. For example, `python scripts\u002F01_parse.py --help`.\n\n## 🍳 Prodigy recipes\n\nThis package also seamlessly integrates with the [Prodigy](https:\u002F\u002Fprodi.gy)\nannotation tool and exposes recipes for using sense2vec vectors to quickly\ngenerate lists of multi-word phrases and bootstrap NER annotations. To use a\nrecipe, `sense2vec` needs to be installed in the same environment as Prodigy.\nFor an example of a real-world use case, check out this\n[NER project](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fprojects\u002Ftree\u002Fmaster\u002Fner-fashion-brands)\nwith downloadable datasets.\n\nThe following recipes are available – see below for more detailed docs.\n\n| Recipe                                                              | Description                                                          |\n| ------------------------------------------------------------------- | -------------------------------------------------------------------- |\n| [`sense2vec.teach`](#recipe-sense2vecteach)                         | Bootstrap a terminology list using sense2vec.                        |\n| [`sense2vec.to-patterns`](#recipe-sense2vecto-patterns)             | Convert phrases dataset to token-based match patterns.               |\n| [`sense2vec.eval`](#recipe-sense2veceval)                           | Evaluate a sense2vec model by asking about phrase triples.           |\n| [`sense2vec.eval-most-similar`](#recipe-sense2veceval-most-similar) | Evaluate a sense2vec model by correcting the most similar entries.   |\n| [`sense2vec.eval-ab`](#recipe-sense2veceval-ab)                     | Perform an A\u002FB evaluation of two pretrained sense2vec vector models. |\n\n### \u003Ckbd>recipe\u003C\u002Fkbd> `sense2vec.teach`\n\nBootstrap a terminology list using sense2vec. 
Prodigy will suggest similar terms\nbased on the most similar phrases from sense2vec, and the suggestions will be\nadjusted as you annotate and accept similar phrases. For each seed term, the\nbest matching sense according to the sense2vec vectors will be used.\n\n```bash\nprodigy sense2vec.teach [dataset] [vectors_path] [--seeds] [--threshold]\n[--n-similar] [--batch-size] [--resume]\n```\n\n| Argument             | Type       | Description                               |\n| -------------------- | ---------- | ----------------------------------------- |\n| `dataset`            | positional | Dataset to save annotations to.           |\n| `vectors_path`       | positional | Path to pretrained sense2vec vectors.     |\n| `--seeds`, `-s`      | option     | One or more comma-separated seed phrases. |\n| `--threshold`, `-t`  | option     | Similarity threshold. Defaults to `0.85`. |\n| `--n-similar`, `-n`  | option     | Number of similar items to get at once.   |\n| `--batch-size`, `-b` | option     | Batch size for submitting annotations.    |\n| `--resume`, `-R`     | flag       | Resume from an existing phrases dataset.  |\n\n#### Example\n\n```bash\nprodigy sense2vec.teach tech_phrases \u002Fpath\u002Fto\u002Fs2v_reddit_2015_md \\\n  --seeds \"natural language processing, machine learning, artificial intelligence\"\n```\n\n### \u003Ckbd>recipe\u003C\u002Fkbd> `sense2vec.to-patterns`\n\nConvert a dataset of phrases collected with `sense2vec.teach` to token-based\nmatch patterns that can be used with\n[spaCy's `EntityRuler`](https:\u002F\u002Fspacy.io\u002Fusage\u002Frule-based-matching#entityruler)\nor recipes like `ner.match`. If no output file is specified, the patterns are\nwritten to stdout. 
The examples are tokenized so that multi-token terms are\nrepresented correctly, e.g.:\n`{\"label\": \"SHOE_BRAND\", \"pattern\": [{ \"LOWER\": \"new\" }, { \"LOWER\": \"balance\" }]}`.\n\n```bash\nprodigy sense2vec.to-patterns [dataset] [spacy_model] [label] [--output-file]\n[--case-sensitive] [--dry]\n```\n\n| Argument                  | Type       | Description                                  |\n| ------------------------- | ---------- | -------------------------------------------- |\n| `dataset`                 | positional | Phrase dataset to convert.                   |\n| `spacy_model`             | positional | spaCy model for tokenization.                |\n| `label`                   | positional | Label to apply to all patterns.              |\n| `--output-file`, `-o`     | option     | Optional output file. Defaults to stdout.    |\n| `--case-sensitive`, `-CS` | flag       | Make patterns case-sensitive.                |\n| `--dry`, `-D`             | flag       | Perform a dry run and don't output anything. |\n\n#### Example\n\n```bash\nprodigy sense2vec.to-patterns tech_phrases en_core_web_sm TECHNOLOGY \\\n  --output-file \u002Fpath\u002Fto\u002Fpatterns.jsonl\n```\n\n### \u003Ckbd>recipe\u003C\u002Fkbd> `sense2vec.eval`\n\nEvaluate a sense2vec model by asking about phrase triples: is word A more\nsimilar to word B, or to word C? If the annotator mostly agrees with the model,\nthat suggests the vectors are good. 
The recipe will only ask about vectors with the same\nsense and supports different example selection strategies.\n\n```bash\nprodigy sense2vec.eval [dataset] [vectors_path] [--strategy] [--senses]\n[--exclude-senses] [--n-freq] [--threshold] [--batch-size] [--eval-whole]\n[--eval-only] [--show-scores]\n```\n\n| Argument                  | Type       | Description                                                                                                   |\n| ------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------- |\n| `dataset`                 | positional | Dataset to save annotations to.                                                                               |\n| `vectors_path`            | positional | Path to pretrained sense2vec vectors.                                                                         |\n| `--strategy`, `-st`       | option     | Example selection strategy: `most_similar` (default), `most_least_similar` or `random`.                       |\n| `--senses`, `-s`          | option     | Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used. |\n| `--exclude-senses`, `-es` | option     | Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults.        |\n| `--n-freq`, `-f`          | option     | Number of most frequent entries to limit to.                                                                  |\n| `--threshold`, `-t`       | option     | Minimum similarity threshold to consider examples.                                                            |\n| `--batch-size`, `-b`      | option     | Batch size to use.                                                                                            |\n| `--eval-whole`, `-E`      | flag       | Evaluate the whole dataset instead of the current session.                            
                        |\n| `--eval-only`, `-O`       | flag       | Don't annotate, only evaluate the current dataset.                                                            |\n| `--show-scores`, `-S`     | flag       | Show all scores for debugging.                                                                                |\n\n#### Strategies\n\n| Name                 | Description                                                                                                                                                           |\n| -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `most_similar`       | Pick a random word from a random sense and get its most similar entries of the same sense. Ask about the similarity to the last and middle entry from that selection. |\n| `most_least_similar` | Pick a random word from a random sense and get the least similar entry from its most similar entries, and then the last most similar entry of that.                   |\n| `random`             | Pick a random sample of 3 words from the same random sense.                                                                                                           
|\n\n#### Example\n\n```bash\nprodigy sense2vec.eval vectors_eval \u002Fpath\u002Fto\u002Fs2v_reddit_2015_md \\\n  --senses NOUN,ORG,PRODUCT --threshold 0.5\n```\n\n![UI preview of sense2vec.eval](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fexplosion_sense2vec_readme_50920dc027ab.png)\n\n### \u003Ckbd>recipe\u003C\u002Fkbd> `sense2vec.eval-most-similar`\n\nEvaluate a vectors model by looking at the most similar entries it returns for a\nrandom phrase and unselecting the mistakes.\n\n```bash\nprodigy sense2vec.eval-most-similar [dataset] [vectors_path] [--senses] [--exclude-senses]\n[--n-freq] [--n-similar] [--batch-size] [--eval-whole] [--eval-only]\n[--show-scores]\n```\n\n| Argument                  | Type       | Description                                                                                                   |\n| ------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------- |\n| `dataset`                 | positional | Dataset to save annotations to.                                                                               |\n| `vectors_path`            | positional | Path to pretrained sense2vec vectors.                                                                         |\n| `--senses`, `-s`          | option     | Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used. |\n| `--exclude-senses`, `-es` | option     | Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults.        |\n| `--n-freq`, `-f`          | option     | Number of most frequent entries to limit to.                                                                  |\n| `--n-similar`, `-n`       | option     | Number of similar items to check. Defaults to `10`.                                                           |\n| `--batch-size`, `-b`      | option     | Batch size to use.                
                                                                            |\n| `--eval-whole`, `-E`      | flag       | Evaluate the whole dataset instead of the current session.                                                    |\n| `--eval-only`, `-O`       | flag       | Don't annotate, only evaluate the current dataset.                                                            |\n| `--show-scores`, `-S`     | flag       | Show all scores for debugging.                                                                                |\n\n#### Example\n\n```bash\nprodigy sense2vec.eval-most-similar vectors_eval_sim \u002Fpath\u002Fto\u002Fs2v_reddit_2015_md \\\n  --senses NOUN,ORG,PRODUCT\n```\n\n### \u003Ckbd>recipe\u003C\u002Fkbd> `sense2vec.eval-ab`\n\nPerform an A\u002FB evaluation of two pretrained sense2vec vector models by comparing\nthe most similar entries they return for a random phrase. The UI shows two\nrandomized options with the most similar entries of each model and highlights\nthe phrases that differ. At the end of the annotation session, the overall stats\nand preferred model are shown.\n\n```bash\nprodigy sense2vec.eval-ab [dataset] [vectors_path_a] [vectors_path_b] [--senses]\n[--exclude-senses] [--n-freq] [--n-similar] [--batch-size] [--eval-whole]\n[--eval-only] [--show-mapping]\n```\n\n| Argument                  | Type       | Description                                                                                                   |\n| ------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------- |\n| `dataset`                 | positional | Dataset to save annotations to.                                                                               |\n| `vectors_path_a`          | positional | Path to the first set of pretrained sense2vec vectors.                                                        
|\n| `vectors_path_b`          | positional | Path to the second set of pretrained sense2vec vectors.                                                       |\n| `--senses`, `-s`          | option     | Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used. |\n| `--exclude-senses`, `-es` | option     | Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults.        |\n| `--n-freq`, `-f`          | option     | Number of most frequent entries to limit to.                                                                  |\n| `--n-similar`, `-n`       | option     | Number of similar items to check. Defaults to `10`.                                                           |\n| `--batch-size`, `-b`      | option     | Batch size to use.                                                                                            |\n| `--eval-whole`, `-E`      | flag       | Evaluate the whole dataset instead of the current session.                                                    |\n| `--eval-only`, `-O`       | flag       | Don't annotate, only evaluate the current dataset.                                                            |\n| `--show-mapping`, `-S`    | flag       | Show which models are option 1 and option 2 in the UI (for debugging).                                        |\n\n#### Example\n\n```bash\nprodigy sense2vec.eval-ab vectors_eval_sim \u002Fpath\u002Fto\u002Fs2v_reddit_2015_md \u002Fpath\u002Fto\u002Fs2v_reddit_2019_md --senses NOUN,ORG,PRODUCT\n```\n\n![UI preview of sense2vec.eval-ab](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fexplosion_sense2vec_readme_c84126360637.png)\n\n## Pretrained vectors\n\nThe pretrained Reddit vectors support the following \"senses\", either\npart-of-speech tags or entity labels. 
For more details, see spaCy's\n[annotation scheme overview](https:\u002F\u002Fspacy.io\u002Fapi\u002Fannotation).\n\n| Tag     | Description               | Examples                             |\n| ------- | ------------------------- | ------------------------------------ |\n| `ADJ`   | adjective                 | big, old, green                      |\n| `ADP`   | adposition                | in, to, during                       |\n| `ADV`   | adverb                    | very, tomorrow, down, where          |\n| `AUX`   | auxiliary                 | is, has (done), will (do)            |\n| `CONJ`  | conjunction               | and, or, but                         |\n| `DET`   | determiner                | a, an, the                           |\n| `INTJ`  | interjection              | psst, ouch, bravo, hello             |\n| `NOUN`  | noun                      | girl, cat, tree, air, beauty         |\n| `NUM`   | numeral                   | 1, 2017, one, seventy-seven, MMXIV   |\n| `PART`  | particle                  | 's, not                              |\n| `PRON`  | pronoun                   | I, you, he, she, myself, somebody    |\n| `PROPN` | proper noun               | Mary, John, London, NATO, HBO        |\n| `PUNCT` | punctuation               | , ? ( )                              |\n| `SCONJ` | subordinating conjunction | if, while, that                      |\n| `SYM`   | symbol                    | \\$, %, =, :), 😝                     |\n| `VERB`  | verb                      | run, runs, running, eat, ate, eating |\n\n| Entity Label  | Description                                          |\n| ------------- | ---------------------------------------------------- |\n| `PERSON`      | People, including fictional.                         |\n| `NORP`        | Nationalities or religious or political groups.      |\n| `FACILITY`    | Buildings, airports, highways, bridges, etc.         |\n| `ORG`         | Companies, agencies, institutions, etc.              
|\n| `GPE`         | Countries, cities, states.                           |\n| `LOC`         | Non-GPE locations, mountain ranges, bodies of water. |\n| `PRODUCT`     | Objects, vehicles, foods, etc. (Not services.)       |\n| `EVENT`       | Named hurricanes, battles, wars, sports events, etc. |\n| `WORK_OF_ART` | Titles of books, songs, etc.                         |\n| `LANGUAGE`    | Any named language.                                  |\n","\u003Ca href=\"https:\u002F\u002Fexplosion.ai\">\u003Cimg src=\"https:\u002F\u002Fexplosion.ai\u002Fassets\u002Fimg\u002Flogo.svg\" width=\"125\" height=\"125\" align=\"right\" \u002F>\u003C\u002Fa>\n\n# sense2vec：基于上下文的词向量\n\nsense2vec（[Trask 等人](https:\u002F\u002Farxiv.org\u002Fabs\u002F1511.06388), 2015）是对 [word2vec](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FWord2vec) 的一种巧妙改进，能够学习到更加丰富和细致的词向量。本库是一个简单的 Python 实现，用于加载、查询和训练 sense2vec 模型。更多详情请参阅\n[我们的博客文章](https:\u002F\u002Fexplosion.ai\u002Fblog\u002Fsense2vec-reloaded)。若想探索 2015 年和 2019 年所有 Reddit 评论之间的语义相似性，请访问\n[交互式演示](https:\u002F\u002Fdemos.explosion.ai\u002Fsense2vec)。\n\n🦆 **版本 2.0（适用于 spaCy v3）现已发布！**\n[请在此处阅读发行说明。](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Freleases\u002F)\n\n[![测试](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Factions\u002Fworkflows\u002Ftests.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Factions\u002Fworkflows\u002Ftests.yml)\n[![当前发布版本](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002Fexplosion\u002Fsense2vec.svg?style=flat-square&logo=github)](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Freleases)\n[![PyPI 
版本](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fsense2vec.svg?style=flat-square&logo=pypi&logoColor=white)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fsense2vec\u002F)\n[![代码风格：black](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcode%20style-black-000000.svg?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fambv\u002Fblack)\n\n## ✨ 功能特性\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fexplosion_sense2vec_readme_73d80fe6436f.jpg)\n\n- 基于词性标注和实体标签，查询**多词短语的向量**。\n- spaCy **管道组件**和**扩展属性**。\n- 完全**可序列化**，因此您可以轻松地将 sense2vec 向量与您的 spaCy 模型包一起分发。\n- 可选的**最近邻缓存**，以实现超快速的“最相似”查询。\n- 使用预训练的 spaCy 模型、原始文本以及通过 [fastText](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText) 提供的 [GloVe](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002FGloVe) 或 Word2Vec，\n  **训练您自己的向量**（[详情](#-training-your-own-sense2vec-vectors)）。\n- [Prodigy](https:\u002F\u002Fprodi.gy) **注释配方**，用于评估模型、创建相似的多词短语列表，并将其转换为匹配模式，例如用于基于规则的 NER 或启动 NER 注释工作（[详情及示例](#-prodigy-recipes)）。\n\n## 🚀 快速入门\n\n### 独立使用\n\n```python\nfrom sense2vec import Sense2Vec\n\ns2v = Sense2Vec().from_disk(\"\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\")\nquery = \"natural_language_processing|NOUN\"\nassert query in s2v\nvector = s2v[query]\nfreq = s2v.get_freq(query)\nmost_similar = s2v.most_similar(query, n=3)\n# [('machine_learning|NOUN', 0.8986967),\n#  ('computer_vision|NOUN', 0.8636297),\n#  ('deep_learning|NOUN', 0.8573361)]\n```\n\n### 作为 spaCy 管道组件使用\n\n> ⚠️ 请注意，此示例描述的是与\n> [spaCy v3](https:\u002F\u002Fspacy.io\u002Fusage\u002Fv3) 的用法。若要与 spaCy v2 配合使用，请下载\n> `sense2vec==1.0.3` 并查看此仓库的\n> [`v1.x`](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Ftree\u002Fv1.x) 分支。\n\n```python\nimport spacy\n\nnlp = spacy.load(\"en_core_web_sm\")\ns2v = nlp.add_pipe(\"sense2vec\")\ns2v.from_disk(\"\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\")\n\ndoc = nlp(\"A sentence about natural language processing.\")\nassert doc[3:6].text == \"natural language processing\"\nfreq = 
doc[3:6]._.s2v_freq\nvector = doc[3:6]._.s2v_vec\nmost_similar = doc[3:6]._.s2v_most_similar(3)\n# [(('machine learning', 'NOUN'), 0.8986967),\n#  (('computer vision', 'NOUN'), 0.8636297),\n#  (('deep learning', 'NOUN'), 0.8573361)]\n```\n\n### 交互式演示\n\n\u003Cimg width=\"34%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fexplosion_sense2vec_readme_55a216fc50c7.png\" align=\"right\" \u002F>\n\n要试用我们在 Reddit 评论上训练的预训练向量，请访问\n[交互式 sense2vec 演示](https:\u002F\u002Fexplosion.ai\u002Fdemos\u002Fsense2vec)。\n\n该仓库还包含一个 [Streamlit](https:\u002F\u002Fstreamlit.io) 演示脚本，用于探索向量和最相似的短语。安装 `streamlit` 后，您可以通过运行 `streamlit run` 命令，并在命令行中提供\n**一个或多个预训练向量的路径**作为**位置参数**来运行该脚本。例如：\n\n```bash\npip install streamlit\nstreamlit run https:\u002F\u002Fraw.githubusercontent.com\u002Fexplosion\u002Fsense2vec\u002Fmaster\u002Fscripts\u002Fstreamlit_sense2vec.py \u002Fpath\u002Fto\u002Fvectors\n```\n\n### 预训练向量\n\n要使用这些向量，您需要下载归档文件，并将解压后的目录传递给 `Sense2Vec.from_disk` 或 `Sense2VecComponent.from_disk`。向量文件已**附加到 GitHub 发布页面**。较大的文件已被拆分为多部分下载。\n\n| 向量              |   大小 | 描述                  | 📥 下载（压缩包）                                                                                                                                                                                                                                                                                                      |\n| -------------------- | -----: | ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `s2v_reddit_2019_lg` |   4 GB | Reddit 评论 2019（01–07） | [第1部分](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Freleases\u002Fdownload\u002Fv1.0.0\u002Fs2v_reddit_2019_lg.tar.gz.001), 
[第2部分](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Freleases\u002Fdownload\u002Fv1.0.0\u002Fs2v_reddit_2019_lg.tar.gz.002), [第3部分](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Freleases\u002Fdownload\u002Fv1.0.0\u002Fs2v_reddit_2019_lg.tar.gz.003) |\n| `s2v_reddit_2015_md` | 573 MB | Reddit 评论 2015         | [第1部分](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Freleases\u002Fdownload\u002Fv1.0.0\u002Fs2v_reddit_2015_md.tar.gz)                                                                                                                                                                                                                       |\n\n要合并多部分归档文件，您可以运行以下命令：\n\n```bash\ncat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz\n```\n\n## ⏳ 安装与设置\n\nsense2vec 的发布版本可在 PyPI 上获取：\n\n```bash\npip install sense2vec\n```\n\n要使用预训练向量，您需要下载\n[其中一个向量包](#pretrained-vectors)，解压 `.tar.gz` 归档文件，并将 `from_disk` 指向解压后的数据目录：\n\n```python\nfrom sense2vec import Sense2Vec\ns2v = Sense2Vec().from_disk(\"\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\")\n```\n\n## 👩‍💻 使用方法\n\n### 与 spaCy v3 的使用\n\n使用该库和向量的最简单方法是将其集成到你的 spaCy 管道中。`sense2vec` 包公开了一个 `Sense2VecComponent`，它可以使用共享词汇表进行初始化，并作为[自定义管道组件](https:\u002F\u002Fspacy.io\u002Fusage\u002Fprocessing-pipelines#custom-components)添加到你的 spaCy 管道中。默认情况下，组件会被添加到管道的_末尾_，这也是推荐的位置，因为该组件需要访问依存句法分析结果，以及命名实体（如果有的话）。\n\n```python\nimport spacy\nfrom sense2vec import Sense2VecComponent\n\nnlp = spacy.load(\"en_core_web_sm\")\ns2v = nlp.add_pipe(\"sense2vec\")\ns2v.from_disk(\"\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\")\n```\n\n该组件会为 spaCy 的 `Token` 和 `Span` 对象添加若干\n[扩展属性和方法](https:\u002F\u002Fspacy.io\u002Fusage\u002Fprocessing-pipelines#custom-components-attributes)，使你可以检索向量和频率，以及找到最相似的术语。\n\n```python\ndoc = nlp(\"A sentence about natural language processing.\")\nassert doc[3:6].text == \"natural language processing\"\nfreq = doc[3:6]._.s2v_freq\nvector = doc[3:6]._.s2v_vec\nmost_similar = 
doc[3:6]._.s2v_most_similar(3)\n```\n\n对于实体，实体标签会被用作“语义”（而不是词性标签）：\n\n```python\ndoc = nlp(\"A sentence about Facebook and Google.\")\nfor ent in doc.ents:\n    assert ent._.in_s2v\n    most_similar = ent._.s2v_most_similar(3)\n```\n\n#### 可用属性\n\n以下扩展属性通过 `._` 属性在 `Doc` 对象上暴露：\n\n| 名称          | 属性类型 | 类型   | 描述                                                                         |\n| ------------- | ---------- | ------ | ---------------------------------------------------------------------------- |\n| `s2v_phrases` | 属性       | 列表   | 给定 `Doc` 中所有与 sense2vec 兼容的短语（名词短语、命名实体）。             |\n\n以下属性可通过 `Token` 和 `Span` 对象的 `._` 属性访问，例如 `token._.in_s2v`：\n\n| 名称               | 属性类型 | 返回类型        | 描述                                                                        |\n| ------------------ | ---------- | --------------- | --------------------------------------------------------------------------- |\n| `in_s2v`           | 属性       | 布尔值          | 键是否存在于向量映射中。                                                    |\n| `s2v_key`          | 属性       | Unicode         | 给定对象的 sense2vec 键，例如 `\"duck\\|NOUN\"`。                              |\n| `s2v_vec`          | 属性       | `ndarray[float32]` | 给定键的向量。                                                              |\n| `s2v_freq`         | 属性       | 整数            | 给定键的频率。                                                              |\n| `s2v_other_senses` | 属性       | 列表            | 可用的其他语义，例如对于 `\"duck\\|NOUN\"` 的 `\"duck\\|VERB\"`。                |\n| `s2v_most_similar` | 方法       | 列表            | 获取最相似的 `n` 个术语。返回一个 `((word, sense), score)` 元组列表。      |\n| `s2v_similarity`   | 方法       | 浮点数          | 计算与另一个 `Token` 或 `Span` 的相似度。                                    |\n\n> ⚠️ **关于 span 属性的说明：** 在底层，`doc.ents` 中的实体实际上是 `Span` 对象。这就是为什么管道组件不仅为 token、也为 span 添加属性和方法的原因。然而，不建议对文档的任意切片使用 sense2vec 属性，因为模型很可能没有对应文本的键。此外，`Span` 对象没有词性标签，因此如果没有实体标签，“语义”会默认使用根节点（root）的词性标签。\n\n#### 将 sense2vec 
添加到训练好的管道\n\n如果你正在训练并打包一个 spaCy 管道，并希望在其中包含 sense2vec 组件，可以通过训练配置中的\n[`[initialize]` 块](https:\u002F\u002Fspacy.io\u002Fusage\u002Ftraining#config-lifecycle)加载数据：\n\n```ini\n[initialize.components]\n\n[initialize.components.sense2vec]\ndata_path = \"\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\"\n```\n\n### 独立使用\n\n你也可以直接使用底层的 `Sense2Vec` 类，并使用 `from_disk` 方法加载向量。可用的 API 方法见下文。\n\n```python\nfrom sense2vec import Sense2Vec\ns2v = Sense2Vec().from_disk(\"\u002Fpath\u002Fto\u002Freddit_vectors-1.1.0\")\nmost_similar = s2v.most_similar(\"natural_language_processing|NOUN\", n=10)\n```\n\n> ⚠️ **重要提示：** 要在向量表中查找条目，键必须遵循 `phrase_text|SENSE` 的格式（注意使用 `_` 而不是空格，以及在标签或类别前使用 `|`），例如 `machine_learning|NOUN`。另外请注意，底层向量表是区分大小写的。\n\n## 🎛 API\n\n### \u003Ckbd>类\u003C\u002Fkbd> `Sense2Vec`\n\n独立的 `Sense2Vec` 对象，用于存储向量、字符串和频率。\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.__init__`\n\n初始化 `Sense2Vec` 对象。\n\n| 参数       | 类型                        | 描述                                                                                                            |\n| ---------- | --------------------------- | ---------------------------------------------------------------------------------------------------------------- |\n| `shape`    | 元组                        | 向量形状。默认值为 `(1000, 128)`。                                                                               |\n| `strings`  | `spacy.strings.StringStore` | 可选的字符串存储。如果不存在，则会自动创建。                                                                    |\n| `senses`   | 列表                        | 可选的全部可用语义列表。用于生成最佳语义或其他语义的方法中。                                                   |\n| `vectors_name` | Unicode                 | 可选的向量表名称，用于避免冲突。默认值为 `\"sense2vec\"`。                                                       |\n| `overrides` | 字典                        | 可选的自定义函数，映射到注册表中已注册的名称，例如 `{\"make_key\": \"custom_make_key\"}`。                     |\n| **返回值** | `Sense2Vec`                 | 新构造的对象。                                              
                                                    |\n\n```python\ns2v = Sense2Vec(shape=(300, 128), senses=[\"VERB\", \"NOUN\"])\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.__len__`\n\n向量表中的行数。\n\n| 参数    | 类型 | 描述                              |\n| ------- | ---- | --------------------------------- |\n| **返回值** | 整数  | 向量表中的行数。                  |\n\n```python\ns2v = Sense2Vec(shape=(300, 128))\nassert len(s2v) == 300\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.__contains__`\n\n检查键是否存在于向量表中。\n\n| 参数    | 类型          | 描述                      |\n| ----------- | ------------- | -------------------------------- |\n| `key`       | unicode \u002F int | 要查找的键。              |\n| **返回值** | bool          | 键是否在表中。 |\n\n```python\nimport numpy\n\ns2v = Sense2Vec(shape=(10, 4))\ns2v.add(\"avocado|NOUN\", numpy.asarray([4, 2, 2, 2], dtype=numpy.float32))\nassert \"avocado|NOUN\" in s2v\nassert \"avocado|VERB\" not in s2v\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.__getitem__`\n\n根据给定的键检索向量。如果键不在表中，则返回 `None`。\n\n| 参数    | 类型            | 描述           |\n| ----------- | --------------- | --------------------- |\n| `key`       | unicode \u002F int   | 要查找的键。   |\n| **返回值** | `numpy.ndarray` | 向量或 `None`。 |\n\n```python\nvec = s2v[\"avocado|NOUN\"]\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.__setitem__`\n\n为给定的键设置向量。如果键不存在，则会引发错误。要添加新条目，请使用 `Sense2Vec.add`。\n\n| 参数 | 类型            | 描述        |\n| -------- | --------------- | ---------- |\n| `key`    | unicode \u002F int   | 键。       |\n| `vector` | `numpy.ndarray` | 要设置的向量。 |\n\n```python\nvec = s2v[\"avocado|NOUN\"]\ns2v[\"avocado|NOUN\"] = vec\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.add`\n\n向表中添加一个新的向量。\n\n| 参数 | 类型            | 描述                                                  |\n| -------- | --------------- | ------------------------------------------------------------ |\n| `key`    | unicode \u002F int   | 要添加的键。                                              |\n| `vector` | 
`numpy.ndarray` | 要添加的向量。                                           |\n| `freq`   | int             | 可选的频率计数。用于找到最佳匹配的词义。 |\n\n```python\nvec = s2v[\"avocado|NOUN\"]\ns2v.add(\"🥑|NOUN\", vec, 1234)\n```\n\n#### \u003Ckbd> 方法 \u003C\u002Fkbd> `Sense2Vec.get_freq`\n\n获取给定键的频率计数。\n\n| 参数    | 类型          | 描述                                       |\n| ----------- | ------------- | ------------------------------------------------- |\n| `key`       | unicode \u002F int | 要查找的键。                               |\n| `default`   | -             | 如果未找到频率，则返回的默认值。         |\n| **返回值** | int           | 频率计数。                              |\n\n```python\nvec = s2v[\"avocado|NOUN\"]\ns2v.add(\"🥑|NOUN\", vec, 1234)\nassert s2v.get_freq(\"🥑|NOUN\") == 1234\n```\n\n#### \u003Ckbd> 方法 \u003C\u002Fkbd> `Sense2Vec.set_freq`\n\n为给定的键设置频率计数。\n\n| 参数 | 类型          | 描述                   |\n| -------- | ------------- | ----------------------------- |\n| `key`    | unicode \u002F int | 要设置频率的键。 |\n| `freq`   | int           | 频率计数。          |\n\n```python\ns2v.set_freq(\"avocado|NOUN\", 104294)\n```\n\n#### \u003Ckbd> 方法 \u003C\u002Fkbd> `Sense2Vec.__iter__`, `Sense2Vec.items`\n\n遍历向量表中的条目。\n\n| 参数   | 类型  | 描述                               |\n| ---------- | ----- | ----------------------------------------- |\n| **产出** | tuple | 表中的字符串键和向量对。 |\n\n```python\nfor key, vec in s2v:\n    print(key, vec)\n\nfor key, vec in s2v.items():\n    print(key, vec)\n```\n\n#### \u003Ckbd> 方法 \u003C\u002Fkbd> `Sense2Vec.keys`\n\n遍历表中的键。\n\n| 参数   | 类型    | 描述                   |\n| ---------- | ------- | ----------------------------- |\n| **产出** | unicode | 表中的字符串键。 |\n\n```python\nall_keys = list(s2v.keys())\n```\n\n#### \u003Ckbd> 方法 \u003C\u002Fkbd> `Sense2Vec.values`\n\n遍历表中的向量。\n\n| 参数   | 类型            | 描述               |\n| ---------- | --------------- | ------------------------- |\n| **产出** | `numpy.ndarray` | 表中的向量。 |\n\n```python\nall_vecs = list(s2v.values())\n```\n\n#### \u003Ckbd> 属性 
\u003C\u002Fkbd> `Sense2Vec.senses`\n\n表中可用的词义，例如 `\"NOUN\"` 或 `\"VERB\"`（在初始化时添加）。\n\n| 参数    | 类型 | 描述           |\n| ----------- | ---- | --------------------- |\n| **返回值** | list | 可用的词义。 |\n\n```python\ns2v = Sense2Vec(senses=[\"VERB\", \"NOUN\"])\nassert \"VERB\" in s2v.senses\n```\n\n#### \u003Ckbd>属性\u003C\u002Fkbd> `Sense2Vec.frequencies`\n\n表中键的频率，按降序排列。\n\n| 参数    | 类型 | 描述                                        |\n| ----------- | ---- | -------------------------------------------------- |\n| **返回值** | list | 按频率降序排列的 `(key, freq)` 元组。 |\n\n```python\nmost_frequent = s2v.frequencies[:10]\nkey, freq = s2v.frequencies[0]\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.similarity`\n\n估计两个键或两组键之间的语义相似度。默认的估计方式是对各组向量取平均值后计算余弦相似度。\n\n| 参数    | 类型                     | 描述                         |\n| ----------- | ------------------------ | ----------------------------------- |\n| `keys_a`    | unicode \u002F int \u002F iterable | 字符串或整数键。       |\n| `keys_b`    | unicode \u002F int \u002F iterable | 另一组字符串或整数键。 |\n| **返回值** | float                    | 相似度分数。               |\n\n```python\nkeys_a = [\"machine_learning|NOUN\", \"natural_language_processing|NOUN\"]\nkeys_b = [\"computer_vision|NOUN\", \"object_detection|NOUN\"]\nprint(s2v.similarity(keys_a, keys_b))\nassert s2v.similarity(\"machine_learning|NOUN\", \"machine_learning|NOUN\") == 1.0\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.most_similar`\n\n获取表中最相似的条目。如果提供了多个键，则使用向量的平均值。为了加快此方法的速度，可以使用 [预计算缓存的脚本](scripts\u002F06_precompute_cache.py) 预先计算最近邻。\n\n| 参数     | 类型                      | 描述                                             |\n| ------------ | ------------------------- | ------------------------------------------------------- |\n| `keys`       | unicode \u002F int \u002F iterable  | 要比较的字符串或整数键。             |\n| `n`          | int                       | 返回的相似键数量。默认为 `10`。                |\n| `batch_size` | int                       | 使用的批量大小。默认为 `16`。                  |\n| 
**返回值**  | list                      | 最相似向量的 `(key, score)` 元组。  |\n\n```python\nmost_similar = s2v.most_similar(\"natural_language_processing|NOUN\", n=3)\n# [('machine_learning|NOUN', 0.8986967),\n#  ('computer_vision|NOUN', 0.8636297),\n#  ('deep_learning|NOUN', 0.8573361)]\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.get_other_senses`\n\n查找同一单词但具有不同语义的其他条目，例如针对 `\"duck|NOUN\"` 的 `\"duck|VERB\"`。\n\n| 参数        | 类型          | 描述                                                       |\n| ------------- | ------------- | ---------------------------------------------------------- |\n| `key`         | unicode \u002F int | 要检查的键。                                               |\n| `ignore_case` | bool          | 检查大写、小写和首字母大写形式。默认值为 `True`。         |\n| **返回值**   | list          | 具有不同语义的其他条目的字符串键。                         |\n\n```python\nother_senses = s2v.get_other_senses(\"duck|NOUN\")\n# ['duck|VERB', 'Duck|ORG', 'Duck|VERB', 'Duck|PERSON', 'Duck|ADJ']\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.get_best_sense`\n\n根据可用的语义和频率计数，为给定单词找到最佳匹配的语义。如果没有找到匹配项，则返回 `None`。\n\n| 参数        | 类型    | 描述                                                                                             |\n| ------------- | ------- | ------------------------------------------------------------------------------------------------------- |\n| `word`        | unicode | 需要检查的单词。                                                                                      |\n| `senses`      | list    | 可选的语义列表，用于限制搜索范围。如果未设置或为空，则使用向量中的所有语义。                     |\n| `ignore_case` | bool    | 检查大写、小写和首字母大写形式。默认值为 `True`。                                                 |\n| **返回值**   | unicode | 最佳匹配的键，或 `None`。                                                                          |\n\n```python\nassert s2v.get_best_sense(\"duck\") == \"duck|NOUN\"\nassert s2v.get_best_sense(\"duck\", [\"VERB\", \"ADJ\"]) == \"duck|VERB\"\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.to_bytes`\n\n将 
`Sense2Vec` 对象序列化为字节串。\n\n| 参数        | 类型  | 描述                               |\n| ----------- | ----- | ----------------------------------- |\n| `exclude`   | list  | 要排除的序列化字段名称。           |\n| **返回值**  | bytes | 序列化的 `Sense2Vec` 对象。         |\n\n```python\ns2v_bytes = s2v.to_bytes()\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.from_bytes`\n\n从字节串中加载 `Sense2Vec` 对象。\n\n| 参数         | 类型        | 描述                               |\n| ------------ | ----------- | ----------------------------------- |\n| `bytes_data` | bytes       | 要加载的数据。                     |\n| `exclude`    | list        | 要排除的序列化字段名称。           |\n| **返回值**   | `Sense2Vec` | 加载后的对象。                     |\n\n```python\ns2v_bytes = s2v.to_bytes()\nnew_s2v = Sense2Vec().from_bytes(s2v_bytes)\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.to_disk`\n\n将 `Sense2Vec` 对象序列化到一个目录中。\n\n| 参数  | 类型             | 描述                               |\n| ------- | ---------------- | ----------------------------------- |\n| `path`  | unicode \u002F `Path` | 目标路径。                         |\n| `exclude` | list             | 要排除的序列化字段名称。           |\n\n```python\ns2v.to_disk(\"\u002Fpath\u002Fto\u002Fsense2vec\")\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2Vec.from_disk`\n\n从目录中加载 `Sense2Vec` 对象。\n\n| 参数    | 类型             | 描述                               |\n| ------- | ---------------- | ----------------------------------- |\n| `path`  | unicode \u002F `Path` | 要从中加载的路径                   |\n| `exclude`   | list             | 要排除的序列化字段名称。           |\n| **返回值** | `Sense2Vec`      | 加载后的对象。                     |\n\n```python\ns2v.to_disk(\"\u002Fpath\u002Fto\u002Fsense2vec\")\nnew_s2v = Sense2Vec().from_disk(\"\u002Fpath\u002Fto\u002Fsense2vec\")\n```\n\n---\n\n### \u003Ckbd>类\u003C\u002Fkbd> `Sense2VecComponent`\n\n用于将 sense2vec 添加到 spaCy 管道中的管道组件。\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2VecComponent.__init__`\n\n初始化管道组件。\n\n| 参数        | 类型                                                 
                                                                 | 描述                                                                                                 |\n| --------------- | --------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- |\n| `vocab`         | `Vocab`                                                                                                               | 共享的 `Vocab`。主要用于共享的 `StringStore`。                                               |\n| `shape`         | 元组                                                                                                                 | 向量形状。                                                                                           |\n| `merge_phrases` | 布尔值                                                                                                                  | 是否将 sense2vec 短语合并为一个标记。默认为 `False`。                                     |\n| `lemmatize`     | 布尔值                                                                                                                  | 如果向量中存在词元，则始终查找词元，否则使用原始词。默认为 `False`。                     |\n| `overrides`     | 字典                                                                                                                  | 可选的自定义函数，映射到通过注册表注册的名称，例如 `{\"make_key\": \"custom_make_key\"}`。 |\n| **返回值**     | `Sense2VecComponent`                                                                                                  | 新创建的对象。                                                                               |\n\n```python\ns2v = Sense2VecComponent(nlp.vocab)\n```\n\n#### \u003Ckbd>类方法\u003C\u002Fkbd> `Sense2VecComponent.from_nlp`\n\n从 nlp 对象初始化组件。主要用于入口点的组件工厂（参见 setup.cfg），以及通过 `@spacy.component` 装饰器自动注册。\n\n| 参数    | 类型                 | 描述                   |\n| ----------- | -------------------- | ----------------------------- |\n| `nlp`       | `Language`           | `nlp` 对象。             
|\n| `**cfg`     | -                    | 可选配置参数。   |\n| **返回值** | `Sense2VecComponent` | 新创建的对象。 |\n\n```python\ns2v = Sense2VecComponent.from_nlp(nlp)\n```\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2VecComponent.__call__`\n\n使用该组件处理 `Doc` 对象。通常仅作为 spaCy 管道的一部分被调用，而不直接调用。\n\n| 参数    | 类型  | 描述              |\n| ----------- | ----- | ------------------------ |\n| `doc`       | `Doc` | 要处理的文档。 |\n| **返回值** | `Doc` | 处理后的文档。  |\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2VecComponent.init_component`\n\n组件特定的扩展属性在此方法中注册，并且只有当组件被添加到管道并实际使用时才会注册；否则，即使组件只是被创建而未添加到管道，标记也会获得这些属性。\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2VecComponent.to_bytes`\n\n将组件序列化为字节串。当组件被添加到管道并运行 `nlp.to_bytes` 时也会调用此方法。\n\n| 参数    | 类型  | 描述               |\n| ----------- | ----- | ------------------------- |\n| **返回值** | 字节 | 序列化的组件。 |\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2VecComponent.from_bytes`\n\n从字节串加载组件。当运行 `nlp.from_bytes` 时也会调用此方法。\n\n| 参数     | 类型                 | 描述        |\n| ------------ | -------------------- | ------------------ |\n| `bytes_data` | 字节                | 要加载的数据。  |\n| **返回值**  | `Sense2VecComponent` | 加载后的对象。 |\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2VecComponent.to_disk`\n\n将组件序列化到目录中。当组件被添加到管道并运行 `nlp.to_disk` 时也会调用此方法。\n\n| 参数 | 类型             | 描述 |\n| -------- | ---------------- | ----------- |\n| `path`   | unicode \u002F `Path` | 路径。   |\n\n#### \u003Ckbd>方法\u003C\u002Fkbd> `Sense2VecComponent.from_disk`\n\n从目录中加载 `Sense2Vec` 对象。当运行 `nlp.from_disk` 时也会调用此方法。\n\n| 参数    | 类型                 | 描述           |\n| ----------- | -------------------- | --------------------- |\n| `path`      | unicode \u002F `Path`     | 要加载的路径。 |\n| **返回值** | `Sense2VecComponent` | 加载后的对象。    |\n\n---\n\n### \u003Ckbd>class\u003C\u002Fkbd> `registry`\n\n函数注册表（由 [`catalogue`](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fcatalogue) 提供支持）用于轻松自定义生成键和短语的函数。它允许你为自定义函数添加装饰器并命名，替换这些函数，并在保存模型时序列化自定义名称。以下是可用的注册表选项：\n\n| 名称                      | 描述                                                
                                                                                                                                                                                        |\n| ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | \n| `registry.make_key`       | 给定一个 `word` 和 `sense`，返回键的字符串形式，例如 `\"word\\|sense\"`。                                                                                                                                                                      |\n| `registry.split_key`      | 给定一个字符串键，返回一个 `(word, sense)` 元组。                                                                                                                                                                                                |\n| `registry.make_spacy_key` | 给定一个 spaCy 对象（`Token` 或 `Span`）以及一个布尔类型的 `prefer_ents` 关键字参数（是否优先使用单个标记的实体标签），返回一个 `(word, sense)` 元组。该函数用于扩展属性中，为标记和跨度生成键。 |\n| `registry.get_phrases`    | 给定一个 spaCy `Doc`，返回用于 sense2vec 短语的 `Span` 对象列表（通常为名词短语和命名实体）。                                                                                                                       |\n| `registry.merge_phrases`  | 给定一个 spaCy `Doc`, 获取所有 sense2vec 短语并将它们合并为单个标记。                                                                                                                                                                  |\n\n每个注册表都提供一个 `register` 方法，可以用作函数装饰器，并接受一个参数，即自定义函数的名称。\n\n```python\nfrom sense2vec import registry\n\n@registry.make_key.register(\"custom\")\ndef custom_make_key(word, sense):\n    return f\"{word}###{sense}\"\n\n@registry.split_key.register(\"custom\")\ndef custom_split_key(key):\n    word, sense = key.split(\"###\")\n    return word, sense\n```\n\n在初始化 `Sense2Vec` 
对象时，现在可以传入一个包含自定义注册函数名称的覆盖字典。\n\n```python\noverrides = {\"make_key\": \"custom\", \"split_key\": \"custom\"}\ns2v = Sense2Vec(overrides=overrides)\n```\n\n这使得尝试不同的策略变得容易，并且可以将这些策略序列化为普通字符串（而无需传递或 pickle 函数本身）。\n\n## 🚂 训练你自己的 sense2vec 向量\n\n[`\u002Fscripts`](\u002Fscripts) 目录包含用于文本预处理和训练你自己的向量的命令行工具。\n\n### 要求\n\n要训练你自己的 sense2vec 向量，你需要以下内容：\n\n- 一个 **非常大** 的原始文本源（理想情况下比用于 word2vec 的数据量还要多，因为语义会使词汇表更加稀疏）。我们建议至少 10 亿词。\n- 一个 [预训练的 spaCy 模型](https:\u002F\u002Fspacy.io\u002Fmodels)，能够分配词性标签、依存关系和命名实体，并填充 `doc.noun_chunks`。如果你需要的语言没有内置的 [名词短语语法迭代器](https:\u002F\u002Fspacy.io\u002Fusage\u002Fadding-languages#syntax-iterators)，则需要自己编写。（`doc.noun_chunks` 和 `doc.ents` 是 sense2vec 用来确定什么是短语的关键部分。）\n- 已安装并构建好的 [GloVe](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002FGloVe) 或 [fastText](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText)。你应该能够克隆仓库并在相应目录中运行 `make` 命令。\n\n### 分步流程\n\n训练过程被拆分为多个步骤，以便您可以在任何时刻从中断处继续。处理脚本设计为对单个文件进行操作，从而便于并行化工作。此仓库中的脚本需要使用 [Glove](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002FGloVe) 或 [fastText](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText)，您需要先克隆这些仓库并执行 `make` 命令来编译。\n\n对于 FastText，脚本需要指定生成的二进制文件路径。如果您在 Windows 上工作，可以使用 `cmake` 进行编译，或者直接使用这个提供 Windows 版 FastText 二进制构建的**非官方**仓库中的 `.exe` 文件：https:\u002F\u002Fgithub.com\u002Fxiamx\u002FfastText\u002Freleases。\n\n|        | 脚本                                                                                                                                       | 描述                                                                                                                                                                                 |\n| ------ | -------------------------------------------------------------------------------------------------------------------------------------------- | 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| **1.** | [`01_parse.py`](scripts\u002F01_parse.py)                                                                                                         | 使用 spaCy 解析原始文本，并输出 `Doc` 对象的二进制集合（参见 [`DocBin`](https:\u002F\u002Fspacy.io\u002Fapi\u002Fdocbin)）。                                                               |\n| **2.** | [`02_preprocess.py`](scripts\u002F02_preprocess.py)                                                                                               | 加载上一步生成的已解析 `Doc` 对象集合，输出 sense2vec 格式的文本文件（每行为一个句子，合并带有语义的短语）。               |\n| **3.** | [`03_glove_build_counts.py`](scripts\u002F03_glove_build_counts.py)                                                                               | 使用 [GloVe](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002FGloVe) 构建词汇表和词频统计。如果您使用的是通过 [FastText](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText) 实现的 Word2Vec，则可跳过此步骤。 |\n| **4.** | [`04_glove_train_vectors.py`](scripts\u002F04_glove_train_vectors.py)\u003Cbr \u002F>[`04_fasttext_train_vectors.py`](scripts\u002F04_fasttext_train_vectors.py) | 使用 [GloVe](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002FGloVe) 或 [FastText](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText) 训练词向量。                                                             |\n| **5.** | [`05_export.py`](scripts\u002F05_export.py)                                                                                                       | 加载词向量和词频，并输出一个 sense2vec 组件，可通过 `Sense2Vec.from_disk` 加载。                                                                             |\n| **6.** | [`06_precompute_cache.py`](scripts\u002F06_precompute_cache.py)                                                                                   | **可选：** 预计算词汇表中每个条目的近邻查询，以加快 
`Sense2Vec.most_similar` 的速度。                                                                     |\n\n如需更详细的脚本说明，请查看源代码或运行脚本时添加 `--help` 参数。例如，`python scripts\u002F01_parse.py --help`。\n\n## 🍳 Prodigy 配方\n\n本包还与标注工具 [Prodigy](https:\u002F\u002Fprodi.gy) 无缝集成，并提供了利用 sense2vec 向量快速生成多词短语列表以及启动 NER 标注的配方。要使用这些配方，`sense2vec` 需安装在与 Prodigy 相同的环境中。有关实际应用场景的示例，请参阅此 [NER 项目](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fprojects\u002Ftree\u002Fmaster\u002Fner-fashion-brands)，其中包含可下载的数据集。\n\n以下是可用的配方，更多详细文档请见下文。\n\n| 配方                                                              | 描述                                                          |\n| ------------------------------------------------------------------- | -------------------------------------------------------------------- |\n| [`sense2vec.teach`](#recipe-sense2vecteach)                         | 使用 sense2vec 启动术语列表。                        |\n| [`sense2vec.to-patterns`](#recipe-sense2vecto-patterns)             | 将短语数据集转换为基于标记的匹配模式。               |\n| [`sense2vec.eval`](#recipe-sense2veceval)                           | 通过询问短语三元组来评估 sense2vec 模型。           |\n| [`sense2vec.eval-most-similar`](#recipe-sense2veceval-most-similar) | 通过纠正最相似的条目来评估 sense2vec 模型。   |\n| [`sense2vec.eval-ab`](#recipe-sense2veceval-ab)                     | 对两个预训练的 sense2vec 向量模型进行 A\u002FB 评估。 |\n\n### \u003Ckbd>配方\u003C\u002Fkbd> `sense2vec.teach`\n\n使用 sense2vec 启动术语列表。Prodigy 会根据 sense2vec 中最相似的短语建议相关术语，随着您标注并接受相似短语，建议也会相应调整。对于每个种子术语，将采用 sense2vec 向量所确定的最佳匹配语义。\n\n```bash\nprodigy sense2vec.teach [dataset] [vectors_path] [--seeds] [--threshold]\n[--n-similar] [--batch-size] [--resume]\n```\n\n| 参数             | 类型       | 描述                               |\n| -------------------- | ---------- | ----------------------------------------- |\n| `dataset`            | 必填位置参数 | 用于保存标注结果的数据集。           |\n| `vectors_path`       | 必填位置参数 | 预训练 sense2vec 向量的路径。     |\n| `--seeds`, `-s`      | 可选参数     | 一个或多个用逗号分隔的种子短语。 |\n| `--threshold`, `-t`  | 可选参数     | 
相似度阈值，默认为 `0.85`。 |\n| `--n-similar`, `-n`  | 可选参数     | 每次获取的相似项数量。   |\n| `--batch-size`, `-b` | 可选参数     | 提交标注的批次大小。    |\n| `--resume`, `-R`     | 标志         | 从现有短语数据集中续接。  |\n\n#### 示例\n\n```bash\nprodigy sense2vec.teach tech_phrases \u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\n--seeds \"natural language processing, machine learning, artificial intelligence\"\n```\n\n### \u003Ckbd>recipe\u003C\u002Fkbd> `sense2vec.to-patterns`\n\n将使用 `sense2vec.teach` 收集的短语数据集转换为基于标记的匹配模式，这些模式可以与\n[spaCy 的 `EntityRuler`](https:\u002F\u002Fspacy.io\u002Fusage\u002Frule-based-matching#entityruler)\n或类似 `ner.match` 的配方一起使用。如果未指定输出文件，则模式会写入标准输出。示例会被分词，以便多标记术语能够正确表示，例如：\n`{\"label\": \"SHOE_BRAND\", \"pattern\": [{ \"LOWER\": \"new\" }, { \"LOWER\": \"balance\" }]}`。\n\n```bash\nprodigy sense2vec.to-patterns [dataset] [spacy_model] [label] [--output-file]\n[--case-sensitive] [--dry]\n```\n\n| 参数                  | 类型       | 描述                                  |\n| ------------------------- | ---------- | -------------------------------------------- |\n| `dataset`                 | 必需位置参数 | 要转换的短语数据集。                   |\n| `spacy_model`             | 必需位置参数 | 用于分词的 spaCy 模型。                |\n| `label`                   | 必需位置参数 | 应用于所有模式的标签。              |\n| `--output-file`, `-o`     | 可选参数     | 可选的输出文件。默认为标准输出。    |\n| `--case-sensitive`, `-CS` | 标志       | 使模式区分大小写。                |\n| `--dry`, `-D`             | 标志       | 执行试运行，不输出任何内容。      |\n\n#### 示例\n\n```bash\nprodigy sense2vec.to-patterns tech_phrases en_core_web_sm TECHNOLOGY\n--output-file \u002Fpath\u002Fto\u002Fpatterns.jsonl\n```\n\n### \u003Ckbd>recipe\u003C\u002Fkbd> `sense2vec.eval`\n\n通过询问短语三元组来评估 sense2vec 模型：单词 A 更类似于单词 B，还是更类似于单词 C？如果人类的判断大多与模型一致，则该向量模型表现良好。该配方只会询问具有相同语义的向量，并支持不同的示例选择策略。\n\n```bash\nprodigy sense2vec.eval [dataset] [vectors_path] [--strategy] [--senses]\n[--exclude-senses] [--n-freq] [--threshold] [--batch-size] [--eval-whole]\n[--eval-only] [--show-scores]\n```\n\n| 参数                  | 类型       | 描述                                      
                                                             |\n| ------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------- |\n| `dataset`                 | 必需位置参数 | 用于保存标注的数据集。                                                                               |\n| `vectors_path`            | 必需位置参数 | 预训练 sense2vec 向量的路径。                                                                         |\n| `--strategy`, `-st`       | 可选参数     | 示例选择策略。`most_similar`（默认）或 `random`。                                                     |\n| `--senses`, `-s`          | 可选参数     | 以逗号分隔的语义列表，用于限制选择范围。若未设置，则会使用向量中的所有语义。                     |\n| `--exclude-senses`, `-es` | 可选参数     | 以逗号分隔的要排除的语义列表。默认值请参阅 `prodigy_recipes.EVAL_EXCLUDE_SENSES`。                    |\n| `--n-freq`, `-f`          | 可选参数     | 限制为出现频率最高的若干条目。                                                                       |\n| `--threshold`, `-t`       | 可选参数     | 考虑示例时使用的最小相似度阈值。                                                                     |\n| `--batch-size`, `-b`      | 可选参数     | 使用的批次大小。                                                                                     |\n| `--eval-whole`, `-E`      | 标志       | 评估整个数据集，而不是当前会话。                                                                   |\n| `--eval-only`, `-O`       | 标志       | 不进行标注，仅评估当前数据集。                                                                     |\n| `--show-scores`, `-S`     | 标志       | 显示所有分数以供调试。                                                                               |\n\n#### 策略\n\n| 名称                 | 描述                                                                                                                                                           |\n| -------------------- | 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `most_similar`       | 从随机语义中随机选取一个单词，获取其属于同一语义的最相似条目，然后询问该单词与这些条目中最后一条和中间一条的相似度。 |\n| `most_least_similar` | 从随机语义中随机选取一个单词，获取其最相似条目中最不相似的一条，然后再获取其中最相似的最后一条。                   |\n| `random`             | 从同一个随机语义中随机选取 3 个单词作为样本。                                                                                                           |\n\n#### 示例\n\n```bash\nprodigy sense2vec.eval vectors_eval \u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\n--senses NOUN,ORG,PRODUCT --threshold 0.5\n```\n\n![sense2vec.eval 的 UI 预览图](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fexplosion_sense2vec_readme_50920dc027ab.png)\n\n### \u003Ckbd>recipe\u003C\u002Fkbd> `sense2vec.eval-most-similar`\n\n通过查看模型为随机短语返回的最相似条目，并剔除其中的错误项，来评估向量模型。\n\n```bash\nprodigy sense2vec.eval-most-similar [dataset] [vectors_path] [--senses] [--exclude-senses]\n[--n-freq] [--n-similar] [--batch-size] [--eval-whole] [--eval-only]\n[--show-scores]\n```\n\n| 参数                  | 类型       | 描述                                                                                                   |\n| ------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------- |\n| `dataset`                 | 位置参数 | 用于保存标注的数据集。                                                                               |\n| `vectors_path`            | 位置参数 | 预训练 sense2vec 向量的路径。                                                                         |\n| `--senses`, `-s`          | 选项     | 以逗号分隔的词义列表，用于限制选择范围。若未设置，则会使用向量中的所有词义。                     |\n| `--exclude-senses`, `-es` | 选项     | 以逗号分隔的要排除的词义列表。默认值请参阅 `prodigy_recipes.EVAL_EXCLUDE_SENSES`。                |\n| `--n-freq`, `-f`          | 选项     | 限制为出现频率最高的若干条目。                                                                       |\n| `--n-similar`, `-n`    
   | 选项     | 要检查的相似项数量。默认值为 `10`。                                                                  |\n| `--batch-size`, `-b`      | 选项     | 使用的批次大小。                                                                                     |\n| `--eval-whole`, `-E`      | 标志       | 对整个数据集进行评估，而不是仅对当前会话。                                                           |\n| `--eval-only`, `-O`       | 标志       | 不进行标注，仅对当前数据集进行评估。                                                               |\n| `--show-scores`, `-S`     | 标志       | 显示所有得分以供调试。                                                                               |\n\n```bash\nprodigy sense2vec.eval-most-similar vectors_eval_sim \u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\n--senses NOUN,ORG,PRODUCT\n```\n\n### \u003Ckbd>recipe\u003C\u002Fkbd> `sense2vec.eval-ab`\n\n比较两个预训练 sense2vec 向量模型为随机短语返回的最相似条目，以此执行 A\u002FB 评估。界面会以随机顺序显示两个选项，分别展示每个模型的最相似条目，并高亮显示两者不同的短语。标注会话结束时，会显示总体统计信息以及表现更优的模型。\n\n```bash\nprodigy sense2vec.eval-ab [dataset] [vectors_path_a] [vectors_path_b] [--senses]\n[--exclude-senses] [--n-freq] [--n-similar] [--batch-size] [--eval-whole]\n[--eval-only] [--show-mapping]\n```\n\n| 参数                  | 类型       | 描述                                                                                                   |\n| ------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------- |\n| `dataset`                 | 位置参数 | 用于保存标注的数据集。                                                                               |\n| `vectors_path_a`          | 位置参数 | 第一组预训练 sense2vec 向量（模型 A）的路径。                                                         |\n| `vectors_path_b`          | 位置参数 | 第二组预训练 sense2vec 向量（模型 B）的路径。                                                         |\n| `--senses`, `-s`          | 选项     | 以逗号分隔的词义列表，用于限制选择范围。若未设置，则会使用向量中的所有词义。                     |\n| `--exclude-senses`, `-es` | 选项     | 以逗号分隔的要排除的词义列表。默认值请参阅 
`prodigy_recipes.EVAL_EXCLUDE_SENSES`。                |\n| `--n-freq`, `-f`          | 选项     | 限制为出现频率最高的若干条目。                                                                       |\n| `--n-similar`, `-n`       | 选项     | 要检查的相似项数量。默认值为 `10`。                                                                  |\n| `--batch-size`, `-b`      | 选项     | 使用的批次大小。                                                                                     |\n| `--eval-whole`, `-E`      | 标志       | 对整个数据集进行评估，而不是仅对当前会话。                                                         |\n| `--eval-only`, `-O`       | 标志       | 不进行标注，仅对当前数据集进行评估。                                                               |\n| `--show-mapping`, `-S`    | 标志       | 在界面上显示哪个模型是选项 1，哪个是选项 2（用于调试）。                                           |\n\n```bash\nprodigy sense2vec.eval-ab vectors_eval_sim \u002Fpath\u002Fto\u002Fs2v_reddit_2015_md \u002Fpath\u002Fto\u002Fs2v_reddit_2019_md --senses NOUN,ORG,PRODUCT\n```\n\n![sense2vec.eval-ab 的界面预览](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fexplosion_sense2vec_readme_c84126360637.png)\n\n## 预训练向量\n\n预训练的 Reddit 向量支持以下“语义”，即词性标签或实体标签。更多详情请参阅 spaCy 的 [标注方案概述](https:\u002F\u002Fspacy.io\u002Fapi\u002Fannotation)。\n\n| 标签     | 描述               | 示例                             |\n| ------- | ------------------------- | ------------------------------------ |\n| `ADJ`   | 形容词                 | 大的、旧的、绿色的                      |\n| `ADP`   | 介词                   | 在、向、在……期间                       |\n| `ADV`   | 副词                   | 非常、明天、向下、哪里                  |\n| `AUX`   | 助动词                | 是、已经（完成）、将会（做）            |\n| `CONJ`  | 连词                   | 和、或、但                             |\n| `DET`   | 限定词                 | 一个、一只、那                        |\n| `INTJ`  | 感叹词                | 嘘、哎哟、好极了、你好                 |\n| `NOUN`  | 名词                   | 女孩、猫、树、空气、美丽               |\n| `NUM`   | 数词                   | 1、2017年、一、七十七、MMXIV           |\n| 
`PART`  | 助词                   | ‘s、不                                |\n| `PRON`  | 代词                   | 我、你、他、她、我自己、某人            |\n| `PROPN` | 专有名词               | 玛丽、约翰、伦敦、北约、HBO             |\n| `PUNCT` | 标点符号               | ，？（）                               |\n| `SCONJ` | 从属连词               | 如果、当……时、那                       |\n| `SYM`   | 符号                   | $、%、=、:)、😝                       |\n| `VERB`  | 动词                   | 跑、跑步、吃、吃了、正在吃             |\n\n| 实体标签  | 描述                                          |\n| ------------- | ---------------------------------------------------- |\n| `PERSON`      | 人物，包括虚构人物。                         |\n| `NORP`        | 国籍、宗教或政治团体。                      |\n| `FACILITY`    | 建筑物、机场、高速公路、桥梁等。         |\n| `ORG`         | 公司、机构、组织等。                      |\n| `GPE`         | 国家、城市、州。                           |\n| `LOC`         | 非 GPE 地点，如山脉、水域等。              |\n| `PRODUCT`     | 物品、车辆、食品等（不包括服务）。       |\n| `EVENT`       | 有名称的飓风、战役、战争、体育赛事等。     |\n| `WORK_OF_ART` | 书籍、歌曲等作品的标题。                  |\n| `LANGUAGE`    | 任何有名称的语言。                          |","# sense2vec 快速上手指南\n\nsense2vec 是 word2vec 的增强版，能够基于词性标签和实体标签学习更精细的多词短语向量。它与 spaCy 深度集成，适用于语义相似度查询、规则构建及 NER 标注引导。\n\n## 环境准备\n\n*   **操作系统**：Linux, macOS, Windows\n*   **Python 版本**：3.6+\n*   **核心依赖**：\n    *   `spacy` (推荐 v3.0+)\n    *   `numpy`\n    *   `srsly`\n*   **前置模型**：若作为 spaCy 组件使用，需预先下载 spaCy 语言模型（如 `en_core_web_sm`）。\n\n## 安装步骤\n\n### 1. 安装 sense2vec\n使用 pip 直接安装最新版：\n\n```bash\npip install sense2vec\n```\n\n> **提示**：国内用户如遇下载缓慢，可使用清华或阿里镜像源：\n> ```bash\n> pip install sense2vec -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n### 2. 
下载预训练向量\nsense2vec 本身不包含向量数据，需从 GitHub Release 页面下载预训练好的向量包（基于 Reddit 评论训练）。\n\n*   **轻量版 (2015 数据，约 573MB)**: `s2v_reddit_2015_md`\n*   **大型版 (2019 数据，约 4GB，分卷压缩)**: `s2v_reddit_2019_lg`\n\n下载后解压，记下文件夹路径（例如 `\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md`）。\n\n*(注：大型版分卷文件需在 Linux\u002FmacOS 下合并：`cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz`)*\n\n## 基本使用\n\n### 方式一：独立模式 (Standalone)\n直接加载向量库进行查询，无需 spaCy 管道。\n\n```python\nfrom sense2vec import Sense2Vec\n\n# 加载本地向量目录\ns2v = Sense2Vec().from_disk(\"\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\")\n\n# 构造查询键：短语_下划线连接 | 词性\u002F标签\nquery = \"natural_language_processing|NOUN\"\n\n# 检查是否存在\nif query in s2v:\n    # 获取向量\n    vector = s2v[query]\n    \n    # 获取频率\n    freq = s2v.get_freq(query)\n    \n    # 获取最相似的 3 个短语\n    most_similar = s2v.most_similar(query, n=3)\n    print(most_similar)\n    # 输出示例：[('machine_learning|NOUN', 0.898...), ('computer_vision|NOUN', 0.863...), ...]\n```\n\n### 方式二：作为 spaCy v3 管道组件\n集成到 spaCy 流程中，自动处理分词、词性标注和实体识别，支持通过 `._` 属性直接访问。\n\n```python\nimport spacy\nfrom sense2vec import Sense2VecComponent\n\n# 加载 spaCy 模型\nnlp = spacy.load(\"en_core_web_sm\")\n\n# 添加 sense2vec 组件到管道末尾\ns2v = nlp.add_pipe(\"sense2vec\")\ns2v.from_disk(\"\u002Fpath\u002Fto\u002Fs2v_reddit_2015_md\")\n\n# 处理文本\ndoc = nlp(\"A sentence about natural language processing.\")\n\n# 获取短语片段 (Span)\nphrase_span = doc[3:6] # \"natural language processing\"\n\n# 访问扩展属性\nif phrase_span._.in_s2v:\n    freq = phrase_span._.s2v_freq\n    vector = phrase_span._.s2v_vec\n    similar = phrase_span._.s2v_most_similar(3)\n    \n    print(similar)\n    # 输出示例：[(('machine learning', 'NOUN'), 0.898...), ...]\n\n# 实体查询示例 (自动使用实体标签作为 Sense)\ndoc_ent = nlp(\"Facebook and Google are tech giants.\")\nfor ent in doc_ent.ents:\n    if ent._.in_s2v:\n        print(f\"{ent.text}: {ent._.s2v_most_similar(2)}\")\n```\n\n### 关键注意事项\n1.  **Key 格式**：独立模式下查询键必须严格遵循 `短语文本 (空格转下划线)|标签` 格式（如 `machine_learning|NOUN`），且区分大小写。\n2.  
**Span 限制**：在 spaCy 模式中，建议只对模型识别出的名词短语或命名实体使用 sense2vec 属性；对任意文本切片使用时，可能找不到对应的 Key。\n3.  **默认 Sense**：若非实体且未指定词性，系统默认使用根节点的词性标签作为 Sense。","某电商公司的数据团队正在构建一个智能客服系统，需要从海量用户评论中自动识别并关联具体的产品功能问题。\n\n### 没有 sense2vec 时\n- **一词多义导致误判**：传统词向量无法区分“苹果”是指水果还是手机品牌，导致将“屏幕碎了”错误关联到水果类目。\n- **短语语义丢失**：模型只能处理单个词汇，无法理解“电池续航”或“快充技术”等多词短语的整体含义，检索结果支离破碎。\n- **上下文感知缺失**：无法根据词性（如名词 vs 动词）调整语义表示，难以精准匹配用户描述的具体故障场景。\n- **冷启动效率低**：面对新出现的产品术语（如“潜望式镜头”），需要大量标注数据重新训练模型才能识别相似概念。\n\n### 使用 sense2vec 后\n- **精准消歧**：sense2vec 通过“词 + 词性\u002F实体标签”（如 `apple|ORG`）的键值结构，可以区分不同语境下的同一词汇，有助于提高故障归类的准确性。\n- **原生支持短语**：直接为“电池续航|NOUN”等多词短语生成独立向量，能一次性召回“耗电快”、“充电慢”等语义高度相关的用户反馈。\n- **按上下文选取语义**：利用 spaCy 管道组件自动标注词性和实体，并据此为每个短语选取对应语义的向量，有助于理解复杂的投诉句子。\n- **快速迁移学习**：基于预训练的 Reddit 大规模语料库，无需从零训练即可查询语料中已收录的科技词汇，有助于缩短新产品的模型上线周期。\n\nsense2vec 通过将词性和实体标签纳入向量的键，缓解了传统词向量在处理多义词和复合短语时的语义模糊问题。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fexplosion_sense2vec_73d80fe6.jpg","explosion","Explosion","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fexplosion_6bc9ac84.png","Software company specializing in developer tools and tailored solutions for AI and Natural Language Processing",null,"contact@explosion.ai","https:\u002F\u002Fexplosion.ai","https:\u002F\u002Fgithub.com\u002Fexplosion",[82,86],{"name":83,"color":84,"percentage":85},"Python","#3572A5",99.8,{"name":87,"color":88,"percentage":89},"Shell","#89e051",0.2,1673,238,"2026-04-02T08:33:21","MIT","Linux, macOS, Windows","未说明","未说明 (预训练向量文件最大约 4GB，加载需相应内存)",{"notes":98,"python":95,"dependencies":99},"该工具主要作为 spaCy v3 的管道组件使用。预训练向量文件较大（最大分卷包约 4GB），需手动下载并解压后通过 from_disk 加载。支持独立使用或集成到 spaCy 流程中。若使用多部分压缩的大型向量文件，需在 Linux\u002FmacOS 下使用 cat 命令合并后再解压。",[100,101,102],"spacy>=3.0","numpy","streamlit 
(可选，用于演示)",[14,35],[105,106,107,108,109,64,110,111,112],"spacy","nlp","natural-language-processing","word2vec","python","gensim","gensim-word2vec","machine-learning","2026-03-27T02:49:30.150509","2026-04-14T04:24:05.976442",[116,121,126,131,135,139],{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},32477,"如何使用自己的数据集（而非 Reddit）训练 Sense2Vec 模型？","您可以使用重构后的脚本来训练自定义数据集。预处理脚本接受一个文本文件作为输入并输出一个文本文件；训练脚本接受一个文本文件或包含文本文件的目录作为输入，并将输出保存到指定的目录路径（如果目录不存在会自动创建）。输出的模型组件目录可以通过 `Sense2Vec.from_disk` 加载。相关脚本和说明请参阅：https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Ftree\u002Frefactor\u002Fscripts 和 https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Fblob\u002Frefactor\u002FREADME.md","https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Fissues\u002F36",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},32478,"在 Windows 上运行 sense2vec 脚本时遇到 `os.system` 命令不兼容的问题怎么办？","建议修改脚本以使用 Python 版的 fastText 库（通过 pip 安装），而不是依赖二进制的 fastText 命令行工具。这样可以避免使用 `os.system` 调用外部命令，从而提高跨平台兼容性（包括 Windows）。已有用户确认通过 PR 修改后，使用 pip 安装的 fastText 库可以在 Windows 上正常运行，无需额外的二进制构建或 CLI 代码。","https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Fissues\u002F105",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},32479,"执行 `python -m sense2vec.download` 下载模型时出现 SSL 错误或连接失败如何解决？","这是一个已知的 SSL 验证问题。临时解决方法是修改本地安装的 `sputnik` 库中的 `session.py` 文件，禁用 SSL 证书验证。具体步骤：\n1. 找到 `sputnik\u002Fsession.py` 文件（通常位于 site-packages 目录下）。\n2. 添加导入：`import ssl` 和 `from urllib.request import ..., HTTPSHandler`。\n3. 
在 `build_opener` 之前添加以下代码以创建不验证证书的上下文：\n   ```python\n   ctx = ssl.create_default_context()\n   ctx.check_hostname = False\n   ctx.verify_mode = ssl.CERT_NONE\n   self.opener = build_opener(HTTPSHandler(context=ctx), ...)\n   ```\n注意：此方法仅用于调试，生产环境请确保证书安全。","https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Fissues\u002F15",{"id":132,"question_zh":133,"answer_zh":134,"source_url":120},32480,"训练 Sense2Vec 模型需要多长时间？CPU 是否足够？","训练时间取决于数据量和硬件配置。对于较小的数据集（如几 MB 的文本文件），CPU 通常足以完成训练，但具体时间需根据实际测试确定。如果是大规模数据或对时间敏感的任务（如几分钟内交付），建议使用 GPU 加速或优化预处理流程。目前官方脚本支持 CPU 训练，但未提供精确的时间估算公式，建议先用小样本数据进行基准测试。",{"id":136,"question_zh":137,"answer_zh":138,"source_url":125},32481,"在 Windows 上使用 fastText 训练向量时，如何替代 `fasttext dump` 命令生成 vocab 文件？","由于 Windows 不支持直接调用 `fasttext` 二进制命令，建议使用 Python 的 fastText 库来加载模型并提取词汇表信息，而不是依赖命令行工具。虽然中间文件格式可能不同，但只要最终能生成 `05_export.py` 所需的输入格式即可。用户反馈表明，通过完全使用 Python API（而非 `os.system`）重写训练脚本后，可以成功生成兼容的输出文件，无需手动处理二进制 dump 输出。",{"id":140,"question_zh":141,"answer_zh":142,"source_url":120},32482,"遇到 AttributeError: 'spacy.vectors.Vectors' object has no attribute 'borrow' 错误怎么办？","该错误通常是由于 spaCy 版本不兼容导致的。sense2vec 的较新版本已经进行了重构（见 issue #77），修复了与新版 spaCy 的兼容性问题。请确保您使用的是最新版本的 sense2vec 和对应的 spaCy 版本，并参考最新的 README 和脚本示例进行操作。如果问题仍然存在，尝试在虚拟环境中重新安装匹配的依赖版本。",[144,149,154,159,164,169,174,179,183,187,191,195,199,203,207,211,216,221],{"id":145,"version":146,"summary_zh":147,"released_at":148},247269,"v2.0.2","* 修复问题 #155：反序列化后将向量固定到 CPU 上（https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Fpull\u002F157）。","2023-04-17T13:11:14",{"id":150,"version":151,"summary_zh":152,"released_at":153},247270,"v2.0.1","* 在 `sense2vec.teach` Prodigy 配置中：仅当没有种子词时才失败。\n* 将对 `wasabi` 的支持扩展到 v1.1.x 版本。","2022-12-08T13:16:53",{"id":155,"version":156,"summary_zh":157,"released_at":158},247271,"v2.0.0","* 更新 [spaCy v3](https:\u002F\u002Fspacy.io\u002Fusage\u002Fv3) 的组件和内部实现。","2021-02-07T06:11:17",{"id":160,"version":161,"summary_zh":162,"released_at":163},247272,"v1.0.3","* 
各种小修复和改进。\n* 优化训练脚本。\n* 修复问题 #102：拆分二进制 `.spacy` 文件。\n* 修复问题 #118：修正 `s2v_other_senses` 中的拼写错误。\n\n感谢 @ahalterman、@dshefman1 和 @Anxo06 提交的拉取请求！","2021-02-07T04:36:05",{"id":165,"version":166,"summary_zh":167,"released_at":168},247273,"v1.0.2","## 🔴 Bug修复\n\n* 如果保存的模型中不包含某些属性，则为配置添加默认值。\n* 修复组件中字符串存储的序列化和反序列化问题。","2019-11-22T17:40:43",{"id":170,"version":171,"summary_zh":172,"released_at":173},247274,"v1.0.1","## 🔴 错误修复\n\n* 修复了一个 bug，该 bug 会导致无法从预计算的 `most_similar` 缓存中正确读取分数。","2019-11-22T15:49:56",{"id":175,"version":176,"summary_zh":177,"released_at":178},247275,"v1.0.0","## ✨ 新功能与改进\n\n* 从头彻底重写整个包。\n* 将内置的向量存储替换为 spaCy 的 [`Vectors`](https:\u002F\u002Fspacy.io\u002Fapi\u002Fvectors)，使该包成为纯 Python 包，并支持向量的开箱即用序列化。\n* 添加完全可序列化的 spaCy 管道组件和扩展属性。\n* 新增 `get_best_sense` 和 `get_other_senses` 方法，并优化了 `most_similar`。\n* 添加用于预计算最近邻索引的脚本，以实现超快速的“最相似”查询。\n* 为 [Prodigy](https:\u002F\u002Fprodi.gy) 添加标注配方，方便使用 sense2vec 向量从相似短语中创建词表和匹配模式（类似于 `terms.teach` 配方，但适用于多词表达）。\n* 使用 GloVe 和 fastText 的全新且更高效的 [训练与预处理脚本](scripts)。\n\n## ⚠️ 向后不兼容变更\n\n* 已移除 `sense2vec.load` 方法。请改用 `Sense2Vec.from_disk`。\n* 之前的 `VectorMap` 和 `VectorStorage` 已被移除。\n* 该包现需 Python 3.6 或更高版本。\n* 此更新需要新的向量格式（详见附带文件）。\n\n## 📖 文档与示例\n\n* 从头重写 [`README`](README.md)，并包含完整的 API 文档。\n\n## 👥 贡献者\n\n感谢 @kabirkhan 贡献了最初的 Prodigy 
配方！","2019-11-22T15:07:44",{"id":180,"version":181,"summary_zh":77,"released_at":182},247276,"v1.0.0a10","2019-11-21T20:08:54",{"id":184,"version":185,"summary_zh":77,"released_at":186},247277,"v1.0.0a9","2019-11-21T01:57:57",{"id":188,"version":189,"summary_zh":77,"released_at":190},247278,"v1.0.0a8","2019-11-19T15:48:25",{"id":192,"version":193,"summary_zh":77,"released_at":194},247279,"v1.0.0a7","2019-11-19T15:45:32",{"id":196,"version":197,"summary_zh":77,"released_at":198},247280,"v1.0.0a6","2019-11-03T17:17:36",{"id":200,"version":201,"summary_zh":77,"released_at":202},247281,"v1.0.0a5","2019-11-02T16:46:09",{"id":204,"version":205,"summary_zh":77,"released_at":206},247282,"v1.0.0a4","2019-11-02T16:45:56",{"id":208,"version":209,"summary_zh":77,"released_at":210},247283,"v1.0.0a3","2019-11-02T16:45:36",{"id":212,"version":213,"summary_zh":214,"released_at":215},247284,"v1.0.0a2","> **⚠️ 这是一个 Alpha 版本，尚未准备好用于生产环境。** 您可以通过指定确切版本号，使用 pip 下载 sense2vec。\n> ```bash\n> pip install sense2vec==1.0.0a2\n> ```\n> 此版本附带了一个 `.tar.gz` 文件，其中包含了转换后的 Reddit 向量（基于 2015 年的所有评论训练而成）。有关更多详细信息和使用说明，请参阅 [`README`](README.md)。\n\n---\n\n## ✨ 新功能与改进\n\n* 从头彻底重写整个包。\n* 将内置向量存储替换为 spaCy 的 [`Vectors`](https:\u002F\u002Fspacy.io\u002Fapi\u002Fvectors)，使该包成为纯 Python 包，并支持向量的开箱即用序列化。\n* 添加完全可序列化的 spaCy 管道组件和扩展属性。\n* 新增 `get_best_sense` 和 `get_other_senses` 方法，并改进了 `most_similar` 方法。\n* 为 [Prodigy](https:\u002F\u002Fprodi.gy) 添加标注配方，方便使用 sense2vec 向量从相似短语中创建词表和匹配模式（类似于 `terms.teach` 配方，但适用于多词表达）。\n* 使用 GloVe 的全新改进版训练与预处理脚本。\n\n## ⚠️ 向后不兼容的变更\n\n* 已移除 `sense2vec.load` 方法，请改用 `Sense2Vec.from_disk`。\n* 之前的 `VectorMap` 和 `VectorStorage` 已被移除。\n* 本包现需 Python 3.6 或更高版本。\n* 此更新需要采用新的向量格式（详见随附的 `.tar.gz` 文件）。\n\n## 📖 文档与示例\n\n* 从头重写 [`README`](README.md)，并包含完整的 API 文档。\n\n## 👥 贡献者\n\n感谢 @kabirkhan 贡献了 Prodigy 配方！","2019-10-31T21:17:11",{"id":217,"version":218,"summary_zh":219,"released_at":220},247285,"v1.0.0a1","> **⚠️ 这是一个 Alpha 版本，尚未准备好用于生产环境。** 您可以通过指定确切版本号，使用 pip 下载 sense2vec。\n> 
```bash\n> pip install sense2vec==1.0.0a1\n> ```\n> 请注意，该库不再依赖 spaCy，因此您可能需要单独[安装 spaCy](https:\u002F\u002Fspacy.io\u002Fusage) 和英语模型。本次发布附带了一个 `.tar.gz` 文件，其中包含了 Reddit 向量（基于 2015 年的所有评论训练而成）。有关更多详细信息和使用说明，请参阅 [`README`](README.md)。\n\n---\n\n## ✨ 新功能与改进\n\n* **新增：** 移除对 spaCy 的依赖，允许独立使用 `sense2vec` 库。\n* **新增：** 包含 spaCy v2.x 的[管道组件](https:\u002F\u002Fspacy.io\u002Fusage\u002Fprocessing-pipelines#custom-components)，以添加与 sense2vec 兼容的标记合并功能以及标记属性和方法。\n* 将 `reddit_vectors` 模型附加到发布包中，使其更易于下载和加载。\n\n## 📖 文档与示例\n\n* 从头重写 [`README`](README.md)，并包含完整的 API 文档。\n\n## 🚧 待办事项\n\n- [ ] 使用 spaCy 的 `Vectors` 类替换 `VectorMap` 实现。\n- [ ] 不再在运行时合并标记，并相应调整扩展属性。\n- [ ] 更新适用于 spaCy v2.x 的训练和预处理脚本。\n- [ ] 使用更多数据重新训练向量。","2019-09-12T14:12:58",{"id":222,"version":223,"summary_zh":224,"released_at":225},247286,"v1.0.0a0","> **⚠️ 这是一个 Alpha 版本，尚未准备好用于生产环境。** 您可以通过指定确切版本号，使用 pip 下载 sense2vec。\n> ```bash\n> pip install sense2vec==1.0.0a0\n> ```\n> 请注意，该库不再依赖 spaCy，因此您可能需要单独[安装 spaCy](https:\u002F\u002Fspacy.io\u002Fusage) 和英语模型。本次发布附带了一个 `.tar.gz` 文件，其中包含了 Reddit 向量（基于 2015 年的所有评论训练而成）。有关更多详细信息和使用说明，请参阅 [`README`](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Ftree\u002Fdevelop\u002FREADME.rst)。\n\n---\n\n## ✨ 新功能与改进\n\n* **新增：** 移除对 spaCy 的依赖，允许独立使用 `sense2vec` 库。\n* **新增：** 包含 spaCy v2.x 的[管道组件](https:\u002F\u002Fspacy.io\u002Fusage\u002Fprocessing-pipelines#custom-components)，以添加与 sense2vec 兼容的标记合并功能以及标记属性和方法。\n* 将 `reddit_vectors` 模型附加到发布包中，使其更易于下载和加载。\n\n## 📖 文档与示例\n\n* 从头重写 [`README`](https:\u002F\u002Fgithub.com\u002Fexplosion\u002Fsense2vec\u002Ftree\u002Fdevelop\u002FREADME.rst)，并包含完整的 API 文档。\n\n## 🚧 待办事项\n\n- [ ] 更新适用于 spaCy v2.x 的训练和预处理脚本。","2018-04-08T15:31:45"]