[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Muennighoff--sgpt":3,"tool-Muennighoff--sgpt":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 
多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":78,"owner_location":79,"owner_email":80,"owner_twitter":75,"owner_website":81,"owner_url":82,"languages":83,"stars":96,"forks":97,"last_commit_at":98,"license":99,"difficulty_score":10,"env_os":100,"env_gpu":101,"env_ram":102,"env_deps":103,"category_tags":110,"github_topics":111,"view_count":23,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":121,"updated_at":122,"faqs":123,"releases":164},2659,"Muennighoff\u002Fsgpt","sgpt","SGPT: GPT Sentence Embeddings for Semantic Search","sgpt 是一个基于 GPT 大模型的开源项目，旨在将强大的语言生成能力转化为高效的语义搜索工具。它主要解决了传统小模型在理解复杂语境和长文本语义时精度不足的问题，让开发者能够利用 GPT 的强大表征能力进行高精度的文档检索、问答匹配及语义相似度计算。\n\n该项目特别适合 AI 研究人员、后端开发者以及需要构建企业级知识库或智能搜索系统的技术团队使用。sgpt 的核心亮点在于其独特的技术路径：它提出了两种模式——SGPT-BE（双编码器）和 SGPT-CE（交叉编码器）。其中，SGPT-BE 通过仅微调偏置张量并结合位置加权平均池化，就能生成高质量的句子向量，极大降低了训练成本；而 SGPT-CE 则无需任何微调，直接利用 GPT 原有的对数概率机制即可实现卓越的排序效果。此外，sgpt 不仅支持英语，还发布了基于 BLOOM 的多语言版本，并能无缝集成到流行的 Sentence Transformers 库中，方便用户快速部署对称或非对称的搜索任务。虽然作者近期推荐了性能更统一的新一代模型 GRIT，但 sgpt 作为验证大模型嵌入能力的经典方案，依然具有重要的参考和应用价值。","# SGPT: GPT Sentence Embeddings for Semantic Search\n\nThis repository contains code, results & pre-trained models for the paper [SGPT: GPT Sentence Embeddings for Semantic Search](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.08904).\n\n**************************** Updates ****************************\n\n* 2024-02: We released [GRIT & GritLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.09906) - These models unify SGPT Bi-Encoders, Cross-Encoders, symmetric, asymmetric, and regular GPT (i.e. generation) all in 1 single model at much better performance on all accounts. We recommend switching to these new models :)\n* 2022-09: SGPT Bi-Encoders are now easy to use with [Sentence Transformers](https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers), see [new scripts](#use-sgpt-with-sentence-transformers)\n* 2022-08: Multilingual BLOOM SGPT models were released: [Asymmetric, 7.1B parameters](https:\u002F\u002Fhuggingface.co\u002Fbigscience\u002Fsgpt-bloom-7b1-msmarco) & [Symmetric, 1.7B parameters](https:\u002F\u002Fhuggingface.co\u002Fbigscience-data\u002Fsgpt-bloom-1b7-nli). Feel free to open an issue if you need a different model.\n* 2022-06: OpenAI released the mechanism of their Search Endpoint that we compared to SGPT Cross-Encoders in the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.08904). Our methods are very similar. 
Feel free to test their prompt as seen in `crossencoder\u002Fbeir\u002Fopenai_search_endpoint_functionality.py`!\n* 2022-03: 5.8B Bi-Encoder models are now 4% & 1% better on USEB & BEIR, respectively. [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.08904) & [models](https:\u002F\u002Fhuggingface.co\u002Fmodels?search=sgpt-5.8b) on HF have been updated. This has been done by using larger batch sizes with GradCache, see the paper for more info. If you have previously downloaded them, we recommend replacing it with the new version.\n* 2022-02: We released [our paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.08904). Check it out! :)\n\n## Quick Links\n\n- [Overview](#overview)\n- [Structure](#structure)\n- [Use SGPT with Huggingface](#use-sgpt-with-huggingface)\n    - [Bi-Encoder](#bi-encoder)\n        - [Symmetric Semantic Search BE](#symmetric-semantic-search-be)\n        - [Asymmetric Semantic Search BE](#asymmetric-semantic-search-be)\n    - [Cross-Encoder](#cross-encoder)\n        - [Asymmetric Semantic Search CE](#asymmetric-semantic-search-ce)\n        - [Symmetric Semantic Search CE](#symmetric-semantic-search-ce)\n- [Use SGPT with Sentence Transformers](#use-sgpt-with-sentence-transformers)\n    - [Bi-Encoder ST](#bi-encoder-st)\n        - [Symmetric Semantic Search BE ST](#symmetric-semantic-search-be-st)\n        - [Asymmetric Semantic Search BE ST](#asymmetric-semantic-search-be-st)\n            - [SGPT Sentence Transformers](#sgpt-sentence-transformers)\n            - [Original Sentence Transformers](#original-sentence-transformers)\n- [Acknowledgements](#acknowledgements)\n- [Citation](#citation)\n\n## Overview\n\nWe present SGPT-BE and SGPT-CE for applying GPT models as Bi-Encoders or Cross-Encoders to symmetric or asymmetric search. SGPT-BE produces semantically meaningful sentence embeddings by contrastive fine-tuning of only bias tensors and position-weighted mean pooling. SGPT-CE uses log probabilities from GPT models without any fine-tuning. 
An illustration of the methods follows.\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FMuennighoff_sgpt_readme_78c7ffc323c4.png)\n\nFeel free to open an issue should you have any questions~\n\n## Structure\n\n```bash\n.\n├── biencoder  # Training & Inference of Bi-Encoders\n│   ├── beir\n│   │   ├── custommodels # Directory providing BEIR compatibility for asymmetric models & models with special tokens\n│   │   │   └── ...\n│   │   ├── io_utils # Exclusively used for beir_openai_embeddings_batched_parallel.py\n│   │   │   └── ...\n│   │   ├── parallelizer # Exclusively used for beir_openai_embeddings_batched_parallel.py\n│   │   │   └── ...\n│   │   ├── beir_dense_retriever.py\n│   │   ├── beir_openai_embeddings_batched_parallel.py\n│   │   ├── requirements.txt\n│   │   ├── *.bash # Bash scripts to run multiple experiments\n│   │   └── README.md\n│   ├── nli_msmarco\n│   │   ├── sentence-transformers # An adapted version of sentence-transformers - Install this version for all biencoder experiments\n│   │   │   └── ...\n│   │   └── README.md\n│   └── useb\n│       ├── useb\n│       │   └── ...\n│       ├── *.bash # Bash scripts to run multiple experiments\n│       ├── useb_dense_retriever.py\n│       └── README.md\n├── crossencoder  # Inference of Cross-Encoders\n│   └── beir\n│       ├── *.ipynb # Notebooks explained in the README\n│       └── README.md\n├── other\n│   ├── sgpt_graphic.png\n│   └── sgpt_utils.ipynb # Code for creating the graphs in the paper & other\n├── requirements.txt\n└── README.md\n```\n\nEach data sub-directory provides its own README with an overview of its **Structure**, **Downloads** (Datasets, Models) & **Commands** used to produce the datasets, models & other things. Generally, you can find all models at https:\u002F\u002Fhuggingface.co\u002FMuennighoff and json results in various datasets at https:\u002F\u002Fwww.kaggle.com\u002Fmuennighoff\u002Fdatasets. Model names are explained in their Huggingface READMEs. Dataset names are explained in the sub-folders of this repository.\n\n\n## Use SGPT with Huggingface\n\nBelow we provide python examples to use the pre-trained models for your own semantic search use case.\nWe highly recommend replacing the model names with larger models, e.g. 
`Muennighoff\u002FSGPT-5.8B-weightedmean-nli-bitfit` for biencoder\u002Fsymmetric.\n\n### Bi-Encoder\n\n#### Symmetric Semantic Search BE\n\n```python\nimport torch\nfrom transformers import AutoModel, AutoTokenizer\nfrom scipy.spatial.distance import cosine\n\n# Get our models - The package will take care of downloading the models automatically\n# For best performance: Muennighoff\u002FSGPT-5.8B-weightedmean-nli-bitfit\ntokenizer = AutoTokenizer.from_pretrained(\"Muennighoff\u002FSGPT-125M-weightedmean-nli-bitfit\")\nmodel = AutoModel.from_pretrained(\"Muennighoff\u002FSGPT-125M-weightedmean-nli-bitfit\")\n# Deactivate Dropout (There is no dropout in the above models so it makes no difference here but other SGPT models may have dropout)\nmodel.eval()\n\n# Tokenize input texts\ntexts = [\n    \"deep learning\",\n    \"artificial intelligence\",\n    \"deep diving\",\n    \"artificial snow\",\n]\nbatch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors=\"pt\")\n\n# Get the embeddings\nwith torch.no_grad():\n    # Get hidden state of shape [bs, seq_len, hid_dim]\n    last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state\n\n# Get weights of shape [bs, seq_len, hid_dim]\nweights = (\n    torch.arange(start=1, end=last_hidden_state.shape[1] + 1)\n    .unsqueeze(0)\n    .unsqueeze(-1)\n    .expand(last_hidden_state.size())\n    .float().to(last_hidden_state.device)\n)\n\n# Get attn mask of shape [bs, seq_len, hid_dim]\ninput_mask_expanded = (\n    batch_tokens[\"attention_mask\"]\n    .unsqueeze(-1)\n    .expand(last_hidden_state.size())\n    .float()\n)\n\n# Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim\nsum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)\nsum_mask = torch.sum(input_mask_expanded * weights, dim=1)\n\nembeddings = sum_embeddings \u002F sum_mask\n\n# Calculate cosine similarities\n# Cosine similarities are in [-1, 1]. Higher means more similar\ncosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])\ncosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])\ncosine_sim_0_3 = 1 - cosine(embeddings[0], embeddings[3])\n\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (texts[0], texts[1], cosine_sim_0_1))\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (texts[0], texts[2], cosine_sim_0_2))\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (texts[0], texts[3], cosine_sim_0_3))\n```\n\n#### Asymmetric Semantic Search BE\n\n```python\nimport torch\nfrom transformers import AutoModel, AutoTokenizer\nfrom scipy.spatial.distance import cosine\n\n# Get our models - The package will take care of downloading the models automatically\n# For best performance: Muennighoff\u002FSGPT-5.8B-weightedmean-msmarco-specb-bitfit\ntokenizer = AutoTokenizer.from_pretrained(\"Muennighoff\u002FSGPT-125M-weightedmean-msmarco-specb-bitfit\")\nmodel = AutoModel.from_pretrained(\"Muennighoff\u002FSGPT-125M-weightedmean-msmarco-specb-bitfit\")\n# Deactivate Dropout (There is no dropout in the above models so it makes no difference here but other SGPT models may have dropout)\nmodel.eval()\n\nqueries = [\n    \"I'm searching for a planet not too far from Earth.\",\n]\n\ndocs = [\n    \"Neptune is the eighth and farthest-known Solar planet from the Sun. In the Solar System, it is the fourth-largest planet by diameter, the third-most-massive planet, and the densest giant planet. 
It is 17 times the mass of Earth, slightly more massive than its near-twin Uranus.\",\n    \"TRAPPIST-1d, also designated as 2MASS J23062928-0502285 d, is a small exoplanet (about 30% the mass of the earth), which orbits on the inner edge of the habitable zone of the ultracool dwarf star TRAPPIST-1 approximately 40 light-years (12.1 parsecs, or nearly 3.7336×1014 km) away from Earth in the constellation of Aquarius.\",\n    \"A harsh desert world orbiting twin suns in the galaxy’s Outer Rim, Tatooine is a lawless place ruled by Hutt gangsters. Many settlers scratch out a living on moisture farms, while spaceport cities such as Mos Eisley and Mos Espa serve as home base for smugglers, criminals, and other rogues.\",\n]\n\nSPECB_QUE_BOS = tokenizer.encode(\"[\", add_special_tokens=False)[0]\nSPECB_QUE_EOS = tokenizer.encode(\"]\", add_special_tokens=False)[0]\n\nSPECB_DOC_BOS = tokenizer.encode(\"{\", add_special_tokens=False)[0]\nSPECB_DOC_EOS = tokenizer.encode(\"}\", add_special_tokens=False)[0]\n\n\ndef tokenize_with_specb(texts, is_query):\n    # Tokenize without padding\n    batch_tokens = tokenizer(texts, padding=False, truncation=True)   \n    # Add special brackets & pay attention to them\n    for seq, att in zip(batch_tokens[\"input_ids\"], batch_tokens[\"attention_mask\"]):\n        if is_query:\n            seq.insert(0, SPECB_QUE_BOS)\n            seq.append(SPECB_QUE_EOS)\n        else:\n            seq.insert(0, SPECB_DOC_BOS)\n            seq.append(SPECB_DOC_EOS)\n        att.insert(0, 1)\n        att.append(1)\n    # Add padding\n    batch_tokens = tokenizer.pad(batch_tokens, padding=True, return_tensors=\"pt\")\n    return batch_tokens\n\ndef get_weightedmean_embedding(batch_tokens, model):\n    # Get the embeddings\n    with torch.no_grad():\n        # Get hidden state of shape [bs, seq_len, hid_dim]\n        last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state\n\n    # Get weights of shape [bs, seq_len, hid_dim]\n    weights = (\n        torch.arange(start=1, end=last_hidden_state.shape[1] + 1)\n        .unsqueeze(0)\n        .unsqueeze(-1)\n        .expand(last_hidden_state.size())\n        .float().to(last_hidden_state.device)\n    )\n\n    # Get attn mask of shape [bs, seq_len, hid_dim]\n    input_mask_expanded = (\n        batch_tokens[\"attention_mask\"]\n        .unsqueeze(-1)\n        .expand(last_hidden_state.size())\n        .float()\n    )\n\n    # Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim\n    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)\n    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)\n\n    embeddings = sum_embeddings \u002F sum_mask\n\n    return embeddings\n\n\nquery_embeddings = get_weightedmean_embedding(tokenize_with_specb(queries, is_query=True), model)\ndoc_embeddings = get_weightedmean_embedding(tokenize_with_specb(docs, is_query=False), model)\n\n# Calculate cosine similarities\n# Cosine similarities are in [-1, 1]. 
Higher means more similar\ncosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])\ncosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])\ncosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])\n\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (queries[0], docs[0][:20] + \"...\", cosine_sim_0_1))\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (queries[0], docs[1][:20] + \"...\", cosine_sim_0_2))\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (queries[0], docs[2][:20] + \"...\", cosine_sim_0_3))\n```\n\n### Cross-Encoder\n\n#### Asymmetric Semantic Search CE\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom scipy.spatial.distance import cosine\n\n# Get models - The package will take care of downloading the models automatically\n# For best performance: EleutherAI\u002Fgpt-j-6B\ntokenizer = AutoTokenizer.from_pretrained(\"EleutherAI\u002Fgpt-neo-125M\")\nmodel = AutoModelForCausalLM.from_pretrained(\"EleutherAI\u002Fgpt-neo-125M\")\n# Deactivate Dropout (There is no dropout in the above models so it makes no difference here but other SGPT models may have dropout)\nmodel.eval()\n\nprompt = 'Documents are searched to find matches with the same content.\\nThe document \"{}\" is a good search result for \"'\n\nqueries = [\n    \"I'm searching for a planet not too far from Earth.\",\n]\n\ndocs = [\n    \"Neptune is the eighth and farthest-known Solar planet from the Sun. In the Solar System, it is the fourth-largest planet by diameter, the third-most-massive planet, and the densest giant planet. It is 17 times the mass of Earth, slightly more massive than its near-twin Uranus.\",\n    \"TRAPPIST-1d, also designated as 2MASS J23062928-0502285 d, is a small exoplanet (about 30% the mass of the earth), which orbits on the inner edge of the habitable zone of the ultracool dwarf star TRAPPIST-1 approximately 40 light-years (12.1 parsecs, or nearly 3.7336×1014 km) away from Earth in the constellation of Aquarius.\",\n    \"A harsh desert world orbiting twin suns in the galaxy’s Outer Rim, Tatooine is a lawless place ruled by Hutt gangsters. 
Many settlers scratch out a living on moisture farms, while spaceport cities such as Mos Eisley and Mos Espa serve as home base for smugglers, criminals, and other rogues.\",\n]\n\nfor query in queries:\n    print(f\"Query: {query}\")\n    for doc in docs:\n        context = prompt.format(doc)\n\n        context_enc = tokenizer.encode(context, add_special_tokens=False)\n        continuation_enc = tokenizer.encode(query, add_special_tokens=False)\n        # Slice off the last token, as we take its probability from the one before\n        model_input = torch.tensor(context_enc+continuation_enc[:-1])\n        continuation_len = len(continuation_enc)\n        input_len, = model_input.shape\n\n        # [seq_len] -> [seq_len, vocab]\n        logprobs = torch.nn.functional.log_softmax(model(model_input)[0], dim=-1).cpu()\n        # [seq_len, vocab] -> [continuation_len, vocab]\n        logprobs = logprobs[input_len-continuation_len:]\n        # Gather the log probabilities of the continuation tokens -> [continuation_len]\n        logprobs = torch.gather(logprobs, 1, torch.tensor(continuation_enc).unsqueeze(-1)).squeeze(-1)\n        score = torch.sum(logprobs)\n        # The higher (closer to 0), the more similar\n        print(f\"Document: {doc[:20] + '...'} Score: {score}\")\n```\n\n#### Symmetric Semantic Search CE\n\nYou can use the same code as in the above [CE-Asym section](#asymmetric-semantic-search-1) but change the prompt. Feel free to share prompts that work well :)\n\n## Use SGPT with Sentence Transformers\n\n### Bi-Encoder ST\n\n#### Symmetric Semantic Search BE ST\n\nSymmetric models are now 100% compatible with the latest [sentence-transformers](https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers) via `pip install git+https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers.git`. You should get the same results as in [the HuggingFace script above.](#symmetric-semantic-search-be)\n\n```python\nfrom scipy.spatial.distance import cosine\nfrom sentence_transformers import SentenceTransformer\n\ntexts = [\n    \"deep learning\",\n    \"artificial intelligence\",\n    \"deep diving\",\n    \"artificial snow\",\n]\n\nmodel = SentenceTransformer(\"Muennighoff\u002FSGPT-125M-weightedmean-nli-bitfit\")\nembeddings = model.encode(texts)\n\ncosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])\ncosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])\ncosine_sim_0_3 = 1 - cosine(embeddings[0], embeddings[3])\n\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (texts[0], texts[1], cosine_sim_0_1))\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (texts[0], texts[2], cosine_sim_0_2))\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (texts[0], texts[3], cosine_sim_0_3))\n```\n\n#### Asymmetric Semantic Search BE ST\n\n##### SGPT Sentence Transformers\n\nInstall: `pip install --upgrade git+https:\u002F\u002Fgithub.com\u002FMuennighoff\u002Fsentence-transformers.git@sgpt_poolings_specb`\nUse the below, which produces the exact same scores as the [HuggingFace solution above.](#asymmetric-semantic-search-be)\n\n```python\nfrom scipy.spatial.distance import cosine\nfrom sentence_transformers import SentenceTransformer\n\nqueries = [\n    \"I'm searching for a planet not too far from Earth.\",\n]\n\ndocs = [\n    \"Neptune is the eighth and farthest-known Solar planet from the Sun. 
In the Solar System, it is the fourth-largest planet by diameter, the third-most-massive planet, and the densest giant planet. It is 17 times the mass of Earth, slightly more massive than its near-twin Uranus.\",\n    \"TRAPPIST-1d, also designated as 2MASS J23062928-0502285 d, is a small exoplanet (about 30% the mass of the earth), which orbits on the inner edge of the habitable zone of the ultracool dwarf star TRAPPIST-1 approximately 40 light-years (12.1 parsecs, or nearly 3.7336×1014 km) away from Earth in the constellation of Aquarius.\",\n    \"A harsh desert world orbiting twin suns in the galaxy’s Outer Rim, Tatooine is a lawless place ruled by Hutt gangsters. Many settlers scratch out a living on moisture farms, while spaceport cities such as Mos Eisley and Mos Espa serve as home base for smugglers, criminals, and other rogues.\",\n]\n\nclass SentenceTransformerSpecb(SentenceTransformer):\n    # Requires:\n    # pip install git+https:\u002F\u002Fgithub.com\u002FMuennighoff\u002Fsentence-transformers.git@sgpt_poolings_specb\n    def __init__(self, *args, **kwargs):\n        super().__init__(*args, **kwargs)\n        tokens = [\"[SOS]\", \"{SOS}\"]\n        self._first_module().tokenizer.add_tokens(tokens, special_tokens=True)\n        self._first_module().auto_model.resize_token_embeddings(len(self._first_module().tokenizer))\n        # Will be replaced with the rep tokens in the model ones\n        # The problem is we don't know if a text is query or document when tokenizing in the Transformer.py module, \n        # so we use the SOS tokens as an identifier if we have a query or document at hand & then replace them\n        # If we would directly use the brackets here, they may become part of another token\n        self._first_module().bos_spec_token_q = self._first_module().tokenizer.encode(\"[SOS]\", add_special_tokens=False)[0]\n        self._first_module().bos_spec_token_d = self._first_module().tokenizer.encode(\"{SOS}\", add_special_tokens=False)[0]\n        self._first_module().bos_spec_token_q_rep = self._first_module().tokenizer.encode(\"[\", add_special_tokens=False)[0]\n        self._first_module().eos_spec_token_q = self._first_module().tokenizer.encode(\"]\", add_special_tokens=False)[0]\n        self._first_module().bos_spec_token_d_rep = self._first_module().tokenizer.encode(\"{\", add_special_tokens=False)[0]\n        self._first_module().eos_spec_token_d = self._first_module().tokenizer.encode(\"}\", add_special_tokens=False)[0]\n        self._first_module().replace_bos = True\n\n    def encode(self, sentences, **kwargs):\n        is_query = kwargs.pop(\"is_query\", True)\n        if is_query:\n            sentences = \"[SOS]\" + sentences if isinstance(sentences, str) else [\"[SOS]\" + sent for sent in sentences]\n        else:\n            sentences = \"{SOS}\" + sentences if isinstance(sentences, str) else [\"{SOS}\" + sent for sent in sentences]    \n        return super().encode(sentences, **kwargs)\n        \nmodel = SentenceTransformerSpecb(\"Muennighoff\u002FSGPT-125M-weightedmean-msmarco-specb-bitfit\")\n\nquery_embeddings = model.encode(queries, is_query=True)\ndoc_embeddings = model.encode(docs, is_query=False)\n\n# Calculate cosine similarities\n# Cosine similarities are in [-1, 1]. 
Higher means more similar\ncosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])\ncosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])\ncosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])\n\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (queries[0], docs[0][:20] + \"...\", cosine_sim_0_1))\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (queries[0], docs[1][:20] + \"...\", cosine_sim_0_2))\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (queries[0], docs[2][:20] + \"...\", cosine_sim_0_3))\n```\n\n##### Original Sentence Transformers\n\nIf you want to use the Sentence Transformers at `https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers`, you can use the below. Make sure to use the latest version (`pip install --upgrade git+https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers.git`).\nNote that this will produce slightly worse scores than [SGPT Sentence Transformers](#sgpt-sentence-transformers), as the special brackets may get intermingled with other tokens upon tokenization. On SciFact (BEIR) NDCG@10 of the below decreases to 0.566 from 0.569 for `SGPT-125M-weightedmean-msmarco-specb-bitfit`.\n\n```python\nfrom scipy.spatial.distance import cosine\nfrom sentence_transformers import SentenceTransformer\n\nqueries = [\n    \"I'm searching for a planet not too far from Earth.\",\n]\n\ndocs = [\n    \"Neptune is the eighth and farthest-known Solar planet from the Sun. In the Solar System, it is the fourth-largest planet by diameter, the third-most-massive planet, and the densest giant planet. It is 17 times the mass of Earth, slightly more massive than its near-twin Uranus.\",\n    \"TRAPPIST-1d, also designated as 2MASS J23062928-0502285 d, is a small exoplanet (about 30% the mass of the earth), which orbits on the inner edge of the habitable zone of the ultracool dwarf star TRAPPIST-1 approximately 40 light-years (12.1 parsecs, or nearly 3.7336×1014 km) away from Earth in the constellation of Aquarius.\",\n    \"A harsh desert world orbiting twin suns in the galaxy’s Outer Rim, Tatooine is a lawless place ruled by Hutt gangsters. Many settlers scratch out a living on moisture farms, while spaceport cities such as Mos Eisley and Mos Espa serve as home base for smugglers, criminals, and other rogues.\",\n]\n\nclass SentenceTransformerSpecb(SentenceTransformer):\n    def encode(self, sentences, **kwargs):\n        is_query = kwargs.pop(\"is_query\", True)\n        if is_query:\n            sentences = \"[\" + sentences + \"]\" if isinstance(sentences, str) else [\"[\" + sent + \"]\" for sent in sentences]\n        else:\n            sentences = \"{\" + sentences + \"}\" if isinstance(sentences, str) else [\"{\" + sent + \"}\" for sent in sentences]    \n        return super().encode(sentences, **kwargs)\n        \nmodel = SentenceTransformerSpecb(\"Muennighoff\u002FSGPT-125M-weightedmean-msmarco-specb-bitfit\")\n\nquery_embeddings = model.encode(queries, is_query=True)\ndoc_embeddings = model.encode(docs, is_query=False)\n\n# Calculate cosine similarities\n# Cosine similarities are in [-1, 1]. 
Higher means more similar\ncosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])\ncosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])\ncosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])\n\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (queries[0], docs[0][:20] + \"...\", cosine_sim_0_1))\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (queries[0], docs[1][:20] + \"...\", cosine_sim_0_2))\nprint(\"Cosine similarity between \\\"%s\\\" and \\\"%s\\\" is: %.3f\" % (queries[0], docs[2][:20] + \"...\", cosine_sim_0_3))\n```\n\n## Acknowledgements\n\nWe thank Constantin Eichenberg and Samuel Weinbach for insightful discussions and valuable feedback throughout the project. We thank Robert Baldock, Marco Bellagente and Koen Oostermeijer for reading drafts of the paper. This work has been supported by OpenAI under the academic access program. \nThis work would not have been possible without:\n- UKPLab: [SBERT](https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers), [BEIR](https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fbeir), [USEB](https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fuseb)\n- [Eleuther AI Models](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox)\n- [Huggingface Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)\n\n## Citation\n\nFeel free to cite our paper if SGPT is helpful to you :) \n\n```bibtex\n@article{muennighoff2022sgpt,\n  title={SGPT: GPT Sentence Embeddings for Semantic Search},\n  author={Muennighoff, Niklas},\n  journal={arXiv preprint arXiv:2202.08904},\n  year={2022}\n}\n```\n","# SGPT：用于语义搜索的GPT句子嵌入\n\n本仓库包含论文《SGPT：用于语义搜索的GPT句子嵌入》（https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.08904）的相关代码、结果及预训练模型。\n\n**************************** 更新 *****************************\n\n* 2024-02：我们发布了[GRIT & GritLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.09906)——这些模型将SGPT双编码器、交叉编码器、对称式、非对称式以及常规GPT（即生成式）统一到一个单一模型中，并且在各项指标上都表现得更好。我们建议切换到这些新模型 :)\n* 2022-09：SGPT双编码器现在可以轻松地与[Sentence Transformers](https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers)一起使用，详见[新脚本](#use-sgpt-with-sentence-transformers)\n* 2022-08：多语言BLOOM SGPT模型发布：[非对称式，71亿参数](https:\u002F\u002Fhuggingface.co\u002Fbigscience\u002Fsgpt-bloom-7b1-msmarco)及[对称式，17亿参数](https:\u002F\u002Fhuggingface.co\u002Fbigscience-data\u002Fsgpt-bloom-1b7-nli)。如果您需要其他模型，请随时提交issue。\n* 2022-06：OpenAI发布了其Search Endpoint的工作机制，我们在[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.08904)中将其与SGPT交叉编码器进行了对比。我们的方法非常相似。您可以尝试使用`crossencoder\u002Fbeir\u002Fopenai_search_endpoint_functionality.py`中所示的提示！\n* 2022-03：58亿参数的双编码器模型在USEB和BEIR上的表现分别提升了4%和1%。[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.08904)及[Hugging Face上的模型](https:\u002F\u002Fhuggingface.co\u002Fmodels?search=sgpt-5.8b)均已更新。这是通过使用GradCache增大批量大小实现的，更多信息请参阅论文。如果您之前下载过这些模型，我们建议用新版本替换。\n* 2022-02：我们发布了[我们的论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.08904)。快来看看吧！ :)\n\n## 快速链接\n\n- [概述](#overview)\n- [结构](#structure)\n- [使用Hugging Face中的SGPT](#use-sgpt-with-huggingface)\n    - [双编码器](#bi-encoder)\n        - [对称式语义搜索BE](#symmetric-semantic-search-be)\n        - [非对称式语义搜索BE](#asymmetric-semantic-search-be)\n    - [交叉编码器](#cross-encoder)\n        - [非对称式语义搜索CE](#asymmetric-semantic-search-ce)\n        - [对称式语义搜索CE](#symmetric-semantic-search-ce)\n- [使用Sentence Transformers中的SGPT](#use-sgpt-with-sentence-transformers)\n    - [双编码器ST](#bi-encoder-st)\n        - [对称式语义搜索BE ST](#symmetric-semantic-search-be-st)\n        - 
[非对称式语义搜索BE ST](#asymmetric-semantic-search-be-st)\n            - [SGPT Sentence Transformers](#sgpt-sentence-transformers)\n            - [原始Sentence Transformers](#original-sentence-transformers)\n- [致谢](#acknowledgements)\n- [引用](#citation)\n\n## 概述\n\n我们提出了SGPT-BE和SGPT-CE，用于将GPT模型作为双编码器或交叉编码器应用于对称或非对称搜索。SGPT-BE通过仅对偏置张量进行对比微调以及位置加权平均池化，生成具有语义意义的句子嵌入。SGPT-CE则直接使用GPT模型的对数概率，无需任何微调。以下是这些方法的示意图。\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FMuennighoff_sgpt_readme_78c7ffc323c4.png)\n\n如果您有任何问题，欢迎随时提交issue~\n\n## 结构\n\n```bash\n.\n├── biencoder  # 双编码器的训练与推理\n│   ├── beir\n│   │   ├── custommodels # 提供非对称模型及带有特殊标记模型的BEIR兼容性的目录\n│   │   │   └── ...\n│   │   ├── io_utils # 专用于beir_openai_embeddings_batched_parallel.py\n│   │   │   └── ...\n│   │   ├── parallelizer # 专用于beir_openai_embeddings_batched_parallel.py\n│   │   │   └── ...\n│   │   ├── beir_dense_retriever.py\n│   │   ├── beir_openai_embeddings_batched_parallel.py\n│   │   ├── requirements.txt\n│   │   ├── *.bash # 用于运行多个实验的Bash脚本\n│   │   └── README.md\n│   ├── nli_msmarco\n│   │   ├── sentence-transformers # Sentence Transformers的改编版本——所有双编码器实验请安装此版本\n│   │   │   └── ...\n│   │   └── README.md\n│   └── useb\n│       ├── useb\n│       │   └── ...\n│       ├── *.bash # 用于运行多个实验的Bash脚本\n│       ├── useb_dense_retriever.py\n│       └── README.md\n├── crossencoder  # 交叉编码器的推理\n│   └── beir\n│       ├── *.ipynb # 在README中解释的Notebook文件\n│       └── README.md\n├── other\n│   ├── sgpt_graphic.png\n│   └── sgpt_utils.ipynb # 用于生成论文中图表及其他内容的代码\n├── requirements.txt\n└── README.md\n```\n\n每个数据子目录都提供自己的README，其中包含其**结构**、**下载**（数据集、模型）以及用于生成数据集、模型等的**命令**的概述。通常，您可以在https:\u002F\u002Fhuggingface.co\u002FMuennighoff找到所有模型，并在https:\u002F\u002Fwww.kaggle.com\u002Fmuennighoff\u002Fdatasets中找到各种数据集的JSON结果。模型名称在其Hugging Face的README中有所说明。数据集名称则在本仓库的子文件夹中有所说明。\n\n\n## 使用Hugging Face中的SGPT\n\n下面我们将提供Python示例，以帮助您在自己的语义搜索场景中使用这些预训练模型。\n我们强烈建议您将模型名称替换为更大规模的模型，例如对于双编码器\u002F对称式搜索，可使用`Muennighoff\u002FSGPT-5.8B-weightedmean-nli-bitfit`。\n\n### 双编码器\n\n#### 对称式语义搜索BE\n\n```python\nimport torch\nfrom transformers import AutoModel, AutoTokenizer\nfrom scipy.spatial.distance import cosine\n\n# 获取我们的模型——该包会自动下载模型\n# 为了获得最佳性能：Muennighoff\u002FSGPT-5.8B-weightedmean-nli-bitfit\ntokenizer = AutoTokenizer.from_pretrained(\"Muennighoff\u002FSGPT-125M-weightedmean-nli-bitfit\")\nmodel = AutoModel.from_pretrained(\"Muennighoff\u002FSGPT-125M-weightedmean-nli-bitfit\")\n# 关闭Dropout（上述模型中没有Dropout，因此这里并无影响；但其他SGPT模型可能有Dropout）\nmodel.eval()\n\n# 对输入文本进行分词\ntexts = [\n    \"深度学习\",\n    \"人工智能\",\n    \"深潜\",\n    \"人造雪\",\n]\nbatch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors=\"pt\")\n\n# 获取嵌入\nwith torch.no_grad():\n    # 获取形状为[bs, seq_len, hid_dim]的隐藏状态\n    last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state\n\n# 获取权重，形状为[bs, seq_len, hid_dim]\nweights = (\n    torch.arange(start=1, end=last_hidden_state.shape[1] + 1)\n    .unsqueeze(0)\n    .unsqueeze(-1)\n    .expand(last_hidden_state.size())\n    .float().to(last_hidden_state.device)\n)\n\n# 获取注意力掩码，形状为[bs, seq_len, hid_dim]\ninput_mask_expanded = (\n    batch_tokens[\"attention_mask\"]\n    .unsqueeze(-1)\n    .expand(last_hidden_state.size())\n    .float()\n)\n\n# 在 seq_len 维度上执行加权平均池化：bs, seq_len, hidden_dim -> bs, hidden_dim\nsum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)\nsum_mask = torch.sum(input_mask_expanded * weights, dim=1)\n\nembeddings = sum_embeddings \u002F sum_mask\n\n# 
计算余弦相似度\n# 余弦相似度的取值范围为 [-1, 1]。值越高，表示越相似\ncosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])\ncosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])\ncosine_sim_0_3 = 1 - cosine(embeddings[0], embeddings[3])\n\nprint(\"“%s”与“%s”的余弦相似度为：%.3f\" % (texts[0], texts[1], cosine_sim_0_1))\nprint(\"“%s”与“%s”的余弦相似度为：%.3f\" % (texts[0], texts[2], cosine_sim_0_2))\nprint(\"“%s”与“%s”的余弦相似度为：%.3f\" % (texts[0], texts[3], cosine_sim_0_3))\n```\n\n#### 非对称语义搜索 BE\n\n```python\nimport torch\nfrom transformers import AutoModel, AutoTokenizer\nfrom scipy.spatial.distance import cosine\n\n# 获取我们的模型——该包会自动下载模型\n# 为了获得最佳性能：Muennighoff\u002FSGPT-5.8B-weightedmean-msmarco-specb-bitfit\ntokenizer = AutoTokenizer.from_pretrained(\"Muennighoff\u002FSGPT-125M-weightedmean-msmarco-specb-bitfit\")\nmodel = AutoModel.from_pretrained(\"Muennighoff\u002FSGPT-125M-weightedmean-msmarco-specb-bitfit\")\n# 关闭 Dropout（上述模型中没有 Dropout，因此这里没有影响，但其他 SGPT 模型可能有 Dropout）\nmodel.eval()\n\nqueries = [\n    \"我在寻找一颗离地球不太远的行星。\",\n]\n\ndocs = [\n    \"海王星是太阳系中距离太阳第八颗也是最远的已知行星。在太阳系中，它是按直径计算的第四大行星，按质量计算的第三大行星，也是密度最大的巨行星。它的质量是地球的17倍，略大于与其几乎相同的天王星。\",\n    \"TRAPPIST-1d，也称为 2MASS J23062928-0502285 d，是一颗小型系外行星（质量约为地球的30%），它围绕着超冷矮星 TRAPPIST-1 的宜居带内缘运行，距离地球约40光年（12.1秒差距，或近3.7336×10^14公里），位于宝瓶座内。\",\n    \"塔图因星球是一个位于银河系外环、环绕双子恒星的严酷沙漠世界，这里法律不彰，由赫特族黑帮统治。许多定居者靠水分农场勉强维持生计，而莫斯艾斯利和莫斯埃斯帕等太空港城市则是走私犯、罪犯和其他亡命之徒的大本营。\",\n]\n\nSPECB_QUE_BOS = tokenizer.encode(\"[\", add_special_tokens=False)[0]\nSPECB_QUE_EOS = tokenizer.encode(\"]\", add_special_tokens=False)[0]\n\nSPECB_DOC_BOS = tokenizer.encode(\"{\", add_special_tokens=False)[0]\nSPECB_DOC_EOS = tokenizer.encode(\"}\", add_special_tokens=False)[0]\n\n\ndef tokenize_with_specb(texts, is_query):\n    # 不进行填充的分词\n    batch_tokens = tokenizer(texts, padding=False, truncation=True)   \n    # 添加特殊括号并注意它们\n    for seq, att in zip(batch_tokens[\"input_ids\"], batch_tokens[\"attention_mask\"]):\n        if is_query:\n            seq.insert(0, SPECB_QUE_BOS)\n            seq.append(SPECB_QUE_EOS)\n        else:\n            seq.insert(0, SPECB_DOC_BOS)\n            seq.append(SPECB_DOC_EOS)\n        att.insert(0, 1)\n        att.append(1)\n    # 添加填充\n    batch_tokens = tokenizer.pad(batch_tokens, padding=True, return_tensors=\"pt\")\n    return batch_tokens\n\ndef get_weightedmean_embedding(batch_tokens, model):\n    # 获取嵌入\n    with torch.no_grad():\n        # 获取形状为 [bs, seq_len, hid_dim] 的隐藏状态\n        last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state\n\n    # 获取形状为 [bs, seq_len, hid_dim] 的权重\n    weights = (\n        torch.arange(start=1, end=last_hidden_state.shape[1] + 1)\n        .unsqueeze(0)\n        .unsqueeze(-1)\n        .expand(last_hidden_state.size())\n        .float().to(last_hidden_state.device)\n    )\n\n    # 获取形状为 [bs, seq_len, hid_dim] 的注意力掩码\n    input_mask_expanded = (\n        batch_tokens[\"attention_mask\"]\n        .unsqueeze(-1)\n        .expand(last_hidden_state.size())\n        .float()\n    )\n\n    # 在 seq_len 维度上执行加权平均池化：bs, seq_len, hidden_dim -> bs, hidden_dim\n    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)\n    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)\n\n    embeddings = sum_embeddings \u002F sum_mask\n\n    return embeddings\n\n\nquery_embeddings = get_weightedmean_embedding(tokenize_with_specb(queries, is_query=True), model)\ndoc_embeddings = get_weightedmean_embedding(tokenize_with_specb(docs, is_query=False), model)\n\n# 计算余弦相似度\n# 余弦相似度的取值范围为 [-1, 
1]。值越高，表示越相似\ncosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])\ncosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])\ncosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])\n\nprint(\"“%s”与“%s”的余弦相似度为：%.3f\" % (queries[0], docs[0][:20] + \"...\", cosine_sim_0_1))\nprint(\"“%s”与“%s”的余弦相似度为：%.3f\" % (queries[0], docs[1][:20] + \"...\", cosine_sim_0_2))\nprint(\"“%s”与“%s”的余弦相似度为：%.3f\" % (queries[0], docs[2][:20] + \"...\", cosine_sim_0_3))\n```\n\n### 交叉编码器\n\n#### 非对称语义搜索 CE\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom scipy.spatial.distance import cosine\n\n# 获取模型——该包会自动下载模型\n# 为了获得最佳性能：EleutherAI\u002Fgpt-j-6B\ntokenizer = AutoTokenizer.from_pretrained(\"EleutherAI\u002Fgpt-neo-125M\")\nmodel = AutoModelForCausalLM.from_pretrained(\"EleutherAI\u002Fgpt-neo-125M\")\n# 禁用 Dropout（上述模型中没有 Dropout，因此这里不会有任何影响，但其他 SGPT 模型可能包含 Dropout）\nmodel.eval()\n\nprompt = '文档会被搜索以找到内容相同的匹配项。\\n文档 \"{}\" 是对 \" 的一个很好的搜索结果'\n\nqueries = [\n    \"我在寻找一颗离地球不太远的行星。\",\n]\n\ndocs = [\n    \"海王星是太阳系中距离太阳第八颗也是最远的已知行星。在太阳系中，它是按直径计算的第四大行星，按质量计算的第三大行星，也是密度最大的巨行星。它的质量是地球的17倍，略大于与其非常相似的天王星。\",\n    \"TRAPPIST-1d，也称为2MASS J23062928-0502285 d，是一颗小型系外行星（质量约为地球的30%），它围绕着超冷矮星TRAPPIST-1的宜居带内缘运行，距离地球约40光年（12.1秒差距，或近3.7336×10^14公里），位于宝瓶座内。\",\n    \"塔图因是一颗位于银河系外环、环绕双子恒星运行的荒凉沙漠星球，这里法纪不存，由赫特黑帮统治。许多定居者靠水分农场勉强维持生计，而莫斯艾斯利和莫斯埃斯帕等太空港城市则是走私犯、罪犯和其他亡命之徒的大本营。\",\n]\n\nfor query in queries:\n    print(f\"查询: {query}\")\n    for doc in docs:\n        context = prompt.format(doc)\n\n        context_enc = tokenizer.encode(context, add_special_tokens=False)\n        continuation_enc = tokenizer.encode(query, add_special_tokens=False)\n        # 去掉最后一个标记，因为我们取的是其前一个标记的概率\n        model_input = torch.tensor(context_enc+continuation_enc[:-1])\n        continuation_len = len(continuation_enc)\n        input_len, = model_input.shape\n\n        # [seq_len] -> [seq_len, vocab]\n        logprobs = torch.nn.functional.log_softmax(model(model_input)[0], dim=-1).cpu()\n        # [seq_len, vocab] -> [continuation_len, vocab]\n        logprobs = logprobs[input_len-continuation_len:]\n        # 收集延续部分标记的对数概率 -> [continuation_len]\n        logprobs = torch.gather(logprobs, 1, torch.tensor(continuation_enc).unsqueeze(-1)).squeeze(-1)\n        score = torch.sum(logprobs)\n        # 数值越高（越接近0），相似度越高\n        print(f\"文档: {doc[:20] + '...'} 分数: {score}\")\n```\n\n#### 对称语义搜索 CE\n\n您可以使用与上述[CE-Asym 部分](#asymmetric-semantic-search-1)相同的代码，只需更改提示词即可。欢迎分享效果良好的提示词 :)\n\n## 将 SGPT 与 Sentence Transformers 结合使用\n\n### 双编码器 ST\n\n#### 对称语义搜索 BE ST\n\n对称模型现在已通过 `pip install git+https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers.git` 与最新的 [sentence-transformers](https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers) 完全兼容。你应该会得到与 [上述 HuggingFace 脚本](#symmetric-semantic-search-be) 中相同的结果。\n\n```python\nfrom scipy.spatial.distance import cosine\nfrom sentence_transformers import SentenceTransformer\n\ntexts = [\n    \"深度学习\",\n    \"人工智能\",\n    \"深潜\",\n    \"人造雪\",\n]\n\nmodel = SentenceTransformer(\"Muennighoff\u002FSGPT-125M-weightedmean-nli-bitfit\")\nembeddings = model.encode(texts)\n\ncosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])\ncosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])\ncosine_sim_0_3 = 1 - cosine(embeddings[0], embeddings[3])\n\nprint(\"“%s”和“%s”的余弦相似度是：%.3f\" % (texts[0], texts[1], cosine_sim_0_1))\nprint(\"“%s”和“%s”的余弦相似度是：%.3f\" % (texts[0], texts[2], cosine_sim_0_2))\nprint(\"“%s”和“%s”的余弦相似度是：%.3f\" % 
(texts[0], texts[3], cosine_sim_0_3))\n```\n\n#### 非对称语义搜索 BE ST\n\n##### SGPT Sentence Transformers\n\n安装：`pip install --upgrade git+https:\u002F\u002Fgithub.com\u002FMuennighoff\u002Fsentence-transformers.git@sgpt_poolings_specb`\n使用以下代码，它将产生与 [上述 HuggingFace 解决方案](#asymmetric-semantic-search-be) 完全相同的分数。\n\n```python\nfrom scipy.spatial.distance import cosine\nfrom sentence_transformers import SentenceTransformer\n\nqueries = [\n    \"我在寻找一颗离地球不太远的行星。\",\n]\n\ndocs = [\n    \"海王星是太阳系中距离太阳第八颗、也是最远的已知行星。在太阳系中，它是按直径计算的第四大行星，按质量计算的第三大行星，也是密度最大的巨行星。它的质量是地球的17倍，略大于与其几乎相同的天王星。\",\n    \"TRAPPIST-1d，也称为2MASS J23062928-0502285 d，是一颗小型系外行星（质量约为地球的30%），它围绕着超冷矮星TRAPPIST-1的宜居带内缘运行，距离地球约40光年（12.1秒差距，或近3.7336×10^14公里），位于宝瓶座内。\",\n    \"塔图因是一颗位于银河系外环、环绕双子恒星运转的严酷沙漠星球，这里法律不存，由赫特族黑帮统治。许多定居者靠水分农场勉强维持生计，而莫斯艾斯利和莫斯埃斯帕等太空港城市则是走私犯、罪犯和其他亡命之徒的大本营。\",\n]\n\nclass SentenceTransformerSpecb(SentenceTransformer):\n    # 需要：\n    # pip install git+https:\u002F\u002Fgithub.com\u002FMuennighoff\u002Fsentence-transformers.git@sgpt_poolings_specb\n    def __init__(self, *args, **kwargs):\n        super().__init__(*args, **kwargs)\n        tokens = [\"[SOS]\", \"{SOS}\"]\n        self._first_module().tokenizer.add_tokens(tokens, special_tokens=True)\n        self._first_module().auto_model.resize_token_embeddings(len(self._first_module().tokenizer))\n        # 将被模型中的表示标记替换\n        # 问题在于我们在Transformer.py模块中进行分词时，并不知道文本是查询还是文档，因此我们使用SOS标记来标识手头的是查询还是文档，然后再将其替换\n        # 如果我们直接在这里使用方括号，它们可能会成为另一个标记的一部分\n        self._first_module().bos_spec_token_q = self._first_module().tokenizer.encode(\"[SOS]\", add_special_tokens=False)[0]\n        self._first_module().bos_spec_token_d = self._first_module().tokenizer.encode(\"{SOS}\", add_special_tokens=False)[0]\n        self._first_module().bos_spec_token_q_rep = self._first_module().tokenizer.encode(\"[\", add_special_tokens=False)[0]\n        self._first_module().eos_spec_token_q = self._first_module().tokenizer.encode(\"]\", add_special_tokens=False)[0]\n        self._first_module().bos_spec_token_d_rep = self._first_module().tokenizer.encode(\"{\", add_special_tokens=False)[0]\n        self._first_module().eos_spec_token_d = self._first_module().tokenizer.encode(\"}\", add_special_tokens=False)[0]\n        self._first_module().replace_bos = True\n\n    def encode(self, sentences, **kwargs):\n        is_query = kwargs.pop(\"is_query\", True)\n        if is_query:\n            sentences = \"[SOS]\" + sentences if isinstance(sentences, str) else [\"[SOS]\" + sent for sent in sentences]\n        else:\n            sentences = \"{SOS}\" + sentences if isinstance(sentences, str) else [\"{SOS}\" + sent for sent in sentences]    \n        return super().encode(sentences, **kwargs)\n        \nmodel = SentenceTransformerSpecb(\"Muennighoff\u002FSGPT-125M-weightedmean-msmarco-specb-bitfit\")\n\nquery_embeddings = model.encode(queries, is_query=True)\ndoc_embeddings = model.encode(docs, is_query=False)\n\n# 计算余弦相似度\n# 余弦相似度的取值范围为[-1, 1]。值越高表示越相似\ncosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])\ncosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])\ncosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])\n\nprint(\"“%s”与“%s”的余弦相似度为：%.3f\" % (queries[0], docs[0][:20] + \"...\", cosine_sim_0_1))\nprint(\"“%s”与“%s”的余弦相似度为：%.3f\" % (queries[0], docs[1][:20] + \"...\", cosine_sim_0_2))\nprint(\"“%s”与“%s”的余弦相似度为：%.3f\" % (queries[0], docs[2][:20] + \"...\", cosine_sim_0_3))\n```\n\n##### 原始 Sentence Transformers\n\n如果你想使用位于 `https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers` 的 
Sentence Transformers，可以按照以下方式操作。请确保使用最新版本（`pip install --upgrade git+https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers.git`）。\n需要注意的是，这种方法产生的分数会略低于 [SGPT Sentence Transformers](#sgpt-sentence-transformers)，因为特殊括号在分词时可能会与其他标记混淆。在 SciFact（BEIR）数据集上，该方法的 NDCG@10 从 `SGPT-125M-weightedmean-msmarco-specb-bitfit` 的 0.569 下降到 0.566。\n\n```python\nfrom scipy.spatial.distance import cosine\nfrom sentence_transformers import SentenceTransformer\n\nqueries = [\n    \"我在寻找一颗离地球不太远的行星。\",\n]\n\ndocs = [\n    \"海王星是太阳系中距离太阳第八颗也是最远的已知行星。在太阳系中，它是按直径计算的第四大行星，按质量计算的第三大行星，也是密度最高的巨行星。它的质量是地球的17倍，略大于与其非常相似的天王星。\",\n    \"TRAPPIST-1d，也称为 2MASS J23062928-0502285 d，是一颗小型系外行星（质量约为地球的30%），它围绕着超冷矮星 TRAPPIST-1 的宜居带内缘运行，距离地球约40光年（12.1秒差距，或近3.7336×10^14公里），位于宝瓶座内。\",\n    \"塔图因是一颗位于银河系外环、环绕双子恒星的严酷沙漠星球，这里法律失效，由赫特族黑帮统治。许多定居者靠水分农场勉强维持生计，而莫斯艾斯利和莫斯埃斯帕等太空港城市则成为走私犯、罪犯和其他亡命之徒的大本营。\",\n]\n\nclass SentenceTransformerSpecb(SentenceTransformer):\n    def encode(self, sentences, **kwargs):\n        is_query = kwargs.pop(\"is_query\", True)\n        if is_query:\n            sentences = \"[\" + sentences + \"]\" if isinstance(sentences, str) else [\"[\" + sent + \"]\" for sent in sentences]\n        else:\n            sentences = \"{\" + sentences + \"}\" if isinstance(sentences, str) else [\"{\" + sent + \"}\" for sent in sentences]    \n        return super().encode(sentences, **kwargs)\n        \nmodel = SentenceTransformerSpecb(\"Muennighoff\u002FSGPT-125M-weightedmean-msmarco-specb-bitfit\")\n\nquery_embeddings = model.encode(queries, is_query=True)\ndoc_embeddings = model.encode(docs, is_query=False)\n\n# 计算余弦相似度\n# 余弦相似度的取值范围为[-1, 1]。值越高表示越相似\ncosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])\ncosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])\ncosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])\n\nprint(\"“%s”与“%s”的余弦相似度为：%.3f\" % (queries[0], docs[0][:20] + \"...\", cosine_sim_0_1))\nprint(\"“%s”与“%s”的余弦相似度为：%.3f\" % (queries[0], docs[1][:20] + \"...\", cosine_sim_0_2))\nprint(\"“%s”与“%s”的余弦相似度为：%.3f\" % (queries[0], docs[2][:20] + \"...\", cosine_sim_0_3))\n```\n\n## 致谢\n\n我们感谢 Constantin Eichenberg 和 Samuel Weinbach 在整个项目过程中提供的富有洞见的讨论和宝贵反馈。同时，我们也感谢 Robert Baldock、Marco Bellagente 和 Koen Oostermeijer 对论文草稿的审阅。本研究得到了 OpenAI 学术访问计划的支持。\n如果没有以下机构和资源，这项工作将无法完成：\n- UKPLab：[SBERT](https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fsentence-transformers)、[BEIR](https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fbeir)、[USEB](https:\u002F\u002Fgithub.com\u002FUKPLab\u002Fuseb)\n- [Eleuther AI Models](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox)\n- [Huggingface Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)\n\n## 引用\n\n如果您觉得 SGPT 对您有所帮助，请随时引用我们的论文 :) \n\n```bibtex\n@article{muennighoff2022sgpt,\n  title={SGPT：用于语义搜索的 GPT 句子嵌入},\n  author={Muennighoff, Niklas},\n  journal={arXiv 预印本 arXiv:2202.08904},\n  year={2022}\n}\n```","# SGPT 快速上手指南\n\nSGPT (Sentence GPT) 是一个利用 GPT 模型生成句子嵌入（Embeddings）以进行语义搜索的开源项目。它支持双编码器（Bi-Encoder）和交叉编码器（Cross-Encoder）模式，适用于对称和非对称语义搜索任务。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux, macOS 或 Windows\n*   **Python**: 3.8 或更高版本\n*   **硬件**: 推荐使用带有 CUDA 支持的 NVIDIA GPU 以获得最佳性能（尤其是使用大参数模型如 5.8B 时）。CPU 也可运行小模型。\n*   **前置依赖**:\n    *   `torch` (PyTorch)\n    *   `transformers` (Hugging Face)\n    *   `scipy` (用于计算余弦相似度)\n    *   `sentence-transformers` (可选，用于更简便的集成)\n\n**国内加速建议**：\n由于模型文件较大且托管于 Hugging Face，国内用户建议在代码中配置镜像源或使用代理，以加快模型下载速度。\n```bash\nexport 
HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n```\n\n## 安装步骤\n\n通过 pip 安装必要的 Python 库：\n\n```bash\npip install torch transformers scipy sentence-transformers\n```\n\n如果需要复现论文中的特定实验或处理特殊标记，可以克隆仓库并安装特定依赖（可选）：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FMuennighoff\u002FSGPT.git\ncd SGPT\npip install -r requirements.txt\n```\n\n## 基本使用\n\nSGPT 主要通过 Hugging Face `transformers` 库加载预训练模型。以下提供两种最常用的场景示例：**对称语义搜索**（通用句子相似度）和**非对称语义搜索**（查询与文档匹配）。\n\n> **注意**：示例中使用的是 125M 参数模型以便快速演示。生产环境中强烈建议替换为更大的模型（如 `Muennighoff\u002FSGPT-5.8B-weightedmean-nli-bitfit`），以获得更好的效果。\n\n### 1. 对称语义搜索 (Symmetric Semantic Search)\n\n适用于判断两个句子是否含义相似（例如：去重、聚类）。\n\n```python\nimport torch\nfrom transformers import AutoModel, AutoTokenizer\nfrom scipy.spatial.distance import cosine\n\n# 加载模型和分词器\n# 推荐大模型: Muennighoff\u002FSGPT-5.8B-weightedmean-nli-bitfit\nmodel_name = \"Muennighoff\u002FSGPT-125M-weightedmean-nli-bitfit\"\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModel.from_pretrained(model_name)\nmodel.eval()\n\n# 输入文本\ntexts = [\n    \"deep learning\",\n    \"artificial intelligence\",\n    \"deep diving\",\n    \"artificial snow\",\n]\n\n# 分词\nbatch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors=\"pt\")\n\n# 获取嵌入向量\nwith torch.no_grad():\n    last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state\n\n# SGPT 核心：位置加权平均池化 (Position-weighted mean pooling)\nweights = (\n    torch.arange(start=1, end=last_hidden_state.shape[1] + 1)\n    .unsqueeze(0)\n    .unsqueeze(-1)\n    .expand(last_hidden_state.size())\n    .float().to(last_hidden_state.device)\n)\n\ninput_mask_expanded = (\n    batch_tokens[\"attention_mask\"]\n    .unsqueeze(-1)\n    .expand(last_hidden_state.size())\n    .float()\n)\n\nsum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)\nsum_mask = torch.sum(input_mask_expanded * weights, dim=1)\nembeddings = sum_embeddings \u002F sum_mask\n\n# 计算余弦相似度\ncosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])\ncosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])\ncosine_sim_0_3 = 1 - cosine(embeddings[0], embeddings[3])\n\nprint(f\"Similarity (DL vs AI): {cosine_sim_0_1:.3f}\")\nprint(f\"Similarity (DL vs Diving): {cosine_sim_0_2:.3f}\")\nprint(f\"Similarity (DL vs Snow): {cosine_sim_0_3:.3f}\")\n```\n\n### 2. 非对称语义搜索 (Asymmetric Semantic Search)\n\n适用于搜索场景（Query 短，Document 长）。SGPT 通过在 Query 和 Document 前后添加特殊标记（`[ ]` 和 `{ }`）来区分角色。\n\n```python\nimport torch\nfrom transformers import AutoModel, AutoTokenizer\nfrom scipy.spatial.distance import cosine\n\n# 加载非对称模型\n# 推荐大模型: Muennighoff\u002FSGPT-5.8B-weightedmean-msmarco-specb-bitfit\nmodel_name = \"Muennighoff\u002FSGPT-125M-weightedmean-msmarco-specb-bitfit\"\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModel.from_pretrained(model_name)\nmodel.eval()\n\nqueries = [\"I'm searching for a planet not too far from Earth.\"]\ndocs = [\n    \"Neptune is the eighth and farthest-known Solar planet from the Sun...\",\n    \"TRAPPIST-1d... orbits... approximately 40 light-years away from Earth...\",\n    \"A harsh desert world orbiting twin suns... 
Tatooine...\",\n]\n\n# 定义特殊标记 ID\nSPECB_QUE_BOS = tokenizer.encode(\"[\", add_special_tokens=False)[0]\nSPECB_QUE_EOS = tokenizer.encode(\"]\", add_special_tokens=False)[0]\nSPECB_DOC_BOS = tokenizer.encode(\"{\", add_special_tokens=False)[0]\nSPECB_DOC_EOS = tokenizer.encode(\"}\", add_special_tokens=False)[0]\n\ndef tokenize_with_specb(texts, is_query):\n    batch_tokens = tokenizer(texts, padding=False, truncation=True)   \n    for seq, att in zip(batch_tokens[\"input_ids\"], batch_tokens[\"attention_mask\"]):\n        if is_query:\n            seq.insert(0, SPECB_QUE_BOS)\n            seq.append(SPECB_QUE_EOS)\n        else:\n            seq.insert(0, SPECB_DOC_BOS)\n            seq.append(SPECB_DOC_EOS)\n        att.insert(0, 1)\n        att.append(1)\n    batch_tokens = tokenizer.pad(batch_tokens, padding=True, return_tensors=\"pt\")\n    return batch_tokens\n\ndef get_weightedmean_embedding(batch_tokens, model):\n    with torch.no_grad():\n        last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state\n    \n    weights = (\n        torch.arange(start=1, end=last_hidden_state.shape[1] + 1)\n        .unsqueeze(0).unsqueeze(-1)\n        .expand(last_hidden_state.size()).float().to(last_hidden_state.device)\n    )\n    input_mask_expanded = batch_tokens[\"attention_mask\"].unsqueeze(-1).expand(last_hidden_state.size()).float()\n    \n    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)\n    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)\n    return sum_embeddings \u002F sum_mask\n\n# 生成嵌入\nquery_embeddings = get_weightedmean_embedding(tokenize_with_specb(queries, is_query=True), model)\ndoc_embeddings = get_weightedmean_embedding(tokenize_with_specb(docs, is_query=False), model)\n\n# 计算相似度\nfor i, doc in enumerate(docs):\n    sim = 1 - cosine(query_embeddings[0], doc_embeddings[i])\n    print(f\"Query vs Doc {i+1}: {sim:.3f}\")\n```\n\n### 替代方案：使用 Sentence Transformers\n\n如果您希望代码更简洁，SGPT 已兼容 `sentence-transformers` 库（需 2022-09 后版本）：\n\n```python\nfrom sentence_transformers import SentenceTransformer\n\n# 直接加载兼容的 SGPT 模型\nmodel = SentenceTransformer('Muennighoff\u002FSGPT-125M-weightedmean-nli-bitfit')\n\nsentences = [\"deep learning\", \"artificial intelligence\"]\nembeddings = model.encode(sentences)\n```","某大型电商公司的搜索团队正致力于优化其内部海量商品评论库的检索系统，以支持客服快速定位用户反馈。\n\n### 没有 sgpt 时\n- **关键词匹配局限大**：传统搜索引擎仅依赖关键词重合度，当用户搜索“电池不耐用”时，无法召回包含“续航太短”或“充电频繁”等语义相同但措辞不同的评论。\n- **长尾查询效果差**：面对复杂的自然语言提问（如“适合送给老人的操作简单的手机”），系统难以理解意图，往往返回大量无关的高频商品列表。\n- **维护成本高昂**：为了覆盖同义词和场景变体，工程师需要手动构建庞大的同义词库和规则引擎，耗时耗力且难以跟上用户表达习惯的变化。\n- **跨语言支持困难**：面对多语种用户评论，需要为每种语言单独训练或配置模型，架构臃肿且推理延迟高。\n\n### 使用 sgpt 后\n- **语义理解精准**：sgpt 利用 GPT 生成的句子嵌入向量，能直接捕捉“电池不耐用”与“续航太短”之间的深层语义关联，显著提升召回率。\n- **自然语言交互流畅**：借助 sgpt 的不对称搜索能力，系统能完美解析复杂的长句查询，直接定位到最符合意图的具体评论段落。\n- **开发效率飞跃**：无需手动维护同义词库，仅需微调偏置张量即可让模型适应特定业务领域，大幅降低了迭代周期。\n- **多语言统一架构**：利用 sgpt 的多语言 BLOOM 模型，一套架构即可同时处理中、英、法等多语种检索，简化了系统部署并降低了延迟。\n\nsgpt 通过将大语言模型的生成能力转化为高效的语义搜索引擎，让机器真正“读懂”了用户的自然语言意图。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FMuennighoff_sgpt_538f14a3.png","Muennighoff","Niklas","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FMuennighoff_ba8a01a2.png",null,"Stanford","n.muennighoff@gmail.com","muennighoff.com","https:\u002F\u002Fgithub.com\u002FMuennighoff",[84,88,92],{"name":85,"color":86,"percentage":87},"Jupyter 
Notebook","#DA5B0B",94.7,{"name":89,"color":90,"percentage":91},"Python","#3572A5",5.3,{"name":93,"color":94,"percentage":95},"Shell","#89e051",0.1,873,51,"2026-02-02T16:58:40","MIT","未说明","非绝对必需（代码支持 CPU 运行），但推荐使用 NVIDIA GPU。显存需求取决于模型大小：125M 模型需少量显存，5.8B 模型建议 16GB+，7.1B 模型建议 24GB+。CUDA 版本未说明。","最低 8GB（小模型），推荐 32GB+（运行 5.8B 或 7.1B 大模型时）",{"notes":104,"python":100,"dependencies":105},"该工具核心依赖 Hugging Face Transformers 和 PyTorch。README 中未提供具体的 requirements.txt 内容，但示例代码明确使用了 torch、transformers 和 scipy。若使用 Sentence Transformers 集成方案，需安装 sentence-transformers 库。模型文件较大（从 125M 到 7.1B 参数不等），首次运行会自动从 Hugging Face 下载，需确保网络连接畅通及磁盘空间充足。对于 5.8B 及以上的大模型，强烈建议使用 GPU 以获得可用速度。",[106,107,108,109],"torch","transformers","scipy","sentence-transformers",[13,26,54,15],[112,113,114,115,116,117,118,119,120,67],"gpt","information-retrieval","language-model","large-language-models","retrieval","semantic-search","sentence-embeddings","text-embedding","neural-search","2026-03-27T02:49:30.150509","2026-04-06T06:53:11.161342",[124,129,134,139,144,149,154,159],{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},12316,"使用 Bloom 3B 等模型训练编码器时遇到 'NotImplementedError: Model input split not implemented for type dict' 错误，如何解决？","该错误通常与使用 `gradcache` 处理高批量大小（batch sizes）有关。如果您在使用 `--gradcache` 和 `--chunksize` 参数时遇到此问题，可以尝试以下方案：\n1. 确保您的代码兼容字典类型的输入分割。\n2. 如果移除 `--gradcache --chunksize` 导致内存溢出（OOM），而保留它们又报错，可能需要修改代码使其兼容，或者尝试调整 `chunksize` 的值（例如设置为 8）。\n3. 维护者指出，`--asym` 参数在某些情况下效果不佳，区分查询和文档主要依靠 `--specb` 参数添加不同的括号。","https:\u002F\u002Fgithub.com\u002FMuennighoff\u002Fsgpt\u002Fissues\u002F27",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},12317,"如何开始微调 SGPT 模型？如果在运行过程中遇到 accelerate 相关错误或依赖问题怎么办？","微调 SGPT 可以参考项目 README 中的指南。如果遇到错误，特别是与 `accelerate` 或 `huggingface-hub` 相关的问题，可以尝试以下步骤：\n1. 升级或指定特定版本的 huggingface-hub：运行 `pip install --upgrade huggingface-hub==0.10.1`。\n2. 运行 `accelerate config` 命令，根据提示配置您的环境（如 GPU 数量等）。\n3. 如果不想使用 accelerate，可以尝试查看是否有其他训练脚本（如 `training_nli.py`）可直接运行，但通常建议配置好 accelerate 以获得更好的多卡支持。","https:\u002F\u002Fgithub.com\u002FMuennighoff\u002Fsgpt\u002Fissues\u002F18",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},12318,"如何在显存有限的显卡（如 RTX 3090）上运行较大的 SGPT 模型（如 5B 或 7B）？是否支持量化？","对于显存受限的情况：\n1. 确保在加载模型时使用代码中提到的 `weighted mean pooling`（加权平均池化）。\n2. 虽然用户询问了 16-bit 或 8-bit 版本，但维护者建议参考 Hugging Face 上的使用说明，并确保使用正确的池化方法。\n3. 如果模型过大无法加载，可以考虑寻找更小的模型变体，或者等待官方发布量化版本（如 8-bit 模式的大模型）。目前可以通过标准的 Transformers 库加载，但需注意显存管理。","https:\u002F\u002Fgithub.com\u002FMuennighoff\u002Fsgpt\u002Fissues\u002F19",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},12319,"Hugging Face 上的 SGPT 模型文件暂时无法访问或显示丢失，该怎么办？","如果模型页面暂时不可用，可能是暂时的服务中断，通常很快就会恢复。作为临时解决方案，您可以使用本地缓存的模型权重：\n1. 检查本地缓存目录，路径通常类似于：`\u002Fhome\u002F$USER\u002F.cache\u002Fhuggingface\u002Fhub\u002Fmodels--Muennighoff--SGPT-...\u002Fsnapshots\u002F...\u002F`。\n2. 如果缓存中存在之前下载过的模型文件，可以直接指向该本地路径加载模型，无需重新下载。","https:\u002F\u002Fgithub.com\u002FMuennighoff\u002Fsgpt\u002Fissues\u002F28",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},12320,"如何从头构建基于 GPT-2 或其他架构的 SGPT 双编码器（Bi-Encoder）？","SGPT 最初基于 GPT-Neo，但也可以构建基于其他架构的模型。要构建双编码器（Bi-Encoder）：\n1. 请参考项目仓库中 `biencoder\u002Fnli_msmarco` 目录下的具体指令和代码。\n2. README 中提供了详细的步骤，指导如何配置和训练双编码器结构。\n3. 
如果需要基于 GPT-2 构建，原理类似，需替换底层的模型架构并相应调整配置文件。","https:\u002F\u002Fgithub.com\u002FMuennighoff\u002Fsgpt\u002Fissues\u002F11",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},12321,"能否将 SGPT 的因果注意力（Causal Attention）改为自注意力（Self Attention\u002FBidirectional）来生成句子嵌入？","是的，这是可行的，并且已经有相关的进展：\n1. 维护者已经发布了使用双向注意力（bidirectional attention）训练的解码器语言模型，这被视为 SGPT 的 v2 版本。\n2. 相关论文地址：https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.09906\n3. 模型地址：https:\u002F\u002Fhuggingface.co\u002FGritLM\u002FGritLM-7B\n4. 使用这些新模型可以获得更好的嵌入效果，因为它们不再受限于单向因果注意力掩码。","https:\u002F\u002Fgithub.com\u002FMuennighoff\u002Fsgpt\u002Fissues\u002F46",{"id":155,"question_zh":156,"answer_zh":157,"source_url":158},12322,"训练脚本中引用的 'gpt-neo-125M' 名称报错，正确的模型名称是什么？","Hugging Face 上 GPT-Neo-125m 的正确名称是 `gpt-neo-125m`（注意大小写，末尾是小写 'm'）。\n1. 虽然部分脚本中写成了 `gpt-neo-125M`（大写 'M'），但在大多数情况下，Transformers 库的模型加载是不区分大小写的，因此 `EleutherAI\u002FGPT-NEO-125M` 通常也能正常工作。\n2. 如果确实遇到文件名错误或加载失败，请尝试将脚本中的模型名称修改为全小写的 `gpt-neo-125m` 或标准写法 `EleutherAI\u002Fgpt-neo-125m`。","https:\u002F\u002Fgithub.com\u002FMuennighoff\u002Fsgpt\u002Fissues\u002F38",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},12323,"SGPT 无监督嵌入的核心代码逻辑是什么？","SGPT 无监督嵌入的核心逻辑等价于项目中提供的简化代码示例。\n1. 具体的实现可以参考 README 中的 \"Asymmetric Semantic Search\" 部分。\n2. 完整的 Python 脚本通常包含额外的批处理（batching）和基准测试（benchmarking）开销，但核心嵌入生成逻辑与简化版一致。\n3. 关键在于利用特定的标记（brackets）区分查询和文档，并结合加权平均池化生成嵌入。","https:\u002F\u002Fgithub.com\u002FMuennighoff\u002Fsgpt\u002Fissues\u002F34",[]]