[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-huggingface--text-embeddings-inference":3,"tool-huggingface--text-embeddings-inference":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":76,"owner_website":81,"owner_url":82,"languages":83,"stars":111,"forks":112,"last_commit_at":113,"license":114,"difficulty_score":10,"env_os":115,"env_gpu":116,"env_ram":117,"env_deps":118,"category_tags":127,"github_topics":128,"view_count":133,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":134,"updated_at":135,"faqs":136,"releases":164},368,"huggingface\u002Ftext-embeddings-inference","text-embeddings-inference","A blazing fast inference solution for text embeddings models","Text Embeddings Inference 是一款专为文本嵌入模型打造的高性能推理服务框架。它旨在解决开源模型在部署时面临的推理速度慢、资源消耗大以及难以满足生产环境需求等问题。无论是构建检索增强生成（RAG）系统，还是进行语义搜索，Text Embeddings Inference 都能提供稳定且高效的模型服务支持。\n\nText Embeddings Inference 特别适合 AI 开发者、算法研究人员以及负责模型部署的运维工程师使用。无需复杂的模型图编译步骤，Text Embeddings Inference 直接支持多种主流模型架构，如 Qwen、Bert、GTE 等。技术上，Text Embeddings Inference 集成了 Flash Attention、Candle 和 cuBLASLt 等优化技术，显著提升了推理吞吐量并降低了延迟。此外，Text Embeddings Inference 还具备动态批处理、小体积 Docker 镜像以及即插即用的特性，甚至支持 Apple Silicon 本地加速。对于需要高并发、低延迟的企业级应用，Text Embe","Text Embeddings Inference 是一款专为文本嵌入模型打造的高性能推理服务框架。它旨在解决开源模型在部署时面临的推理速度慢、资源消耗大以及难以满足生产环境需求等问题。无论是构建检索增强生成（RAG）系统，还是进行语义搜索，Text Embeddings Inference 都能提供稳定且高效的模型服务支持。\n\nText Embeddings Inference 特别适合 AI 开发者、算法研究人员以及负责模型部署的运维工程师使用。无需复杂的模型图编译步骤，Text Embeddings Inference 直接支持多种主流模型架构，如 Qwen、Bert、GTE 等。技术上，Text Embeddings Inference 集成了 Flash Attention、Candle 和 cuBLASLt 等优化技术，显著提升了推理吞吐量并降低了延迟。此外，Text Embeddings Inference 还具备动态批处理、小体积 Docker 镜像以及即插即用的特性，甚至支持 Apple Silicon 本地加速。对于需要高并发、低延迟的企业级应用，Text Embeddings Inference 还提供了完善的分布式追踪和 Prometheus 监控指标，确保服务在生产环境中可靠运行。","\u003Cdiv align=\"center\">\n\n# Text Embeddings Inference\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\">\n  \u003Cimg alt=\"GitHub Repo stars\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuggingface\u002Ftext-embeddings-inference?style=social\">\n\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.github.io\u002Ftext-embeddings-inference\">\n  \u003Cimg alt=\"Swagger API documentation\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAPI-Swagger-informational\">\n\u003C\u002Fa>\n\nA blazing fast inference solution for text embeddings models.\n\nBenchmark for [BAAI\u002Fbge-base-en-v1.5](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002Fbge-base-en-v1.5) on an NVIDIA A10 with a sequence\nlength of 512 tokens:\n\n\u003Cp>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_text-embeddings-inference_readme_d17a0e59ef5b.png\" width=\"400\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_text-embeddings-inference_readme_dde18d6b8125.png\" width=\"400\" \u002F>\n\u003C\u002Fp>\n\u003Cp>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_text-embeddings-inference_readme_b00b4011e088.png\" width=\"400\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_text-embeddings-inference_readme_b431099b0af0.png\" width=\"400\" \u002F>\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n## Table of contents\n\n- [Get Started](#get-started)\n    - [Supported Models](#supported-models)\n    - [Docker](#docker)\n    - [Docker Images](#docker-images)\n    - [API Documentation](#api-documentation)\n    - [Using a private or gated model](#using-a-private-or-gated-model)\n    - [Air gapped deployment](#air-gapped-deployment)\n    - [Using Re-rankers models](#using-re-rankers-models)\n    - [Using Sequence Classification models](#using-sequence-classification-models)\n    - [Using SPLADE pooling](#using-splade-pooling)\n    - [Distributed Tracing](#distributed-tracing)\n    - [gRPC](#grpc)\n- [Local Install](#local-install)\n    - [Apple Silicon (Homebrew)](#apple-silicon-homebrew)\n- [Docker Build](#docker-build)\n    - [Apple M1\u002FM2 Arm](#apple-m1m2-arm64-architectures)\n- [Examples](#examples)\n\nText Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence\nclassification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding,\nEmber, GTE and E5. TEI implements many features such as:\n\n* No model graph compilation step\n* Metal support for local execution on Macs\n* Small docker images and fast boot times. 
## Get Started

### Supported Models

#### Text Embeddings

Text Embeddings Inference currently supports Nomic, BERT, CamemBERT, XLM-RoBERTa models with absolute positions, JinaBERT model with Alibi positions and Mistral, Alibaba GTE, Qwen2 models with Rope positions, MPNet, ModernBERT, Qwen3, and Gemma3.

Below are some examples of the currently supported models:

| MTEB Rank | Model Size             | Model Type  | Model ID                                                                                          |
|-----------|------------------------|-------------|---------------------------------------------------------------------------------------------------|
| 2         | 7.57B (Very Expensive) | Qwen3       | [Qwen/Qwen3-Embedding-8B](https://hf.co/Qwen/Qwen3-Embedding-8B)                                   |
| 3         | 4.02B (Very Expensive) | Qwen3       | [Qwen/Qwen3-Embedding-4B](https://hf.co/Qwen/Qwen3-Embedding-4B)                                   |
| 4         | 509M                   | Qwen3       | [Qwen/Qwen3-Embedding-0.6B](https://hf.co/Qwen/Qwen3-Embedding-0.6B)                               |
| 6         | 7.61B (Very Expensive) | Qwen2       | [Alibaba-NLP/gte-Qwen2-7B-instruct](https://hf.co/Alibaba-NLP/gte-Qwen2-7B-instruct)               |
| 7         | 560M                   | XLM-RoBERTa | [intfloat/multilingual-e5-large-instruct](https://hf.co/intfloat/multilingual-e5-large-instruct)   |
| 8         | 308M                   | Gemma3      | [google/embeddinggemma-300m](https://hf.co/google/embeddinggemma-300m) (gated)                     |
| 15        | 1.78B (Expensive)      | Qwen2       | [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://hf.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)           |
| 18        | 7.11B (Very Expensive) | Mistral     | [Salesforce/SFR-Embedding-2_R](https://hf.co/Salesforce/SFR-Embedding-2_R)                         |
| 35        | 568M                   | XLM-RoBERTa | [Snowflake/snowflake-arctic-embed-l-v2.0](https://hf.co/Snowflake/snowflake-arctic-embed-l-v2.0)   |
| 41        | 305M                   | Alibaba GTE | [Snowflake/snowflake-arctic-embed-m-v2.0](https://hf.co/Snowflake/snowflake-arctic-embed-m-v2.0)   |
| 52        | 335M                   | BERT        | [WhereIsAI/UAE-Large-V1](https://hf.co/WhereIsAI/UAE-Large-V1)                                     |
| 58        | 137M                   | NomicBERT   | [nomic-ai/nomic-embed-text-v1](https://hf.co/nomic-ai/nomic-embed-text-v1)                         |
| 79        | 137M                   | NomicBERT   | [nomic-ai/nomic-embed-text-v1.5](https://hf.co/nomic-ai/nomic-embed-text-v1.5)                     |
| 103       | 109M                   | MPNet       | [sentence-transformers/all-mpnet-base-v2](https://hf.co/sentence-transformers/all-mpnet-base-v2)   |
| N/A       | 475M-A305M             | NomicBERT   | [nomic-ai/nomic-embed-text-v2-moe](https://hf.co/nomic-ai/nomic-embed-text-v2-moe)                 |
| N/A       | 434M                   | Alibaba GTE | [Alibaba-NLP/gte-large-en-v1.5](https://hf.co/Alibaba-NLP/gte-large-en-v1.5)                       |
| N/A       | 396M                   | ModernBERT  | [answerdotai/ModernBERT-large](https://hf.co/answerdotai/ModernBERT-large)                         |
| N/A       | 340M                   | Qwen3       | [voyageai/voyage-4-nano](https://hf.co/voyageai/voyage-4-nano)                                     |
| N/A       | 137M                   | JinaBERT    | [jinaai/jina-embeddings-v2-base-en](https://hf.co/jinaai/jina-embeddings-v2-base-en)               |
| N/A       | 137M                   | JinaBERT    | [jinaai/jina-embeddings-v2-base-code](https://hf.co/jinaai/jina-embeddings-v2-base-code)           |

To explore the list of best performing text embeddings models, visit the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
#### Sequence Classification and Re-Ranking

Text Embeddings Inference currently supports CamemBERT and XLM-RoBERTa Sequence Classification models with absolute positions.

Below are some examples of the currently supported models:

| Task               | Model Type  | Model ID                                                                                                          |
|--------------------|-------------|-------------------------------------------------------------------------------------------------------------------|
| Re-Ranking         | XLM-RoBERTa | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large)                                           |
| Re-Ranking         | XLM-RoBERTa | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base)                                             |
| Re-Ranking         | GTE         | [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base)     |
| Re-Ranking         | ModernBert  | [Alibaba-NLP/gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base)         |
| Sentiment Analysis | RoBERTa     | [SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions)                         |

### Docker

```shell
model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
```

And then you can make requests like

```bash
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```
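The `/embed` route also accepts a list of strings, so several inputs can share one request and one batch. A minimal sketch, assuming the server started above is still running and that `jq` is installed locally; the second filter simply counts the floats in the first returned vector to reveal the embedding dimension:

```bash
# Send two inputs in one request; the response is a JSON array of embedding
# vectors, one per input, in the same order as the inputs.
curl -s 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":["What is Deep Learning?", "What is Machine Learning?"]}' \
    -H 'Content-Type: application/json' \
  | jq 'length, (.[0] | length)'   # number of vectors, then the embedding dimension
```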
**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). NVIDIA drivers on your machine need to be compatible with CUDA version 12.2 or higher.

To see all options to serve your models:

```console
$ text-embeddings-router --help
Text Embedding Webserver

Usage: text-embeddings-router [OPTIONS] --model-id <MODEL_ID>

Options:
      --model-id <MODEL_ID>
          The Hugging Face model ID, can be any model listed on <https://huggingface.co/models> with the `text-embeddings-inference` tag (meaning it's compatible with Text Embeddings Inference).

          Alternatively, the specified ID can also be a path to a local directory containing the necessary model files saved by the `save_pretrained(...)` methods of either Transformers or Sentence Transformers.

          [env: MODEL_ID=]

      --revision <REVISION>
          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`

          [env: REVISION=]

      --tokenization-workers <TOKENIZATION_WORKERS>
          Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation. Default to the number of CPU cores on the machine

          [env: TOKENIZATION_WORKERS=]

      --dtype <DTYPE>
          The dtype to be forced upon the model

          [env: DTYPE=]
          [possible values: float16, float32]

      --served-model-name <SERVED_MODEL_NAME>
          The name of the model that is being served. If not specified, defaults to `--model-id`. It is only used for the OpenAI-compatible endpoints via HTTP

          [env: SERVED_MODEL_NAME=]

      --pooling <POOLING>
          Optionally control the pooling method for embedding models.

          If `pooling` is not set, the pooling configuration will be parsed from the model `1_Pooling/config.json` configuration.

          If `pooling` is set, it will override the model pooling configuration

          [env: POOLING=]

          Possible values:
          - cls:        Select the CLS token as embedding
          - mean:       Apply Mean pooling to the model embeddings
          - splade:     Apply SPLADE (Sparse Lexical and Expansion) to the model embeddings. This option is only available if the loaded model is a `ForMaskedLM` Transformer model
          - last-token: Select the last token as embedding

      --max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
          The maximum amount of concurrent requests for this particular deployment. Having a low limit will refuse clients requests instead of having them wait for too long and is usually good to handle backpressure correctly

          [env: MAX_CONCURRENT_REQUESTS=]
          [default: 512]

      --max-batch-tokens <MAX_BATCH_TOKENS>
          **IMPORTANT** This is one critical control to allow maximum usage of the available hardware.

          This represents the total amount of potential tokens within a batch.

          For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.

          Overall this number should be the largest possible until the model is compute bound. Since the actual memory overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.

          [env: MAX_BATCH_TOKENS=]
          [default: 16384]
      --max-batch-requests <MAX_BATCH_REQUESTS>
          Optionally control the maximum number of individual requests in a batch

          [env: MAX_BATCH_REQUESTS=]

      --max-client-batch-size <MAX_CLIENT_BATCH_SIZE>
          Control the maximum number of inputs that a client can send in a single request

          [env: MAX_CLIENT_BATCH_SIZE=]
          [default: 32]

      --auto-truncate
          Control automatic truncation of inputs that exceed the model's maximum supported size. Defaults to `true` (truncation enabled). Set to `false` to disable truncation; when disabled and the model's maximum input length exceeds `--max-batch-tokens`, the server will refuse to start with an error instead of silently truncating sequences.

          Unused for gRPC servers

          [env: AUTO_TRUNCATE=]

      --default-prompt-name <DEFAULT_PROMPT_NAME>
          The name of the prompt that should be used by default for encoding. If not set, no prompt will be applied.

          Must be a key in the `sentence-transformers` configuration `prompts` dictionary.

          For example if ``default_prompt_name`` is "query" and the ``prompts`` is {"query": "query: ", ...}, then the sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text to encode.

          The argument '--default-prompt-name <DEFAULT_PROMPT_NAME>' cannot be used with '--default-prompt <DEFAULT_PROMPT>`

          [env: DEFAULT_PROMPT_NAME=]

      --default-prompt <DEFAULT_PROMPT>
          The prompt that should be used by default for encoding. If not set, no prompt will be applied.

          For example if ``default_prompt`` is "query: " then the sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text to encode.

          The argument '--default-prompt <DEFAULT_PROMPT>' cannot be used with '--default-prompt-name <DEFAULT_PROMPT_NAME>`

          [env: DEFAULT_PROMPT=]

      --dense-path <DENSE_PATH>
          Optionally, define the path to the Dense module required for some embedding models.

          Some embedding models require an extra `Dense` module which contains a single Linear layer and an activation function. By default, those `Dense` modules are stored under the `2_Dense` directory, but there might be cases where different `Dense` modules are provided, to convert the pooled embeddings into different dimensions, available as `2_Dense_<dims>` e.g. https://huggingface.co/NovaSearch/stella_en_400M_v5.

          Note that this argument is optional, only required to be set if there is no `modules.json` file or when you want to override a single Dense module path, only when running with the `candle` backend.

          [env: DENSE_PATH=]

      --hf-token <HF_TOKEN>
          Your Hugging Face Hub token. If neither `--hf-token` nor `HF_TOKEN` are set, the token will be read from the `$HF_HOME/token` path, if it exists. This ensures access to private or gated models, and allows for a more permissive rate limiting

          [env: HF_TOKEN=]
      --hostname <HOSTNAME>
          The IP address to listen on

          [env: HOSTNAME=]
          [default: 0.0.0.0]

  -p, --port <PORT>
          The port to listen on

          [env: PORT=]
          [default: 3000]

      --uds-path <UDS_PATH>
          The name of the unix socket some text-embeddings-inference backends will use as they communicate internally with gRPC

          [env: UDS_PATH=]
          [default: /tmp/text-embeddings-inference-server]

      --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
          The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance

          [env: HUGGINGFACE_HUB_CACHE=]

      --payload-limit <PAYLOAD_LIMIT>
          Payload size limit in bytes

          Default is 2MB

          [env: PAYLOAD_LIMIT=]
          [default: 2000000]

      --api-key <API_KEY>
          Set an api key for request authorization.

          By default the server responds to every request. With an api key set, the requests must have the Authorization header set with the api key as Bearer token.

          [env: API_KEY=]

      --json-output
          Outputs the logs in JSON format (useful for telemetry)

          [env: JSON_OUTPUT=]

      --disable-spans
          Whether or not to include the log trace through spans

          [env: DISABLE_SPANS=]

      --otlp-endpoint <OTLP_ENDPOINT>
          The grpc endpoint for opentelemetry. Telemetry is sent to this endpoint as OTLP over gRPC. e.g. `http://localhost:4317`

          [env: OTLP_ENDPOINT=]

      --otlp-service-name <OTLP_SERVICE_NAME>
          The service name for opentelemetry. e.g. `text-embeddings-inference.server`

          [env: OTLP_SERVICE_NAME=]
          [default: text-embeddings-inference.server]

      --prometheus-port <PROMETHEUS_PORT>
          The Prometheus port to listen on

          [env: PROMETHEUS_PORT=]
          [default: 9000]

      --cors-allow-origin <CORS_ALLOW_ORIGIN>
          Unused for gRPC servers

          [env: CORS_ALLOW_ORIGIN=]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
```
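The `--api-key` flag above (or its `API_KEY` environment variable) is enough to put simple bearer-token authorization in front of the server. A minimal sketch, reusing the CUDA image from the earlier examples; the key value here is an arbitrary placeholder:

```shell
# Start the server with an API key; per the help text, requests must then
# carry the key in the Authorization header as a Bearer token.
api_key=my-secret-key   # hypothetical value, choose your own
docker run --gpus all -p 8080:80 -v $PWD/data:/data -e API_KEY=$api_key --pull always \
    ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id Qwen/Qwen3-Embedding-0.6B

# In another terminal, an authorized request:
curl 127.0.0.1:8080/embed \
    -X POST \
    -H "Authorization: Bearer $api_key" \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```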
### Docker Images

Text Embeddings Inference ships with multiple Docker images that you can use to target a specific backend:

| Architecture                           | Image                                                                    |
|----------------------------------------|--------------------------------------------------------------------------|
| CPU                                    | ghcr.io/huggingface/text-embeddings-inference:cpu-1.9                    |
| Volta                                  | NOT SUPPORTED                                                            |
| Turing (T4, RTX 2000 series, ...)      | ghcr.io/huggingface/text-embeddings-inference:turing-1.9 (experimental)  |
| Ampere 8.0 (A100, A30)                 | ghcr.io/huggingface/text-embeddings-inference:1.9                        |
| Ampere 8.6 (A10, A40, ...)             | ghcr.io/huggingface/text-embeddings-inference:86-1.9                     |
| Ada Lovelace (RTX 4000 series, ...)    | ghcr.io/huggingface/text-embeddings-inference:89-1.9                     |
| Hopper (H100)                          | ghcr.io/huggingface/text-embeddings-inference:hopper-1.9                 |
| Blackwell 10.0 (B200, GB200, ...)      | ghcr.io/huggingface/text-embeddings-inference:100-1.9 (experimental)     |
| Blackwell 12.0 (GeForce RTX 50X0, ...) | ghcr.io/huggingface/text-embeddings-inference:120-1.9 (experimental)     |

**Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues. You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.

### API documentation

You can consult the OpenAPI documentation of the `text-embeddings-inference` REST API using the `/docs` route. The Swagger UI is also available at: [https://huggingface.github.io/text-embeddings-inference](https://huggingface.github.io/text-embeddings-inference).
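Besides the inference routes, the OpenAPI spec linked above also lists lightweight service routes that are handy for probes and debugging. A quick sanity check, assuming `/health` and `/info` are exposed as shown in the Swagger UI:

```shell
# Liveness probe: returns a success status once the server is up.
curl -i 127.0.0.1:8080/health

# Model and server metadata (model ID, dtype, batch limits, ...).
curl -s 127.0.0.1:8080/info
```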
### Using a private or gated model

You have the option to utilize the `HF_TOKEN` environment variable for configuring the token employed by `text-embeddings-inference`. This allows you to gain access to protected resources.

For example:

1. Go to https://huggingface.co/settings/tokens
2. Copy your CLI READ token
3. Export `HF_TOKEN=<your CLI READ token>`

or with Docker:

```shell
model=<your private model>
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your CLI READ token>

docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
```

### Air gapped deployment

To deploy Text Embeddings Inference in an air-gapped environment, first download the weights and then mount them inside the container using a volume.

For example:

```shell
# (Optional) create a `models` directory
mkdir models
cd models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-Embedding-0.6B

# Set the models directory as the volume path
volume=$PWD

# Mount the models directory inside the container with a volume and set the model ID
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id /data/Qwen3-Embedding-0.6B
```

### Using Re-ranker models

`text-embeddings-inference` v0.4.0 added support for CamemBERT, RoBERTa, XLM-RoBERTa, and GTE Sequence Classification models. Re-ranker models are Sequence Classification cross-encoder models with a single class that scores the similarity between a query and a text.

See [this blogpost](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83) by the LlamaIndex team to understand how you can use re-ranker models in your RAG pipeline to improve downstream performance.

```shell
model=BAAI/bge-reranker-large
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
```

And then you can rank the similarity between a query and a list of texts with:

```bash
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'
```
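To feed the result into a retrieval pipeline, the ranked output can be post-processed on the command line. A sketch assuming `jq` is installed and that the response is a JSON array of `{index, score}` objects, as described in the Swagger documentation:

```bash
# Keep only the index of the best-scoring passage.
curl -s 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json' \
  | jq 'max_by(.score) | .index'
```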
### Using Sequence Classification models

You can also use classic Sequence Classification models like `SamLowe/roberta-base-go_emotions`:

```shell
model=SamLowe/roberta-base-go_emotions
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
```

Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:

```bash
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```

### Using SPLADE pooling

You can choose to activate SPLADE pooling for Bert and Distilbert MaskedLM architectures:

```shell
model=naver/efficient-splade-VI-BT-large-query
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model --pooling splade
```

Once you have deployed the model you can use the `/embed_sparse` endpoint to get the sparse embedding:

```bash
curl 127.0.0.1:8080/embed_sparse \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```

### Distributed Tracing

`text-embeddings-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address of an OTLP collector with the `--otlp-endpoint` argument.

### gRPC

`text-embeddings-inference` offers a gRPC API as an alternative to the default HTTP API for high performance deployments. The API protobuf definition can be found [here](https://github.com/huggingface/text-embeddings-inference/blob/main/proto/tei.proto).

You can use the gRPC API by adding the `-grpc` tag to any TEI Docker image. For example:

```shell
model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9-grpc --model-id $model
```

```shell
grpcurl -d '{"inputs": "What is Deep Learning"}' -plaintext 0.0.0.0:8080 tei.v1.Embed/Embed
```
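`grpcurl` can also introspect the service. A sketch: the `list`/`describe` calls assume the server exposes gRPC reflection; if it does not, download the `tei.proto` definition linked above and pass it with `-proto` instead:

```shell
# List available services and inspect the Embed service (requires reflection).
grpcurl -plaintext 0.0.0.0:8080 list
grpcurl -plaintext 0.0.0.0:8080 describe tei.v1.Embed

# Without reflection, supply the protobuf definition explicitly
# (tei.proto downloaded from the repository link above).
grpcurl -proto tei.proto -d '{"inputs": "What is Deep Learning"}' -plaintext 0.0.0.0:8080 tei.v1.Embed/Embed
```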
## Local install

### Apple Silicon (Homebrew)

On Apple Silicon (M1/M2/M3/M4), you can install a prebuilt binary via Homebrew:

```shell
brew install text-embeddings-inference
```

Then launch Text Embeddings Inference with Metal acceleration:

```shell
model=Qwen/Qwen3-Embedding-0.6B

text-embeddings-router --model-id $model --port 8080
```

### CPU

You can also opt to install `text-embeddings-inference` locally.

First [install Rust](https://rustup.rs/):

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

Then run:

```shell
# On x86 with ONNX backend (recommended)
cargo install --path router -F ort
# On x86 with Intel backend
cargo install --path router -F mkl
# On M1 or M2
cargo install --path router -F metal
```

You can now launch Text Embeddings Inference on CPU with:

```shell
model=Qwen/Qwen3-Embedding-0.6B

text-embeddings-router --model-id $model --port 8080
```

**Note:** on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

### CUDA

GPUs with CUDA compute capabilities < 7.5 are not supported (V100, Titan V, GTX 1000 series, ...).

Make sure you have CUDA and the NVIDIA drivers installed. NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher. You also need to add the NVIDIA binaries to your path:

```shell
export PATH=$PATH:/usr/local/cuda/bin
```

Then run the following (might take a while as it needs to compile the CUDA kernels):

```shell
# On Turing GPUs (T4, RTX 2000 series ... )
cargo install --path router -F candle-cuda-turing

# On Ampere, Ada Lovelace, Hopper and Blackwell
cargo install --path router -F candle-cuda
```

You can now launch Text Embeddings Inference on GPU as follows:

```shell
model=Qwen/Qwen3-Embedding-0.6B

text-embeddings-router --model-id $model --port 8080
```

## Docker Build

You can build the CPU container with Docker as:

```shell
docker build -f Dockerfile .
```

To build the CUDA containers, you need to know the compute cap of the GPU you will be using at runtime, to build the image accordingly:

```shell
# Get submodule dependencies
git submodule update --init

# Example for Turing (T4, RTX 2000 series, ...)
runtime_compute_cap=75

# Example for Ampere (A100, ...)
runtime_compute_cap=80

# Example for Ampere (A10, ...)
runtime_compute_cap=86

# Example for Ada Lovelace (RTX 4000 series, ...)
runtime_compute_cap=89

# Example for Hopper (H100, ...)
runtime_compute_cap=90

# Example for Blackwell (B200, GB200, ...)
runtime_compute_cap=100

# Example for Blackwell (GeForce RTX 50X0, RTX PRO 6000, ...)
runtime_compute_cap=120

docker build . -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap
```

### Apple M1/M2 arm64 architectures

#### DISCLAIMER

As explained here [MPS-Ready, ARM64 Docker Image](https://github.com/pytorch/pytorch/issues/81224), Metal / MPS is not supported via Docker. As such, inference will be CPU bound and most likely pretty slow when using this docker image on an M1/M2 ARM CPU.

```shell
docker build . -f Dockerfile --platform=linux/arm64
```

## Examples

- [Set up an Inference Endpoint with TEI](https://huggingface.co/learn/cookbook/automatic_embedding_tei_inference_endpoints)
- [RAG containers with TEI](https://github.com/plaggy/rag-containers)
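One more example worth noting: the CLI help for `--served-model-name` mentions OpenAI-compatible HTTP endpoints, which let existing OpenAI client code point at TEI. A sketch, assuming the OpenAI-style route is served at `/v1/embeddings`:

```bash
# OpenAI-style embeddings request; "model" should match --served-model-name
# (which defaults to the --model-id value).
curl 127.0.0.1:8080/v1/embeddings \
    -X POST \
    -d '{"model": "Qwen/Qwen3-Embedding-0.6B", "input": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```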
`predict` 端点获取与输入最相关的情绪：\n\n```bash\ncurl 127.0.0.1:8080\u002Fpredict \\\n    -X POST \\\n    -d '{\"inputs\":\"I like you.\"}' \\\n    -H 'Content-Type: application\u002Fjson'\n```\n\n### 使用 SPLADE 池化\n\n您可以选择为 Bert 和 Distilbert MaskedLM 架构激活 SPLADE 池化：\n\n```shell\nmodel=naver\u002Fefficient-splade-VI-BT-large-query\nvolume=$PWD\u002Fdata # share a volume with the Docker container to avoid downloading weights every run\n\ndocker run --gpus all -p 8080:80 -v $volume:\u002Fdata --pull always ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cuda-1.9 --model-id $model --pooling splade\n```\n\n部署模型后，您可以使用 `\u002Fembed_sparse` 端点获取稀疏嵌入：\n\n```bash\ncurl 127.0.0.1:8080\u002Fembed_sparse \\\n    -X POST \\\n    -d '{\"inputs\":\"I like you.\"}' \\\n    -H 'Content-Type: application\u002Fjson'\n```\n\n### 分布式追踪\n\n`text-embeddings-inference` 已通过 OpenTelemetry 集成了分布式追踪。您可以通过设置 `--otlp-endpoint` 参数将地址指向 OTLP 收集器来使用此功能。\n\n### gRPC\n\n`text-embeddings-inference` (文本嵌入推理) 提供 gRPC API 作为高性能部署中默认 HTTP API 的替代方案。API 的 protobuf 定义可在 [此处](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fblob\u002Fmain\u002Fproto\u002Ftei.proto) 找到。\n\n您可以通过在任意 TEI (Text Embeddings Inference) Docker 镜像中添加 `-grpc` 标签来使用 gRPC API。例如：\n\n```shell\nmodel=Qwen\u002FQwen3-Embedding-0.6B\nvolume=$PWD\u002Fdata # share a volume with the Docker container to avoid downloading weights every run\n\ndocker run --gpus all -p 8080:80 -v $volume:\u002Fdata --pull always ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cuda-1.9-grpc --model-id $model\n```\n\n```shell\ngrpcurl -d '{\"inputs\": \"What is Deep Learning\"}' -plaintext 0.0.0.0:8080 tei.v1.Embed\u002FEmbed\n```\n\n## 本地安装\n\n### Apple Silicon (Homebrew)\n\n在 Apple Silicon (M1\u002FM2\u002FM3\u002FM4) 上，您可以通过 Homebrew 安装预编译的二进制文件：\n\n```shell\nbrew install text-embeddings-inference\n```\n\n然后使用 Metal 加速启动文本嵌入推理：\n\n```shell\nmodel=Qwen\u002FQwen3-Embedding-0.6B\n\ntext-embeddings-router --model-id $model --port 8080\n```\n\n### CPU\n\n您也可以选择在本地安装 `text-embeddings-inference`。\n\n首先 [安装 Rust](https:\u002F\u002Frustup.rs\u002F)：\n\n```shell\ncurl --proto '=https' --tlsv1.2 -sSf https:\u002F\u002Fsh.rustup.rs | sh\n```\n\n然后运行：\n\n```shell\n# On x86 with ONNX backend (recommended)\ncargo install --path router -F ort\n# On x86 with Intel backend\ncargo install --path router -F mkl\n# On M1 or M2\ncargo install --path router -F metal\n```\n\n现在您可以使用以下方式在 CPU 上启动文本嵌入推理：\n\n```shell\nmodel=Qwen\u002FQwen3-Embedding-0.6B\n\ntext-embeddings-router --model-id $model --port 8080\n```\n\n**注意：** 在某些机器上，您可能还需要 OpenSSL 库和 gcc。在 Linux 机器上，运行：\n\n```shell\nsudo apt-get install libssl-dev gcc -y\n```\n\n### CUDA\n\n不支持 CUDA 计算能力小于 7.5 的 GPU（V100, Titan V, GTX 1000 系列等）。\n\n请确保已安装 CUDA 和 NVIDIA 驱动程序。您设备上的 NVIDIA 驱动程序需要与 CUDA 12.2 或更高版本兼容。您还需要将 NVIDIA 二进制文件添加到您的路径中：\n\n```shell\nexport PATH=$PATH:\u002Fusr\u002Flocal\u002Fcuda\u002Fbin\n```\n\n然后运行以下内容（可能需要一些时间，因为它需要编译 CUDA 内核）：\n\n```shell\n# On Turing GPUs (T4, RTX 2000 series ... 
)\ncargo install --path router -F candle-cuda-turing\n\n# On Ampere, Ada Lovelace, Hopper and Blackwell\ncargo install --path router -F candle-cuda\n```\n\n现在您可以按如下方式在 GPU 上启动文本嵌入推理：\n\n```shell\nmodel=Qwen\u002FQwen3-Embedding-0.6B\n\ntext-embeddings-router --model-id $model --port 8080\n```\n\n## Docker\n\n您可以使用 Docker 构建 CPU 容器，如下所示：\n\n```shell\ndocker build -f Dockerfile .\n```\n\n要构建 CUDA 容器，您需要知道运行时将要使用的 GPU 的计算能力 (compute cap)，以便相应地构建镜像：\n\n```shell\n# Get submodule dependencies\ngit submodule update --init\n\n# Example for Turing (T4, RTX 2000 series, ...)\nruntime_compute_cap=75\n\n# Example for Ampere (A100, ...)\nruntime_compute_cap=80\n\n# Example for Ampere (A10, ...)\nruntime_compute_cap=86\n\n# Example for Ada Lovelace (RTX 4000 series, ...)\nruntime_compute_cap=89\n\n# Example for Hopper (H100, ...)\nruntime_compute_cap=90\n\n# Example for Blackwell (B200, GB200, ...)\nruntime_compute_cap=100\n\n# Example for Blackwell (GeForce RTX 50X0, RTX PRO 6000, ...)\nruntime_compute_cap=120\n\ndocker build . -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap\n```\n\n### Apple M1\u002FM2 arm64 架构\n\n#### 免责声明\n\n正如 [此处](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fissues\u002F81224) 所解释的，Metal \u002F Metal Performance Shaders (MPS) 无法通过 Docker 支持。因此，当在 M1\u002FM2 ARM CPU 上使用此 Docker 镜像时，推理将由 CPU 处理，速度可能相当慢。\n\n```\ndocker build . -f Dockerfile --platform=linux\u002Farm64\n```\n\n## 示例\n\n- [使用 TEI 设置推理端点](https:\u002F\u002Fhuggingface.co\u002Flearn\u002Fcookbook\u002Fautomatic_embedding_tei_inference_endpoints)\n- [使用 TEI 的 RAG (检索增强生成) 容器](https:\u002F\u002Fgithub.com\u002Fplaggy\u002Frag-containers)","# Text Embeddings Inference (TEI) 快速上手指南\n\n**Text Embeddings Inference (TEI)** 是 Hugging Face 出品的高性能开源文本嵌入推理工具。它专为部署和托管开源文本嵌入及序列分类模型而设计，支持动态批处理、Flash Attention 优化，并具备生产级监控能力。\n\n## 1. 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux \u002F macOS \u002F Windows (WSL)\n*   **容器引擎**: 已安装 Docker\n*   **GPU 支持 (推荐)**:\n    *   NVIDIA GPU\n    *   安装 [NVIDIA Container Toolkit](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Finstall-guide.html)\n    *   宿主机 NVIDIA 驱动需兼容 CUDA 12.2 或更高版本\n*   **网络**: 能够访问 Hugging Face Hub 下载模型权重\n\n> **注意**: 国内用户访问 Hugging Face 相关资源可能较慢，建议提前配置好网络代理以确保模型拉取顺畅。\n\n## 2. 安装与部署\n\n推荐使用 Docker 方式进行部署，无需手动编译，启动速度快且镜像体积小。\n\n### 2.1 设置环境变量\n\n定义要使用的模型 ID 和本地数据卷路径（用于缓存模型文件，避免重复下载）：\n\n```shell\nmodel=Qwen\u002FQwen3-Embedding-0.6B\nvolume=$PWD\u002Fdata\n```\n\n### 2.2 启动服务\n\n执行以下命令启动 TEI 服务。该命令将映射端口 `8080` 并挂载本地目录到容器内。\n\n```shell\ndocker run --gpus all -p 8080:80 -v $volume:\u002Fdata --pull always ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cuda-1.9 --model-id $model\n```\n\n启动成功后，服务将在本地 `http:\u002F\u002F127.0.0.1:8080` 运行。\n\n## 3. 基本使用\n\n服务启动后，可通过 HTTP POST 请求发送文本进行嵌入生成。\n\n### 3.1 发送请求\n\n使用 `curl` 向 `\u002Fembed` 端点发送 JSON 数据：\n\n```bash\ncurl 127.0.0.1:8080\u002Fembed \\\n    -X POST \\\n    -d '{\"inputs\":\"What is Deep Learning?\"}' \\\n    -H 'Content-Type: application\u002Fjson'\n```\n\n### 3.2 查看配置选项\n\n如需调整并发数、批次大小或显存限制等高级参数，可参考帮助文档：\n\n```console\n$ text-embeddings-router --help\n```\n\n常用关键参数说明：\n*   `--model-id`: 指定模型 ID 或本地路径。\n*   `--max-batch-tokens`: 控制批次中的最大 token 数（默认 16384），影响硬件利用率。\n*   `--pooling`: 指定池化方式（如 `cls`, `mean`, `last-token`）。\n\n## 4. 
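验证服务返回\n\n部署完成后，可以快速验证 `\u002Fembed` 端点的返回结果。下面的命令仅为示意：假设服务已按第 2 节的方式启动，且本机已安装 `jq`；`\u002Fembed` 返回一个二维数组，每个内层数组对应一条输入文本的嵌入向量。\n\n```bash\n# 取第一条嵌入向量并输出其维度（假设服务监听 127.0.0.1:8080）\ncurl -s 127.0.0.1:8080\u002Fembed \\\n    -X POST \\\n    -d '{\"inputs\":\"What is Deep Learning?\"}' \\\n    -H 'Content-Type: application\u002Fjson' | jq '.[0] | length'\n```\n\n若输出与所选模型的嵌入维度一致（例如 Qwen3-Embedding-0.6B 为 1024），说明服务工作正常。\n\n## 5. 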
模型支持\n\nTEI 支持多种主流模型架构，包括 BERT、Qwen、Mistral、GTE 等。\n\n*   **文本嵌入**: 支持 Nomic, BERT, XLM-RoBERTa, JinaBERT, Mistral, Qwen2\u002F3 等。\n*   **重排序 (Re-Rankers)**: 支持 BGE Reranker, GTE Reranker 等。\n*   **序列分类**: 支持 CamemBERT, XLM-RoBERTa 等。\n\n具体支持的模型列表可查阅 [Hugging Face Model Hub](https:\u002F\u002Fhuggingface.co\u002Fmodels?search=text-embeddings-inference)。","某电商客服团队搭建智能工单分类系统，需实时将海量历史工单转化为向量存入数据库。\n\n### 没有 text-embeddings-inference 时\n- 直接使用 Hugging Face pipeline 封装接口，单条请求平均耗时超过 200ms，影响用户体验。\n- 无法自动聚合多个请求，GPU 计算资源闲置严重，扩容成本居高不下。\n- 模型权重加载慢，服务启动时间长，频繁发布版本时业务中断风险大。\n- 缺乏分布式追踪能力，线上出现延迟抖动时难以定位是网络还是模型问题。\n\n### 使用 text-embeddings-inference 后\n- text-embeddings-inference 优化了底层算子，首字延迟降至 50ms 以下，查询体验流畅。\n- 启用 Token 级动态批处理，在保持低延迟的同时最大化 GPU 吞吐量，节省硬件成本。\n- 支持 Docker 镜像快速启动，结合 Safetensors 格式，服务冷启动仅需数秒。\n- 原生集成 Prometheus 指标与 OpenTelemetry，帮助运维实时掌握 QPS、延迟等关键数据。\n\ntext-embeddings-inference 通过极致优化的推理引擎，让企业级向量检索服务既快又稳。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_text-embeddings-inference_2f213ebd.png","huggingface","Hugging Face","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhuggingface_90da21a4.png","The AI community building the future.",null,"https:\u002F\u002Fhuggingface.co\u002F","https:\u002F\u002Fgithub.com\u002Fhuggingface",[84,88,92,96,100,104,108],{"name":85,"color":86,"percentage":87},"Rust","#dea584",87.5,{"name":89,"color":90,"percentage":91},"Python","#3572A5",10.6,{"name":93,"color":94,"percentage":95},"JavaScript","#f1e05a",1.1,{"name":97,"color":98,"percentage":99},"Dockerfile","#384d54",0.4,{"name":101,"color":102,"percentage":103},"Shell","#89e051",0.2,{"name":105,"color":106,"percentage":107},"Makefile","#427819",0.1,{"name":109,"color":110,"percentage":107},"Nix","#7e7eff",4652,376,"2026-04-04T21:24:15","Apache-2.0","Linux, macOS","非必需，使用 GPU 需 NVIDIA 显卡，驱动需兼容 CUDA 12.2+","未说明",{"notes":119,"python":117,"dependencies":120},"需安装 NVIDIA Container Toolkit 以启用 GPU；支持 Apple Silicon (Metal) 本地执行；建议通过挂载卷共享模型数据以避免重复下载；支持动态批处理及 OpenTelemetry 分布式追踪。",[121,122,123,124,125,126],"Flash Attention","Candle","cuBLASLt","Safetensors","ONNX","Transformers",[13,26,14,15],[129,130,76,131,132],"ai","embeddings","llm","ml",8,"2026-03-27T02:49:30.150509","2026-04-06T05:36:49.525309",[137,142,146,150,155,159],{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},1328,"如何在 Text Embeddings Inference 中支持 ModernBERT 模型？","如果遇到 `Model is not supported: unknown variant 'modernbert'` 错误，需要更新底层依赖。评论指出应将 Cargo.toml 中的 `OlivierDehaene\u002Fcandle` 切换为 `huggingface\u002Fcandle`，因为后者在 1 月 12 日添加了 ModernBERT 支持。同时需注意 Docker 构建时的依赖特征匹配问题。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fissues\u002F457",{"id":143,"question_zh":144,"answer_zh":145,"source_url":141},1329,"Docker 运行时出现 `Model backend is not healthy` 及 CUDA 符号错误怎么办？","这通常与 CUDA 环境配置有关。建议升级 Nvidia 驱动和 CUDA Toolkit（如至 12.8），并在 Dockerfile 中调整 CUDA 版本重新编译。确保 GPU 架构设置正确，例如 RTX A6000 (Ampere) 应设置 `CUDA_COMPUTE_CAP=86`。",{"id":147,"question_zh":148,"answer_zh":149,"source_url":141},1330,"启动服务时提示 `Token file not found` 如何解决？","日志显示无法找到 `\u002Froot\u002F.cache\u002Fhuggingface\u002Ftoken` 文件。请确保在 Docker 挂载卷中正确提供了 Hugging Face Hub 的认证令牌文件，或者在容器内正确设置了环境变量以指向有效的 token 路径。",{"id":151,"question_zh":152,"answer_zh":153,"source_url":154},1331,"CPU 环境下编译或使用 MKL 后端报错 undefined references 怎么办？","目前文档建议使用的 CPU 后端是 `ort`，而非 `mkl`，以避免链接问题。若必须使用 MKL，请确保正确安装 Intel MKL oneAPI 
库，并注意不同平台（Windows\u002FWSL\u002FLinux）下的编译环境差异。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fissues\u002F728",{"id":156,"question_zh":157,"answer_zh":158,"source_url":154},1332,"MKL 后端编译链接失败是否有缓存清理方案？","可能是 cargo 缓存导致的问题。建议先执行 `cargo clean` 清理缓存。此外，请通过 `source \u002Fopt\u002Fintel\u002Foneapi\u002Fsetvars.sh` 设置环境变量后再进行编译，以确保链接器能找到正确的库。",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},1333,"TEI 在 CPU 上的推理速度为何显著低于 sentence-transformers？","这是一个已知的性能限制。多位用户反馈在 CPU 环境下 TEI 比简单的 sentence-transformers 服务器慢 2.4 到 3 倍。目前建议在生产环境中优先使用 GPU 以获得最佳性能，或在 CPU 场景下接受当前的吞吐量水平。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fissues\u002F31",[165,170,175,180,185,190,195,200,205,210,215,220,225,230,235,240,245,250,255,260],{"id":166,"version":167,"summary_zh":168,"released_at":169},100827,"v1.9.3","## What's Changed\r\n* Use `rust-toolchain.toml` before `rustup` on `Dockerfile-{cuda,cuda-all}` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F842\r\n* fix(backend): replace bare except with Exception in device check by @llukito in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F821\r\n* Set `version` 1.9.3 by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F849\r\n\r\n## New Contributors\r\n* @llukito made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F821\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.9.2...v1.9.3","2026-03-23T11:57:19",{"id":171,"version":172,"summary_zh":173,"released_at":174},100828,"v1.9.2","## What's Changed\r\n\r\n* Fix auto-truncate false setting by @vrdn-23 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F836\r\n* Set `pad_token_id` as nullable & add support for `rope_parameters` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F832\r\n* docs: add Homebrew installation to README by @Peredery in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F834\r\n* feat: support pplx-embed-v1 by @mkrimmel-pplx in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F824\r\n\r\n## New Contributors\r\n* @Peredery made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F834\r\n* @mkrimmel-pplx made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F824\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.9.1...v1.9.2","2026-02-25T11:17:59",{"id":176,"version":177,"summary_zh":178,"released_at":179},100829,"v1.9.1","## What's Changed\r\n\r\n### 🚨 Fix\r\n\r\n* Fix support for containers w\u002F CUDA 13.0+ by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F831\r\n> When releasing ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cuda-1.9 with CUDA 12.9 and `cuda-compat-12-9` there was an issue when running that same container on instances with CUDA 13.0+, as the `cuda-compat-12-9` set in `LD_LIBRARY_PATH` was leading to a `CUDA_ERROR_SYSTEM_DRIVER_MISMATCH 
= 803`, which is now solved with a custom entrypoint that dynamically includes the `cuda-compat` on the `LD_LIBRARY_PATH` depending on the instance CUDA version.\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.9.0...v1.9.1","2026-02-17T20:59:31",{"id":181,"version":182,"summary_zh":183,"released_at":184},100830,"v1.9.0","\u003Cimg width=\"1800\" height=\"972\" alt=\"text-embeddings-inference-v1 9 0\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ffe3751d1-1a3a-4b1f-8cf5-5c2326c14a62\" \u002F>\r\n\r\n## What's changed?\r\n\r\n### 🚨 Breaking changes\r\n\r\n* Default `HiddenAct::Gelu` to GeLU + tanh in favour of GeLU erf  by @vrdn-23 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F753\r\n\r\n> Default GeLU implementation is now GeLU + tanh approximation instead of exact GeLU (aka. GeLU erf) to make sure that the CPU and CUDA embeddings are the same (as cuBLASlt only supports GeLU + tanh), which represents a slight misalignment from how Transformers handles it, as when `hidden_act=\"gelu\"` is set in `config.json`, GeLU erf should be used. The numerical differences between GeLU + tanh and GeLU erf should have negligible impact on inference quality.\r\n\r\n* Set `--auto-truncate` to `true` by default by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F829\r\n\r\n> `--auto-truncate` now defaults to true, meaning that the sequences will be truncated to the lower value between the `--max-batch-tokens` or the maximum model length, to prevent the `--max-batch-tokens` from being lower than the actual maximum supported length.\r\n\r\n### 🎉 Additions\r\n\r\n* Add `--served-model-name` for OpenAI requests via HTTP by @vrdn-23 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F685\r\n* Extend `download_onnx` to download sharded ONNX by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F817\r\n* Add support for llama 2 by @michaelfeil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F802\r\n* Add support for blackwell architecture (sm100, sm120) by @danielealbano in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F735\r\n* Mf\u002Fadd-support-for-llama-3-and-nemotron by @michaelfeil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F805\r\n* Add support for DebertaV2 by @vrdn-23 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F746\r\n* Add bidirectional attention and projection layer support for Qwen3-based models by @williambarberjr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F808\r\n\r\n### 🐛 Fixes\r\n\r\n* Fix reading non-standard config for `past_key_values` in ONNX by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F751\r\n* Fix `TruncationDirection` to deserialize from lowercase and capitalized by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F755\r\n* Fix `sagemaker-entrypoint*` & remove SageMaker and Vertex from `Dockerfile*` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F699\r\n* Bug: 
Critical accuracy bugs for model_type=qwen2: no causal attention and wrong tokenizer by @michaelfeil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F762\r\n* Fix `config.json` reading w\u002F aliases for ORT by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F786\r\n* Fix HTTP error code for validation by @vrdn-23 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F818\r\n* Fix to acquire the permit in a blocking way by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F726\r\n* Read Hugging Face Hub token from cache if not provided by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F814\r\n* Align the `normalize` param between the gRPC and HTTP \u002Fembed interfaces by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F810\r\n\r\n### ⚡ Improvements\r\n\r\n* Serialization in tokio thread instead of blocking thread, 50% reduction in latency for small models by @michaelfeil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F767\r\n* Remove default `--model-id` argument by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F679\r\n* feat: better Tokenization # workers heuristic by @michaelfeil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F766\r\n* add faster index select kernel by @michaelfeil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F773\r\n* feat: speedup Parallel safetensors download by @michaelfeil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F765\r\n* feat: startup time: add cloned tokenzier fix, saves ~1-20s cold start time by @michaelfeil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F772\r\n* Adjust the warmup phase for CPU by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F792\r\n\r\n### 📄 Other\r\n\r\n* Skip Gemma3 tests when `HF_TOKEN` not set by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F812\r\n* Bump Rust 1.92, CUDA 12.6, Ubuntu 24.04 and add `Dockerfile-cuda-blackwell-all` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F823\r\n* Update `rustc` version to 1.92.0 by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F826\r\n* Add `use_flash_attn` for be","2026-02-17T13:42:14",{"id":186,"version":187,"summary_zh":188,"released_at":189},100831,"v1.8.3","## What's Changed\r\n\r\n### Bug Fixes\r\n\r\n* Fix error code for empty requests by @vrdn-23 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F727\r\n* Fix the infinite loop when `max_input_length` is bigger than `max-batch-tokens` by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F725\r\n* Fix reading `modules.json` for `Dense` modules in local models by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F738\r\n\r\n### Tests, Documentation 
& Release\r\n\r\n* Add `test_gemma3.rs` for EmbeddingGemma by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F718\r\n* Fix OpenAI client usage example for embeddings by @ZahraDehghani99 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F720\r\n* Handle `HF_TOKEN` in `ApiBuilder` for `candle\u002Ftests` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F724\r\n* Fix `cargo install` commands for `candle` with CUDA by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F719\r\n* Update `version` to 1.8.3 by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F745\r\n\r\n## New Contributors\r\n* @ZahraDehghani99 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F720\r\n* @vrdn-23 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F727\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.8.2...v1.8.3","2025-10-30T09:08:18",{"id":191,"version":192,"summary_zh":193,"released_at":194},100832,"v1.8.2","## 🔧 Fixed Intel MKL Support\r\n\r\nSince Text Embeddings Inference (TEI) v1.7.0, Intel MKL support had been broken due to changes in the `candle` dependency. Neither `static-linking` nor `dynamic-linking` worked correctly, which caused models using Intel MKL on CPU to fail with errors such as:  \"Intel oneMKL ERROR: Parameter 13 was incorrect on entry to SGEMM\".\r\n\r\nStarting with v1.8.2, this issue has been resolved by fixing how the `intel-mkl-src` dependency is defined. Both features, `static-linking` and `dynamic-linking` (the default), now work correctly, ensuring that Intel MKL libraries are properly linked.\r\n\r\nThis issue occurred in the following scenarios:\r\n- Users installing `text-embeddings-router` via `cargo` with the `--feature mkl` flag. Although `dynamic-linking` should have been used, it was not working as intended.\r\n- Users relying on the CPU `Dockerfile` when running models without ONNX weights. 
In these cases, Safetensors weights were used with `candle` as backend (with MKL optimizations), instead of `ort`.\r\n\r\nThe following table shows the affected versions and containers:\r\n\r\n| Version | Image |\r\n|---------|-------|\r\n| 1.7.0   | `ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cpu-1.7.0` |\r\n| 1.7.1   | `ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cpu-1.7.1` |\r\n| 1.7.2   | `ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cpu-1.7.2` |\r\n| 1.7.3   | `ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cpu-1.7.3` |\r\n| 1.7.4   | `ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cpu-1.7.4` |\r\n| 1.8.0   | `ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cpu-1.8.0` |\r\n| 1.8.1   | `ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cpu-1.8.1` |\r\n\r\nMore details: [PR #715](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F715)\r\n\r\n**Full Changelog**: [v1.8.1...v1.8.2](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.8.1...v1.8.2)","2025-09-09T14:45:29",{"id":196,"version":197,"summary_zh":198,"released_at":199},100833,"v1.8.1","\u003Cimg width=\"1200\" height=\"648\" alt=\"text-embeddings-inference-v1 8 1-embedding-gemma(1)\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F8ad8fb64-cee4-409f-8488-1d10f5ffe995\" \u002F>\r\n\r\nToday, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.\r\n\r\n- CPU:\r\n\r\n```bash\r\ndocker run -p 8080:80 ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cpu-1.8.1 \\\r\n    --model-id google\u002Fembeddinggemma-300m --dtype float32\r\n```\r\n\r\n- CPU with ONNX Runtime:\r\n\r\n```bash\r\ndocker run -p 8080:80 ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cpu-1.8.1 \\\r\n    --model-id onnx-community\u002Fembeddinggemma-300m-ONNX --dtype float32 --pooling mean\r\n```\r\n\r\n- NVIDIA CUDA:\r\n\r\n```bash\r\ndocker run --gpus all --shm-size 1g -p 8080:80 ghcr.io\u002Fhuggingface\u002Ftext-embeddings-inference:cuda-1.8.1 \\\r\n    --model-id google\u002Fembeddinggemma-300m --dtype float32\r\n```\r\n\r\n## Notable Changes\r\n\r\n* Add support for Gemma3 (text-only) architecture\r\n* Intel updates to Synapse 1.21.3 and IPEX 2.8\r\n* Extend ONNX Runtime support in `OrtRuntime`\r\n    * Support `position_ids` and `past_key_values` as inputs\r\n    * Handle `padding_side` and `pad_token_id`\r\n\r\n## What's Changed\r\n\r\n* Adjust HPU warmup: use dummy inputs with shape more close to real scenario  by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F689\r\n* Add `extra_args` to `trufflehog` to exclude unverified results by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F696\r\n* Update GitHub templates & fix mentions to Text Embeddings Inference by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F697\r\n* Disable Flash Attention with 
`USE_FLASH_ATTENTION` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F692\r\n* Add support for `position_ids` and `past_key_values` in `OrtBackend` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F700\r\n* HPU upgrade to Synapse 1.21.3 by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F703\r\n* Upgrade to IPEX 2.8 by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F702\r\n* Parse `modules.json` to identify default `Dense` modules by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F701\r\n* Add `padding_side` and `pad_token_id` in `OrtBackend` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F705\r\n* Update `docs\u002Fopenapi.json` for v1.8.0 by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F708\r\n* Add Gemma3 architecture (text-only) by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F711\r\n* Update `version` to 1.8.1 by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F712\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.8.0...v1.8.1","2025-09-04T15:22:14",{"id":201,"version":202,"summary_zh":203,"released_at":204},100834,"v1.8.0","\u003Cimg width=\"3600\" height=\"1944\" alt=\"text-embeddings-inference-v1 8 0(2)\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F50df05b6-3821-4e2a-8de0-3e5c911b2a27\" \u002F>\r\n\r\n## Notable Changes\r\n\r\n- Qwen3 support for 0.6B, 4B and 8B on CPU, MPS, and FlashQwen3 on CUDA and Intel HPUs\r\n- NomicBert MoE support\r\n- JinaAI Re-Rankers V1 support\r\n- Matryoshka Representation Learning (MRL)\r\n- Dense layer module support (after pooling)\r\n\r\n> [!NOTE]\r\n> Some of the aforementioned changes were released within the patch versions on top of v1.7.0, whilst both Matryoshka Representation Learning (MRL) and Dense layer module support have been recently included and were not released yet.\r\n\r\n## What's Changed\r\n\r\n* [Docs] Update quick tour by @NielsRogge in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F574\r\n* Update `README.md` and `supported_models.md` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F572\r\n* Back with linting. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F577\r\n* [Docs] Add cloud run example by @NielsRogge in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F573\r\n* Fixup by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F578\r\n* Fixing the tokenization routes token (offsets are in bytes, not in by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F576\r\n* Removing requirements file. 
by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F585\r\n* Removing candle-extensions to live on crates.io by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F583\r\n* Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F586\r\n* optimize the performance of FlashBert Path for HPU by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F575\r\n* Revert \"Removing requirements file. (#585)\" by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F588\r\n* Get opentelemetry trace id from request headers by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F425\r\n* Add argument for configuring Prometheus port by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F589\r\n* Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F591\r\n* Fixing the CI (grpc path). by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F593\r\n* fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F595\r\n* enable flash mistral model for HPU device by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F594\r\n* remove optimum-habana dependency by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F599\r\n* Support NomicBert MoE by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F596\r\n* Remove duplicate short option '-p' to fix router executable by @cebtenzzre in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F602\r\n* Update `text-embeddings-router --help` output by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F603\r\n* Warmup padded models too. 
by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F592\r\n* Add support for JinaAI Re-Rankers V1 by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F582\r\n* Gte diffs by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F604\r\n* Fix the weight name in GTEClassificationHead by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F606\r\n* upgrade pytorch and ipex to 2.7 version by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F607\r\n* upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F608\r\n* Patch DistilBERT variants with different weight keys by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F614\r\n* add offline modeling for model `jinaai\u002Fjina-embeddings-v2-base-code` to avoid `auto_map` to other repository by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F612\r\n* Add mean pooling strategy for Modernbert classifier by @kwnath in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F616\r\n* Using serde for pool validation. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F620\r\n* Preparing the update to 1.7.1 by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F623\r\n* Adding suggestions to fixing missing ONNX files. 
by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F624\r\n* Add `Qwen3Model` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-em","2025-08-05T08:31:22",{"id":206,"version":207,"summary_zh":208,"released_at":209},100835,"v1.7.4","## Noticeable Changes\r\n\r\nQwen3 was not working fine on CPU \u002F MPS when sending batched requests on FP16 precision, due to the FP32 minimum value downcast (now manually set to FP16 minimum value instead) leading to `null` values, as well as a missing `to_dtype` call on the `attention_bias` when working with batches.\r\n\r\n## What's Changed\r\n\r\n* Fix Qwen3 Embedding Float16 DType by @tpendragon in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F663\r\n* Fix `fmt` by re-running `pre-commit` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F671\r\n* Update `version` to 1.7.4 by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F677\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.7.3...v1.7.4","2025-07-07T12:33:34",{"id":211,"version":212,"summary_zh":213,"released_at":214},100836,"v1.7.3","## Noticeable Changes\r\n\r\nQwen3 support included for Intel HPU, and fixed for CPU \u002F Metal \u002F CUDA.\r\n\r\n## What's Changed\r\n\r\n* Default to Qwen3 in `README.md` and `docs\u002F` examples by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F641\r\n* Fix Qwen3 by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F646\r\n* Add integration tests for Gaudi by @baptistecolle in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F598\r\n* Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F648\r\n* Fix FlashQwen3 by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F650\r\n* Make flake work on metal by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F654\r\n* Fixing metal backend. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F655\r\n* Qwen3 hpu support by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F656\r\n* change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F659\r\n* Update `version` to 1.7.3 by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F666\r\n* Add last token pooling support for ORT. 
by @tpendragon in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F664\r\n\r\n## New Contributors\r\n\r\n* @lance-miles made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F648\r\n* @tpendragon made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F664\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.7.2...v1.7.3","2025-06-30T10:54:30",{"id":216,"version":217,"summary_zh":218,"released_at":219},100837,"v1.7.2","## Notable change\r\n\r\n* Added support for Qwen3 embeddings\r\n\r\n## What's Changed\r\n* Adding suggestions to fixing missing ONNX files. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F624\r\n* Add `Qwen3Model` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F627\r\n* Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F631\r\n* Add CPU support for Qwen3-Embedding models by @randomm in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F632\r\n* refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F625\r\n* Support Qwen3 w\u002F fp32 on GPU by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F634\r\n* Preparing the release. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F639\r\n\r\n## New Contributors\r\n* @randomm made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F632\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.7.1...v1.7.2","2025-06-16T06:44:57",{"id":221,"version":222,"summary_zh":223,"released_at":224},100838,"v1.7.1","## What's Changed\r\n* [Docs] Update quick tour by @NielsRogge in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F574\r\n* Update `README.md` and `supported_models.md` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F572\r\n* Back with linting. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F577\r\n* [Docs] Add cloud run example by @NielsRogge in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F573\r\n* Fixup by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F578\r\n* Fixing the tokenization routes token (offsets are in bytes, not in by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F576\r\n* Removing requirements file. 
by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F585\r\n* Removing candle-extensions to live on crates.io by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F583\r\n* Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F586\r\n* optimize the performance of FlashBert Path for HPU by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F575\r\n* Revert \"Removing requirements file. (#585)\" by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F588\r\n* Get opentelemetry trace id from request headers by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F425\r\n* Add argument for configuring Prometheus port by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F589\r\n* Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F591\r\n* Fixing the CI (grpc path). by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F593\r\n* fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F595\r\n* enable flash mistral model for HPU device by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F594\r\n* remove optimum-habana dependency by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F599\r\n* Support NomicBert MoE by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F596\r\n* Remove duplicate short option '-p' to fix router executable by @cebtenzzre in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F602\r\n* Update `text-embeddings-router --help` output by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F603\r\n* Warmup padded models too. 
by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F592\r\n* Add support for JinaAI Re-Rankers V1 by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F582\r\n* Gte diffs by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F604\r\n* Fix the weight name in GTEClassificationHead by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F606\r\n* upgrade pytorch and ipex to 2.7 version by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F607\r\n* upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F608\r\n* Patch DistilBERT variants with different weight keys by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F614\r\n* add offline modeling for model `jinaai\u002Fjina-embeddings-v2-base-code` to avoid `auto_map` to other repository by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F612\r\n* Add mean pooling strategy for Modernbert classifier by @kwnath in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F616\r\n* Using serde for pool validation. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F620\r\n* Preparing the update to 1.7.1 by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F623\r\n\r\n## New Contributors\r\n* @NielsRogge made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F574\r\n* @cebtenzzre made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F602\r\n* @kwnath made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F616\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.7.0...v1.7.1","2025-06-03T13:38:50",{"id":226,"version":227,"summary_zh":228,"released_at":229},100839,"v1.7.0","## Notable changes\r\n\r\n- Upgrade dependencies heavily (candle 0.5 -> 0.8 and related)\r\n- Added ModernBert support by @kozistr  !\r\n\r\n## What's Changed\r\n* Moving cublaslt into TEI extension for easier upgrade of candle globally by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F542\r\n* Upgrade candle2 by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F543\r\n* Upgrade candle3 by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F545\r\n* Fixing the static-linking. 
by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F547\r\n* Fix linking bis by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F549\r\n* Make `sliding_window` for `Qwen2` optional by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F546\r\n* Optimize the performance of FlashBert on HPU by using fast mode softmax by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F555\r\n* Fixing cudarc to the latest unified bindings. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F558\r\n* Fix typos \u002F formatting in CLI args in Markdown files by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F552\r\n* Use custom `serde` deserializer for JinaBERT models by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F559\r\n* Implement the `ModernBert` model by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F459\r\n* Fixing FlashAttention ModernBert. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F560\r\n* Enable ModernBert on metal by @ivarflakstad in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F562\r\n* Fix `{Bert,DistilBert}SpladeHead` when loading from Safetensors by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F564\r\n* add related docs for intel cpu\u002Fxpu\u002Fhpu container by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F550\r\n* Update the doc for submodule. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F567\r\n* Update `docs\u002Fsource\u002Fen\u002Fcustom_container.md` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F568\r\n* Preparing for release 1.7.0 (candle update + modernbert). by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F570\r\n\r\n## New Contributors\r\n* @ivarflakstad made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F562\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.6.1...v1.7.0","2025-04-08T11:54:09",{"id":231,"version":232,"summary_zh":233,"released_at":234},100840,"v1.6.1","## What's Changed\r\n* Enable intel devices CPU\u002FXPU\u002FHPU for python backend by @yuanwu2017 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F245\r\n* add reranker model support for python backend by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F386\r\n* (FIX): CI Security Fix - branchname injection by @glegendre01 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F479\r\n* Upgrade TEI. 
by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F501\r\n* Pin `cargo-chef` installation to 0.1.62 by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F469\r\n* add `TRUST_REMOTE_CODE` param to python backend. by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F485\r\n* Enable splade embeddings for Python backend by @pi314ever in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F493\r\n* Hpu bucketing by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F489\r\n* Optimize flash bert path for hpu device by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F509\r\n* upgrade ipex to 2.6 version for cpu\u002Fxpu  by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F510\r\n* fix bug for `MaskedLanguageModel` class` by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F513\r\n* Fix double incrementing `te_request_count` metric by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F486\r\n* Add intel based images to the CI by @baptistecolle in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F518\r\n* Fix typo on intel docker image by @baptistecolle in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F529\r\n* chore: Upgrade to tokenizers 0.21.0 by @lightsofapollo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F512\r\n* feat: add support for \"model_type\": \"gte\" by @anton-pt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F519\r\n* Update `README.md` to include ONNX by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F507\r\n* Fusing both Gte Configs. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F530\r\n* Add `HF_HUB_USER_AGENT_ORIGIN` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F534\r\n* Use `--hf-token` instead of `--hf-api-token` by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F535\r\n* Fixing the tests. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F531\r\n* Support classification head for DistilBERT by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F487\r\n* add CLI flag `disable-spans` to toggle span trace logging by @obloomfield in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F481\r\n* feat: support HF_ENDPOINT environment when downloading model by @StrayDragon in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F505\r\n* Small fixup. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F537\r\n* Fix `VarBuilder` handling in GTE e.g. 
`gte-multilingual-reranker-base` by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F538\r\n* make a WA in case Bert model do not have `safetensor` file by @kaixuanliu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F515\r\n* Add missing `match` on `onnx\u002Fmodel.onnx` download by @alvarobartt in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F472\r\n* Fixing the impure flake devShell to be able to run python code. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F539\r\n* Prepare for release. by @Narsil in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F540\r\n\r\n## New Contributors\r\n* @yuanwu2017 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F245\r\n* @kaixuanliu made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F386\r\n* @Narsil made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F501\r\n* @pi314ever made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F493\r\n* @baptistecolle made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F518\r\n* @lightsofapollo made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F512\r\n* @anton-pt made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F519\r\n* @obloomfield made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F481\r\n* @StrayDragon made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F505\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.6.0...v1.6.1","2025-03-28T08:47:18",{"id":236,"version":237,"summary_zh":238,"released_at":239},100841,"v1.6.0","## What's Changed\r\n* feat: support multiple backends at the same time by @OlivierDehaene in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F440\r\n* feat: GTE classification head by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F441\r\n* feat: Implement GTE model to support the non-flash-attn version by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F446\r\n* feat: Implement MPNet model (#363) by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F447\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.5.1...v1.6.0","2024-12-13T15:52:59",{"id":241,"version":242,"summary_zh":243,"released_at":244},100842,"v1.5.1","## What's Changed\r\n* Download `model.onnx_data` by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F343\r\n* Rename 'Sentence Transformers' to 'sentence-transformers' in docstrings by @Wauplin in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F342\r\n* fix: add serde default for truncation direction by @drbh in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F399\r\n* fix: metrics unbounded memory by @OlivierDehaene in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F409\r\n* Fix to allow health check w\u002Fo auth by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F360\r\n* Update `ort` crate version to `2.0.0-rc.4` to support onnx IR version 10 by @kozistr in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F361\r\n* adds curl to fix healthcheck by @WissamAntoun in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F376\r\n* fix: use num_cpus::get to check as get_physical does not check cgroups by @OlivierDehaene in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F410\r\n* fix: use status code 400 when batch is empty by @OlivierDehaene in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F413\r\n* fix: add cls pooling as default for BERT variants by @OlivierDehaene in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F426\r\n* feat: auto limit string if truncate is set by @OlivierDehaene in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F428\r\n\r\n## New Contributors\r\n* @Wauplin made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F342\r\n* @XciD made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F345\r\n* @WissamAntoun made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F376\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fcompare\u002Fv1.5.0...v1.5.1","2024-11-05T15:17:01",{"id":246,"version":247,"summary_zh":248,"released_at":249},100843,"v1.5.0","## Notable Changes\r\n\r\n- ONNX runtime for CPU deployments: greatly improve CPU deployment throughput\r\n- Add `\u002Fsimilarity` route\r\n\r\n## What's Changed\r\n* tokenizer max limit on input size by @ErikKaum in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F324\r\n* docs: air-gapped deployments by @OlivierDehaene in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F326\r\n* feat(onnx): add onnx runtime for better CPU perf by @OlivierDehaene in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F328\r\n* feat: add `\u002Fsimilarity` route by @OlivierDehaene in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F331\r\n* fix(ort): fix mean pooling by @OlivierDehaene in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F332\r\n* chore(candle): update flash attn by @OlivierDehaene in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F335\r\n* v1.5.0 by @OlivierDehaene in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-embeddings-inference\u002Fpull\u002F336\r\n\r\n## New 
# v1.4.0 (released 2024-07-02)

## Notable Changes
- Cuda support for the Qwen2 model architecture

## What's Changed
* feat(candle): support Qwen2 on Cuda by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/316
* fix(candle): fix last token pooling

**Full Changelog**: https://github.com/huggingface/text-embeddings-inference/compare/v1.3.0...v1.4.0

# v1.3.0 (released 2024-06-28)

## Notable changes
- New truncation direction parameter
- Cuda support for [JinaCode](https://huggingface.co/jinaai/jina-embeddings-v2-base-code) model architecture
- Cuda support for [Mistral](https://huggingface.co/Salesforce/SFR-Embedding-2_R) model architecture
- Cuda support for [Alibaba GTE](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) model architecture
- New prompt name parameter: you can now add a prompt name to the body of your request to add a pre-prompt to your input, based on the Sentence Transformers configuration. You can also set a default prompt / prompt name to always add a pre-prompt to your requests. (A request sketch follows this list.)
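As a sketch of how the two new request parameters combine, the snippet below sends an `/embed` request that truncates from the left and applies a named pre-prompt. The parameter names follow the release notes, but the accepted values (`"Left"`/`"Right"`, the `"query"` prompt name) depend on your model's Sentence Transformers configuration and should be treated as assumptions.

```python
# Hedged sketch of the v1.3.0 truncation-direction and prompt-name parameters.
# ASSUMPTIONS: TEI server on localhost:8080; "Left"/"Right" as direction values;
# a prompt named "query" defined in the model's Sentence Transformers config.
import requests

payload = {
    "inputs": "What is the capital of France?",
    "truncate": True,                # clip inputs that exceed the model's max length
    "truncation_direction": "Left",  # drop tokens from the start rather than the end
    "prompt_name": "query",          # prepend the pre-prompt registered under "query"
}
resp = requests.post("http://localhost:8080/embed", json=payload, timeout=30)
resp.raise_for_status()
embedding = resp.json()[0]  # /embed returns one vector per input string
print(f"dimension: {len(embedding)}")
```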
## What's Changed
* Ci migration to K8s by @glegendre01 in https://github.com/huggingface/text-embeddings-inference/pull/269
* chore: map compute_cap from GPU name by @haixiw in https://github.com/huggingface/text-embeddings-inference/pull/276
* chore: cover Nvidia T4/L4 GPU by @haixiw in https://github.com/huggingface/text-embeddings-inference/pull/284
* feat(ci): add trufflehog secrets detection by @McPatate in https://github.com/huggingface/text-embeddings-inference/pull/286
* Community contribution code of conduct by @LysandreJik in https://github.com/huggingface/text-embeddings-inference/pull/291
* Update README.md by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/277
* Upgrade tokenizers to 0.19.1 to deal with breaking change in tokenizers by @scriptator in https://github.com/huggingface/text-embeddings-inference/pull/266
* Add env for OTLP service name by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/285
* Fix CI build timeout by @fxmarty in https://github.com/huggingface/text-embeddings-inference/pull/296
* fix(router): payload limit was not correctly applied by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/298
* feat(candle): better cuda error by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/300
* feat(router): add truncation direction parameter by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/299
* Support for Jina Code model by @patricebechard in https://github.com/huggingface/text-embeddings-inference/pull/292
* feat(router): add base64 encoding_format for OpenAI API by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/301 (see the decoding sketch after this section)
* fix(candle): fix FlashJinaCodeModel by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/302
* fix: use malloc_trim to cleanup pages by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/307
* feat(candle): add FlashMistral by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/308
* feat(candle): add flash gte by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/310
* feat: add default prompts by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/312
* Add optional CORS allow any option value in http server cli by @kir-gadjello in https://github.com/huggingface/text-embeddings-inference/pull/260
* Update `HUGGING_FACE_HUB_TOKEN` to `HF_API_TOKEN` in README by @kevinhu in https://github.com/huggingface/text-embeddings-inference/pull/263
* v1.3.0 by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/313

## New Contributors
* @haixiw made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/276
* @McPatate made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/286
* @LysandreJik made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/291
* @michaelfeil made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/277
* @scriptator made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/266
* @fxmarty made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/296
* @patricebechard made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/292
* @kir-gadjello made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/260
* @kevinhu made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/263

**Full Changelog**: https://github.com/huggingface/text-embeddings-inference/compare/v1.2.3...v1.3.0
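The base64 `encoding_format` flagged in the list above shrinks the JSON response for large batches; the sketch below shows one way to request and decode it. The `/v1/embeddings` path and the float32 little-endian layout follow the OpenAI convention and are assumptions here, not guarantees of this release.

```python
# Hedged sketch of the base64 encoding_format on the OpenAI-compatible route (v1.3.0).
# ASSUMPTIONS: /v1/embeddings as the route path; decoded bytes are little-endian
# float32, per the OpenAI convention -- confirm both against your deployment.
import base64
import struct

import requests

payload = {
    "input": "Hello, world!",
    "model": "tei",                # informational: TEI serves the model it was launched with
    "encoding_format": "base64",   # compact string instead of a JSON float array
}
resp = requests.post("http://localhost:8080/v1/embeddings", json=payload, timeout=30)
resp.raise_for_status()
b64 = resp.json()["data"][0]["embedding"]
raw = base64.b64decode(b64)
vector = struct.unpack(f"<{len(raw) // 4}f", raw)  # unpack float32 values
print(f"decoded {len(vector)} dimensions")
```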
# v1.2.3 (released 2024-04-25)

## What's Changed
* fix limit peak memory to build cuda-all docker image by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/246

**Full Changelog**: https://github.com/huggingface/text-embeddings-inference/compare/v1.2.2...v1.2.3